BRAIN CONNECTIVITY ANALYSIS USING INFORMATION THEORY AND STATISTICAL SIGNAL PROCESSING

By

Zhe Wang

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering — Doctor of Philosophy

2017

ABSTRACT

BRAIN CONNECTIVITY ANALYSIS USING INFORMATION THEORY AND STATISTICAL SIGNAL PROCESSING

By

Zhe Wang

Connectivity between different brain regions generates our minds. Existing work on brain network analysis has mainly been focused on the characterization of connections between the regions in terms of connectivity and causality. Connectivity measures the dependence between regional brain activities; causality analysis aims to determine the directionality of information flow among the functionally connected brain regions, and to find the relationship between causes and effects. Traditionally, the study of connectivity and causality has largely been limited to linear relationships. In this dissertation, as an effort to achieve more accurate characterization of connections between brain regions, we aim to go beyond the linear model and develop innovative techniques for both non-directional and directional connectivity analysis. Note that due to the variability in the brain connectivity of each individual, the connectivity between two brain regions alone may not be sufficient for brain function analysis. For this reason, we also conduct network connectivity pattern analysis, so as to reveal more in-depth information.

First, we characterize non-directional connectivity using mutual information (MI). In recent years, MI has gradually emerged as an alternative metric for brain connectivity, since it measures both linear and non-linear dependence between two brain regions, while the traditional Pearson correlation only measures linear dependence. We develop an innovative approach to estimate the MI between two functionally connected brain regions and apply it to brain functional magnetic resonance imaging (fMRI) data. It is shown that, on average, cognitively normal subjects show larger mutual information between critical regions than Alzheimer's disease (AD) patients.

Second, we develop new methodologies for brain causality analysis based on directed information (DI). Traditionally, brain causality analysis has been based on the well-known Granger Causality (GC) framework. The validity of GC has been widely recognized. However, it has also been noticed that GC relies heavily on the linear prediction method. When there exist strong nonlinear interactions between two regions, GC analysis may lead to invalid results. In this research, (i) we develop an innovative framework for causality analysis based on directed information (DI), which reflects the information flow from one region to another and has no modeling constraints on the data. It is shown that DI based causality analysis is effective in capturing both linear and non-linear causal relationships. (ii) We show the conditional equivalence between the DI framework and Friston's dynamic causal modeling (DCM), and reveal the relationship between directional information transfer and cognitive state change within the brain.

Finally, based on brain network connectivity pattern analysis, we develop a robust method for the classification of AD, mild cognitive impairment (MCI) and normal control (NC) subjects under size-limited fMRI data samples.
First, we calculate the Pearson correlation coefficients between all possible ROI pairs in the selected sub-network and use them to form a feature vector for each subject. Second, we develop a regularized linear discriminant analysis (LDA) approach to reduce the noise effect. The feature vectors are then projected onto a subspace using the proposed regularized LDA, where the differences between AD, MCI and NC subjects are maximized. Finally, a multi-class AdaBoost classifier is applied to carry out the classification task. Numerical analysis demonstrates that the combination of regularized LDA and the AdaBoost classifier can increase the classification accuracy significantly.

Copyright by ZHE WANG 2017

Dedicated to my family and friends.

ACKNOWLEDGMENTS

I would like to take this opportunity to express my sincere appreciation for all the support and encouragement that have led to the completion of this dissertation. I am greatly indebted to my advisor, Prof. Tongtong Li, for her continuous support, instruction, and encouragement throughout my PhD studies at Michigan State University. It is a great honor to have worked with Dr. Li. This work would not have been possible without her guidance. I would also like to thank Prof. Hassan Khalil, Prof. David C. Zhu, Prof. Taosheng Liu, and Prof. Dean Aslam for serving on my committee. I am deeply grateful to them for their valuable experience, experimental support and insightful discussions throughout my PhD program. Special thanks go to my colleagues in the Broadband Access and Wireless Communication (BAWC) lab for their insightful academic input: Mai Abdelhakim, Tianlong Song, Ahmed Alahmadi, Zhaoxi Fang, Yuan Liang, Yu Zheng, and Run Tian. Their generous help made my study more enjoyable. I am most grateful to my family and friends for their tremendous support and encouragement. This dissertation would not have been accomplished without their love, care, patience, and support.

TABLE OF CONTENTS

LIST OF FIGURES

Chapter 1 Introduction
1.1 Overview of Brain Connectivity Analysis
1.2 Revisit of the Existing Work
1.2.1 Non-directional Functional Connectivity Analysis of Brain Regions
1.2.2 Directional Functional Connectivity Analysis of Brain Regions
1.2.3 Network Connectivity Pattern Analysis of Brain Regions
1.2.4 Motivations and Problem Identification
1.3 Summary of Dissertation Contributions

Chapter 2 Brain Functional Connectivity Analysis Using Mutual Information
2.1 Introduction
2.2 Existing Approaches
2.2.1 Pearson Correlation versus Mutual Information
2.2.2 Limitations with the Existing Work on MI Estimation
2.3 The Proposed Approach for MI Estimation
2.3.1 De-correlation of Data Segments
2.3.2 Kernel-Based Estimation of Probability Density Function
2.3.3 Optimal Kernel Bandwidth Estimation
2.3.4 MI Estimation Through Monte Carlo Integration
2.4 Numerical Results
2.5 Summary

Chapter 3 Causality Analysis of fMRI Data Based on the Directed Information Theory Framework
3.1 Introduction
3.1.1 Some Representative Techniques on Causality Analysis
3.1.2 Proposed Approach: DI Based Causality Analysis
3.2 Methods
3.2.1 Core Concepts in the Directed Information Framework
3.2.2 Directed Information Calculation and Causality Analysis
3.2.3 Practical Evaluation
3.2.3.0.1 Optimal Bin Size for Time Series Digitization
3.2.3.0.2 Probability Estimation
3.3 Materials
3.3.1 Data Acquisition
3.3.2 fMRI Data Pre-processing and Analysis
3.3.3 Simulated Data
3.4 Results
3.4.1 Causality between the fMRI Data and Its Descending Simulated Data
3.4.2 Causality Analysis Based Only on the Experimental fMRI Data
3.4.3 Impact of Hemodynamics on DI Based Causality Analysis
3.4.4 Summary of Results
3.5 Summary

Chapter 4 Discrete DCM and Its Relationship with Directed Information and Granger Causality
4.1 Introduction
4.2 Discrete Dynamic Causal Modeling
4.3 The Relationship between DDCM and Directed Information
4.3.1 Directed Information Based Causality Analysis
4.3.2 DDCM and Directed Information
4.3.3 Discussions
4.4 The Relationship between DDCM and Granger Causality
4.5 Numerical Analysis
4.5.1 Data Acquisition
4.5.2 fMRI Data Pre-processing and Analysis
4.5.3 Result
4.5.4 Summary of Result
4.6 Summary

Chapter 5 Classification of Alzheimer's Disease, Mild Cognitive Impairment and Normal Control Subjects Using Resting-State fMRI Based Network Connectivity Analysis
5.1 Introduction
5.2 Regularized Linear Discriminant Analysis
5.2.1 Linear Discriminant Analysis
5.2.2 Regularized LDA
5.2.2.0.3 Shrinkage of the Mean
5.2.2.0.4 Shrinkage of the Covariance Matrix
5.2.3 LDA and the ML Estimation
5.3 Classification of AD, MCI and NC Subjects Based on Connectivity Pattern Analysis
5.3.1 ROI Sub-Network Formulation and Connectivity Pattern Analysis
5.3.2 Basic Decision Tree Construction
5.3.3 The Multi-Class AdaBoost Classifier
5.4 Numerical Analysis
5.4.1 fMRI Data Acquisition
5.4.2 Performance Comparison of Different Classification Algorithms
5.5 Summary

Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1: Comparison of probability density functions between AD patients and NC subjects using MI analysis.

Figure 2.2: Box plots for AD patients and NC subjects using MI analysis.

Figure 3.1: Directed information based test results: mutual information, directed information and the γ metric. Here x^n denotes the fMRI data, y_1^n the simulation data set I, and y_2^n the simulation data set II. (a) MI and DI between x^n and y_1^n; (b) γ versus SNR corresponding to x^n and y_1^n; (c) MI and DI between x^n and y_2^n; (d) γ versus SNR corresponding to x^n and y_2^n.

Figure 3.2: Inter-region GC test results based on fMRI data and two sets of simulation data generated from it. (a) test results for the direction x^n → y_1^n; (b) test results for the direction y_1^n → x^n; (c) test results for the direction x^n → y_2^n; (d) test results for the direction y_2^n → x^n.

Figure 3.3: Inter-region γ values of the directed information based causality analysis.

Figure 3.4: Inner-region (left-right) γ values of the directed information based causality analysis.

Figure 3.5: Inter-region Granger Causality test results.

Figure 3.6: Inner-region (left-right) Granger Causality test results.

Figure 3.7: DI under different hemodynamic response functions.

Figure 3.8: Granger Causality test results of case 2.4.

Figure 4.1: Estimation results of DDCM with the experimental fMRI data.

Figure 4.2: Results of directed information based causality analysis.

Figure 5.1: Comparison of 3-category (AD, MCI, NC) classification results.

Figure 5.2: Regularized LDA with AdaBoost classifier: classification accuracy for different categories.
Chapter 1

Introduction

In this chapter, first, a brief overview of fMRI based functional brain connectivity analysis is provided, illustrating the background of connectivity analysis. Second, different topics of functional connectivity analysis, including non-directional connectivity analysis, directional connectivity (or causality) analysis, and network connectivity pattern analysis, are presented. Finally, the major contributions of this dissertation are highlighted.

1.1 Overview of Brain Connectivity Analysis

The brain is a communication network. At the neuron level, information exchanges are achieved through communications between synapses. At the system level, different brain regions form a dynamic communication network, and connectivity between the brain regions generates our minds. In neuroscience, connectivity analysis plays a critical role since it can provide insightful information for understanding brain functions and dysfunctions. Brain researchers are increasingly looking for advanced computational analysis tools to assist them in understanding the functions and dysfunctions of specific brain connectivities. At the same time, driven by the revolution in information theory, the communications area has accumulated rich methodologies for system modeling, design, signal processing and extraction, and network characterization and evaluation. To this end: can we develop innovative computational analysis tools for brain connectivity analysis by exploiting advanced methodologies in communications? How can these tools help us understand patterns of functional connectivities? As an effort to address these problems, this research aims to take advantage of functional magnetic resonance imaging (fMRI) to develop innovative modeling and analysis methodologies for brain connectivity analysis by exploiting advanced techniques in communications, especially tools in information theory and network characterization; to apply these methodologies to brain network analysis; and to explore how the proposed techniques can help us identify connectivity problems and understand why the brain fulfills or fails a cognitive task.

1.2 Revisit of the Existing Work

In brain connectivity analysis, there are three closely related components: non-directional connectivity analysis, directional connectivity (or causality) analysis, and network connectivity pattern analysis. In this section, we revisit the existing work on these three topics.

1.2.1 Non-directional Functional Connectivity Analysis of Brain Regions

Connectivity among the brain regions during a cognitive activity helps us understand brain functions and perform disease analysis. For decades, the Pearson correlation [1] has widely been used as a quantitative metric to characterize the functional connectivity between two different brain regions. In recent years, mutual information (MI) has gradually emerged as an alternative metric for brain connectivity. The underlying argument is that the Pearson correlation coefficient only measures linear dependence, while MI measures both linear and non-linear dependence between two brain regions. Moreover, MI has a clear physical meaning: it represents the information successfully transmitted between the two brain regions under consideration. In [1], Tsai et al. used MI to build the brain activation map. They showed that MI was robust in quantifying the relationship between any two fMRI time series.
An outstanding merit of the MI approach was that it does not depend on a priori assumptions about the relationship between the protocol time line and the fMRI voxel temporal response, and yet could be as effective as the Pearson correlation for calculating activation maps. In [2], Michiel et al. applied MI to the decoding algorithm in feature selection from high dimensional data. Their results showed that, compared with the analysis of variance (ANOVA) based method, MI is efficient in selecting very few but strongly informative voxels, and meanwhile can achieve the same or even better generalization or overall performance. In [3], Chai et al. used multivariate mutual information to select voxels in decoding natural scene categories from the human brain. Their experiments showed that, compared with the classical variance-based "most active selection" method [4], MI based voxel selection could improve the decoding accuracy significantly. In [5], Gomez-Verdejo et al. used MI to identify regionally specific effects produced by a particular cognitive task, and showed that MI could confirm known functional connections identified by the Pearson correlation, and could also discover new connections.

1.2.2 Directional Functional Connectivity Analysis of Brain Regions

Causality analysis aims to find the relationship between causes and effects. It provides insightful information on how brain regions interact with each other during a cognitive task [6]. In general, causality analysis tries to determine whether the values of one time series are useful in predicting the future values of another time series. Since the 1990s, a number of frameworks have been applied to fMRI based causality analysis, including Granger Causality (GC), Bayesian Network, Dynamic Causal Modeling (DCM), Transfer Entropy (TE) and Directed Information (DI).

Granger Causality The first practical causality analysis framework was proposed by Granger in 1969 [7]. The main idea is that if two signals X_1 and X_2 form a causal relationship, then, instead of using the past values of X_2 alone, the information contained in the past values of X_1 will help to predict X_2. More specifically, the calculation of Granger Causality is based on linear prediction models. Suppose X_1^n = [X_1(1), X_1(2), ..., X_1(n)] and X_2^n = [X_2(1), X_2(2), ..., X_2(n)] are two time series observed from two brain regions, respectively. Granger Causality compares the prediction errors e_r and \tilde{e}_r in the following equations:

X_2(k+1) = \sum_{l=0}^{L-1} a_l X_2(k-l) + e_r,   (1.1)

X_2(k+1) = \sum_{l=0}^{L-1} [b_l X_1(k-l) + c_l X_2(k-l)] + \tilde{e}_r,   (1.2)

for k = 1, 2, ..., n. Here, e_r is the error of predicting X_2 based only on the previous values of X_2, and \tilde{e}_r is the error of predicting X_2 based on both the previous values of X_2 and the previous values of X_1. If \tilde{e}_r is much smaller than e_r, that is, if the introduction of the previous values of X_1 can improve the prediction accuracy, then we say there is a Granger causal relationship between X_1 and X_2.

In the literature, there has been growing interest in the use of Granger Causality analysis to identify causal interactions in neuroscience [8]. GC has been successfully applied to fMRI data, EEG measurements, as well as neural level signals [9–11]. In these pioneering works, the validity and computational simplicity of Granger Causality have been widely recognized. At the same time, it has also been noticed that GC relies on the linear prediction method.
When there exist instantaneous and/or strong nonlinear interactions between two regions, GC analysis may lead to invalid results [11].

Bayesian Network In [12], J. Pearl summarized the framework of Bayesian Network for causal inference. The argument behind it is that if a causal relationship exists between two factors X and Y, the introduction of factor X may change the distribution of the other factor Y. That is, P(Y|X) ≠ P(Y). Since 2000, analyses based on Bayesian Networks have demonstrated successful applications [13, 14]. A modified Bayesian Network has been applied to fMRI data by incorporating the vector autoregressive model used in GC [14]. From a general perspective, the vector autoregressive model based Bayesian Network framework can be regarded as a variation of the Granger Causality analysis.

Dynamic Causal Modeling In 2003, Friston proposed the framework of Dynamic Causal Modeling to describe the general interactions among a group of brain regions [15]. DCM assumes that the invisible neurostate X, the (external) input U, the observed BOLD signal Y, the parameter θ that characterizes the connectivities between two brain regions, and the independent noise Ω form a dynamic system that can be described by the following equations:

\dot{X} = f(X, U, θ)  and  Y = Λ(X) + Ω,   (1.3)

where Λ represents a cascade of differential equations which map the neurostate X to the observed BOLD signal Y. Relying on the EM algorithm, DCM has been implemented on both fMRI and EEG data [16]. In practical applications, due to its computational complexity, DCM is usually used as a confirmatory approach. That is, the users need to put forward different connectivity models and then compare them based on their likelihood evaluated under DCM [17].

Transfer Entropy Another widely applied causal measurement in neuroscience is Transfer Entropy (TE). TE was introduced in 2000 by Schreiber [18]. It measures the decrease of entropy in one signal Y after another signal X has been observed:

T_{X→Y} = H(Y_t | Y_{t-1:t-L}) − H(Y_t | Y_{t-1:t-L}, X_{t-1:t-L}),   (1.4)

in which H denotes the entropy operator, Y_{t-1:t-L} = [Y_{t-L}, ..., Y_{t-1}], and X_{t-1:t-L} = [X_{t-L}, ..., X_{t-1}]. The first exploration of applying transfer entropy to causality description was conducted by Sporns et al. on the sensorimotor network in 2006 [19]. TE has also been applied to MEG data to evaluate non-linear connectivity [20], and used on fMRI data [21] to detect the directed flow of information between brain regions. As an information theoretic framework, transfer entropy does not rely on any model assumptions about the signals. However, current algorithms for TE estimation have not been proven to converge [22]. Also, in [18, 23], it was shown that the amplitude of transfer entropy could not accurately quantify the strength of influence between brain regions.

Directed Information Directed Information is an information theoretic metric, which was first introduced by Massey when studying communication channels with feedback [24]. It measures the directed information flow from one time series X to another time series Y, denoted as I(X → Y). If I(X → Y) > I(Y → X), then we say X has more influence on Y, or X is the causal side in the connectivity. DI is a universal method. Unlike GC, which mainly relies on linear prediction theory, or linear modeling of the involved parameters, DI based causality analysis does not place any modeling constraint on the sequences to be evaluated and hence can be used to characterize more general relationships.
This advantage of DI has been reported in recent advances in causality analysis [22, 25, 26]. In [27], it was pointed out that GC analysis is effective in detecting linear or nearly linear causal relationships, but may have difficulty in capturing nonlinear causal relationships. On the other hand, DI based causality analysis is more effective in capturing both linear and nonlinear causal relationships. In [28], Liu et al. applied DI to EEG data and compared the results with those of GC. Their conclusion was that the DI based approach could be superior to GC in capturing the instantaneous and nonlinear causal relationships in EEG data. Moreover, in [29], it was shown that the Granger Causality graphs of stochastic processes can be generated from the DI framework, and the authors indicated that DI theory provides an adequate framework for connectivity inference problems in neuroscience applications. A comprehensive investigation of DI can be found in [30].

1.2.3 Network Connectivity Pattern Analysis of Brain Regions

Both directional connectivity and non-directional connectivity analysis study the connection between two brain regions. As will be shown in Chapter 4, although analysis based on pairwise connections could provide insights into the brain network, the information from merely two regions is far from enough to reveal the general functional connectivity patterns. As a result, both of the aforementioned analyses are inadequate for solving problems that involve multiple brain regions, such as the classification of Alzheimer's Disease (AD) patients, Mild Cognitive Impairment (MCI) patients and normal control (NC) subjects based on fMRI data. The underlying argument is that, due to variability in the brain connectivity of each individual, the connectivity between two brain regions alone may not provide comprehensive information for brain analysis; network connectivity pattern analysis, which looks for subtle changes in the pattern of connectivity among multiple or all regions in the sub-network, may reveal more in-depth information.

1.2.4 Motivations and Problem Identification

After revisiting the existing work, we identify the major problems in today's brain connectivity analysis as follows.

• Connections between brain regions: going beyond the linear model. Existing work on brain connectivity analysis has mainly been focused on the characterization of connections between the regions in terms of connectivity and causality [31]. Connectivity measures the dependence between regional brain activities, and tells us which brain regions are functionally connected during a cognitive task [32]. Moving one step further, causality analysis aims to determine the directionality of information flow among the functionally connected brain regions [33, 34]. Traditional connectivity analysis mainly relies on the Pearson correlation, which only measures the linear dependence between the regions [35–37]. Traditional causality analysis largely relies on the Granger Causality (GC) approach [38, 39]. Again, GC is based on the linear prediction technique [40]. It is effective in detecting linear or nearly linear causal relationships, but may have difficulty in capturing nonlinear causal relationships [40]. As can be seen, the study of connectivity and causality has largely been limited to linear relationships. For more accurate characterization of connections between brain regions, we need to go beyond the linear model.

• Relationships between representative causality analysis frameworks.
GC, DI and DCM are three representative causality analysis frameworks in brain connectivity analysis. In the literature, the relationships between GC and DCM, and between GC and DI, have been investigated. A missing link here is: what is the relationship between DCM and DI? To fill this missing link, and reveal the connection between DCM and DI, we need both theoretical and numerical analyses of the equivalence between DCM and DI in characterizing the causal relationship between two brain regions.

• Reliable classification of brain diseases based on size-limited fMRI data. Accurate distinction of AD and MCI patients from normal subjects is critical for early diagnosis and treatment of brain disorders. However, the size of fMRI data samples is generally quite limited, which has become a major bottleneck in fMRI based AD, MCI and NC classification. The underlying reason is that, when the sample size is small, most existing classifiers suffer from severe noise effects, due to both biological variability and measurement noise. New methodologies have to be developed for robust AD, MCI and NC classification.

1.3 Summary of Dissertation Contributions

The main contributions of this dissertation are summarized in the following.

In Chapter 2, we explore the estimation of mutual information to measure the non-directional connectivities between brain regions. Traditional non-directional connectivity analysis mainly relies on the Pearson correlation, which only measures the linear dependence between the regions. Motivated by this observation, in this chapter, we propose to measure the non-directional connectivities with mutual information, which captures both linear and nonlinear relationships between brain regions. An innovative MI estimator is developed. The major steps include: de-correlation of data segments, kernel-based estimation of the probability density function, and Monte Carlo integration. The proposed MI estimator is applied to experimental fMRI data obtained from Alzheimer's disease patients and normal subjects. The numerical results are consistent with clinical observations.

In Chapter 3, we conduct fMRI based causality analysis in brain connectivity by exploiting the directed information based framework. First, we introduce the core concepts in the directed information framework. Second, we present how to conduct causality analysis using directed information measures between two time series. We provide the detailed procedure on how to calculate the DI for two finite time series. The two major steps involved here are optimal bin size selection for data digitization, and probability estimation. Finally, we demonstrate the applicability of DI based causality analysis using both simulated data and experimental fMRI data, and compare the results with those of the Granger Causality analysis. Our analysis indicates that GC analysis is effective in detecting linear or nearly linear causal relationships, but may have difficulty in capturing nonlinear causal relationships. On the other hand, DI based causality analysis is more effective in capturing both linear and non-linear causal relationships. Moreover, it is observed that brain connectivity among different regions generally involves dynamic two-way information transmissions between them. Our results show that when bidirectional information flow is present, DI is more effective than GC in quantifying the overall causal relationship.

In Chapter 4, we explore discrete Dynamic Causal Modeling (DDCM) and its relationship with DI and GC.
First, we revisit DDCM, and demonstrate the relationship between DDCM and the conventional continuous-time DCM. Second, we show that under certain conditions, DDCM and DI are equivalent in characterizing the causal relationship between two brain regions. Recall that, traditionally, the accuracy of DI estimation rests on the accuracy of probability or statistics estimation, and hence requires the data length to be sufficiently long. This equivalence between DDCM and DI, in fact, also provides a simple but effective method for DI estimation under limited data length. Finally, we illustrate the similarities and differences between DDCM and GC.

In Chapter 5, we develop a reliable method for AD, MCI and NC classification that is robust with respect to size-limited fMRI data samples, by exploiting brain network connectivity pattern analysis. First, we propose a regularized LDA approach, which aims to reduce the noise effect by using two shrinkage methods. The first shrinkage method moves the estimated mean of each class towards the overall mean, and the second one shifts the estimated covariance matrix of each class towards the identity matrix. Second, we investigate the relationship between LDA-based and Maximum Likelihood (ML) based classification or decision making methods. Finally, we conduct the connectivity pattern classification of AD, MCI and NC subjects by applying the regularized LDA and AdaBoost classifier based approach. Numerical analysis shows that, in comparison with the original LDA approach [41], the regularized LDA can reduce the noise effect and increase the classification accuracy significantly. Our analysis also confirms the previous findings that the hippocampus and the isthmus of the cingulate cortex are closely involved in the development of AD and MCI.

In Chapter 6, we summarize the conclusions and present some potential directions for future research.

Chapter 2

Brain Functional Connectivity Analysis Using Mutual Information

This chapter considers measuring brain functional connectivity using mutual information (MI). First, we explain the advantage of MI based analysis over the conventional correlation based analysis. Second, we propose a novel approach for MI estimation by exploiting kernel-based probability density function (pdf) estimation and optimization under the maximum likelihood criterion. Finally, the proposed estimator is applied to real fMRI data obtained from Alzheimer's Disease (AD) patients and normal control (NC) subjects. The numerical analysis demonstrates the effectiveness of the proposed approach and shows that the MI based analysis results are consistent with clinical observations.

2.1 Introduction

Connectivity analysis based on functional Magnetic Resonance Imaging (fMRI) data helps to reveal insights on brain functioning and disease analysis [42]. For decades, an important metric in measuring functional connectivities has been the Pearson correlation coefficient. In recent years, mutual information (MI) has been applied as an alternative metric, for the reason that it measures not only the linear dependence between two time series but also non-linear relationships, and meanwhile has a clear physical meaning: it represents the information successfully transmitted over the brain links.
Since the last decade, there has been a growing interest in applying MI on fMRI data analysis. In [1], Tsai et al. used MI to build the brain activation map. They showed that MI was robust in quantifying the relationship between any two fMRI temporal response waveforms. An outstanding merit of the MI approach was that, it does not depend on the priori assumptions about the relationship between the protocol time line and the fMRI voxel temporal response, and yet could be as effective as other methods for calculating activation maps. In [2], Michiel et al. applied MI in the decoding algorithm in selecting features from high dimensional data. Their results showed that, compared to analysis of variance (ANOVA) based method, MI was efficient in selecting very few but strongly informative voxels and meanwhile achieved the same or even better generalization performance. In [43], Afshin-Pour et al. applied MI in the activation detection. They carried out experiment in real datasets for group analyses using the general linear model, and showed that MI is a more sensitive metric than the Jaccard overlap metric. In literature, a dominant approach for MI calculation has been the k nearest neighbor (kNN) estimator. This approach has been discussed throughly in [44]. The main idea is: when two sets of high dimensional data points are independently and identically distributed (iid), the estimation bias in the kNN density estimator will be demolished when calculating the KL distance, and the final estimator will be asymptotically unbiased. There are two limitations with this approach: first, the choice of k is highly empirical. The choice involves tradeoffs between the estimation bias and variance, with smaller k leads to lower bias and higher variance [44]. An empirical choice of k is √ n. In fact, previous works on MI estimation 14 did not provide any discussions on how to choose k to ensure accurate estimation. Second, applying kNN estimator on two time series could face the problem of data independence. When data points are not independently distributed, it is not guaranteed the final result would converge to the true value. In this chapter, we propose to estimate MI using a novel approach to evaluate the MI in the region level to analyze the functional connectivity of Default Mode Network (DMN) in the brain. Note that the data segments from the fMRI data are generally correlated with each other, which introduces skewness in the distribution [44]. In this chapter, first, we apply a linear transformation to the fMRI data such that the covariance matrix of the transformed data is close to the identity matrix. Second, we present an effective approach for MI estimation by exploiting kernel-based probability density function estimation and optimization under the maximum likelihood criteria. Finally, the proposed estimator is applied to true fMRI data obtained from Alzheimer’s Disease (AD) patients and normal control (NC) subjects. The numerical analysis demonstrates the effectiveness of the proposed approach and shows that the MI based analysis result is consistent with clinical observations. The rest of this chapter is organized as follows. In Section II, we outline the existing approaches used in brain connectivity analysis. In Section III, we present the proposed kernel based estimator, and the corresponding algorithms. Numerical results are provided in Section IV, and we conclude in Section V. Notation: The uppercase letters (X,Y ,...) denote random variables, and the lowercase → − letters (x,y,...) 
denote the possible values they can acquire. X = [X1 , X2 , ..., Xd ] denotes a time series vector, where Xi is the ith sample. For any x, fX (x) denotes the probability density function (pdf) of X, and fXY (x, y) the joint probability density function for (X, Y ). The log function log(∗) denotes the base 2 logarithm. 15 2.2 Existing Approaches In this section, we revisit some representative approaches in the measurement of functional connectivities, and discuss some of their limitations. 2.2.1 Pearson Correlation versus Mutual Information In fMRI studies, a widely used statistical metric in brain functional connectivity analysis is the Pearson correlation coefficient, which measures the linear dependence between two time series. It is defined as: n i=1 (Xi r= n i=1 (Xi ¯ i − Y¯ ) − X)(Y n i=1 (Yi ¯ 2 − X) − Y¯ )2 , (2.1) ¯ and Y¯ represent the mean of the two time series {X} and {Y }, respectively. For where X further statistical hypotheses testing, the Fisher’s Z-transformation is generally applied to regulate distribution: z= 1 ln 2 1+r 1−r = tanh−1 (r). (2.2) The introduction of Fisher’s transform could also be used to stabilize the variance of data points for regression based or ANOVA techniques, which has been explored in [42]. While the Pearson correlation coefficient is efficient in capturing linear correlations, it is an inaccurate measure when the regional activities in brain show non-linear characteristics. Moreover, as statistical metrics, the correlation coefficient r, as well as its Fisher transform z, do not really reflect the connection strength comprehensively. The mutual information, 16 on the other hand, denoted the information successfully transmitted through a channel, has a more clear physical meaning in measuring the strength of a connection. This motivates applying MI analysis in measuring the brain functional connectivity. The mutual information for two random variables X and Y is defined as: ∞ ∞ fXY (x, y)log I(X; Y ) = −∞ −∞ fXY (x, y) dxdy. fX (x)fY (y) (2.3) For fMRI data, which is composed of time series, the mutual information is carried out on segments rather than single points of the data. In this chapter, we choose the length of one segment, d, to be the rough duration of the Hemodynamic Response in terms of sampling periods. That is, → − → − I( X ; Y ) = Rd 2.2.2 Rd − − fXY (→ x,→ y) → − − − d− x d→ y. fXY (→ x,→ y )log → − − fX ( x )fY (→ y) (2.4) Limitations with the Existing Work on MI estimation Although MI has considerable advantages over the Pearson correlation coefficient, current work on estimating MI has its own limitations. A widely exploited algorithm to estimate mutual information of high dimensional data is built upon the k nearest neighbor (kNN) estimator. Given n samples with dimension d, this estimator calculates the pdf as: 1 Γ(d/2 + 2) 1 − f (→ xi ) = , − 2 r(→ xi )d π d/2 17 (2.5) − − in which r(→ xi ) is the Euclidean distance from → xi to its kth nearest neighboring points, and Γ(∗) the Gamma function [3]. There are several concerns related to the kNN estimator. First, the choice of k is not a well-defined problem and usually solved by heuristic techniques [45]. Second, the underlying assumption of the kNN estimator is that: the data points are identically and independently distributed (i.i.d.). Otherwise, the algorithm can not be guaranteed to converge to the true value. In fMRI studies, however, even under resting state, the time series points can not be simplified as i.i.d processes. 
2.2.2 Limitations with the Existing Work on MI Estimation

Although MI has considerable advantages over the Pearson correlation coefficient, current work on estimating MI has its own limitations. A widely exploited algorithm to estimate the mutual information of high dimensional data is built upon the k nearest neighbor (kNN) estimator. Given n samples of dimension d, this estimator calculates the pdf as:

\hat{f}(\vec{x}_i) = \frac{k}{n}\,\frac{\Gamma(d/2 + 1)}{\pi^{d/2}}\,\frac{1}{r(\vec{x}_i)^d},   (2.5)

in which r(\vec{x}_i) is the Euclidean distance from \vec{x}_i to its kth nearest neighboring point, and \Gamma(\cdot) is the Gamma function [3]. There are several concerns related to the kNN estimator. First, the choice of k is not a well-defined problem and is usually solved by heuristic techniques [45]. Second, the underlying assumption of the kNN estimator is that the data points are independently and identically distributed (i.i.d.); otherwise, the algorithm cannot be guaranteed to converge to the true value. In fMRI studies, however, even under the resting state, the time series points cannot be treated as i.i.d. processes. Third, the MI calculation in [3] is based on individual data points. However, the information contained in the time-varying hemodynamic waveforms is certainly richer than that contained in single data points. Hence, unlike the previous approaches in [3], the proposed estimation of mutual information is carried out on segments rather than single points (see Section 2.3). The segment length is roughly the same as that of one hemodynamic response. Moreover, we resort to the non-parametric kernel method to estimate the probability density function, and reduce the inter-dependence of data points by data pre-processing.

2.3 The Proposed Approach for MI Estimation

In this section, we present the proposed MI estimator. The major steps in the estimator include: de-correlation of data segments, kernel-based estimation of the probability density function, and Monte Carlo integration for MI estimation.

2.3.1 De-correlation of Data Segments

The data segments from the fMRI data are generally correlated with each other, which introduces skewness in the distribution [44]. To fix this problem, we propose to apply the whitening transform to the data segments before estimating the probability density function, such that the covariance matrix of the transformed data is close to the identity matrix.

Consider a set of d-dimensional data \vec{x}, with mean \vec{\mu} and covariance matrix \Sigma = E(\vec{x} - \vec{\mu})(\vec{x} - \vec{\mu})^T. Its eigenvalue decomposition is \Sigma = \phi\Lambda\phi^{-1}, in which \Lambda is a diagonal matrix with the eigenvalues as its diagonal elements, and \phi contains the eigenvectors of the covariance matrix. Define the whitening transform as:

\vec{x}' = A^T\vec{x},   (2.6)

where A = \phi\Lambda^{-1/2}. After the transform, the new data set has mean A^T\vec{\mu} and covariance I. In fact:

E(A^T\vec{x} - A^T\vec{\mu})(A^T\vec{x} - A^T\vec{\mu})^T = (\phi\Lambda^{-1/2})^T \cdot \phi\Lambda\phi^{-1} \cdot \phi\Lambda^{-1/2}   (2.7)
= I,   (2.8)

where we used the fact that \phi^T = \phi^{-1}. It can be seen from (2.8) that, after the linear transformation, the covariance is the identity matrix, which means the transformed data segments are linearly uncorrelated with each other.

In [44], Wang et al. also proposed a similar approach to deal with the skewness problem of the distribution. The difference between their approach and the proposed approach is that they treat the two sets of data \{\vec{X}\} and \{\vec{Y}\} jointly, with the assumption that they follow the same distribution. More specifically, instead of calculating the covariance matrix \Sigma in (2.6) for \{\vec{X}\} and \{\vec{Y}\} separately, they calculated the covariance matrix as:

\Sigma = \frac{1}{2n-1}\left[\sum_{i=1}^{n}(\vec{X}_i - \hat{\vec{\mu}})(\vec{X}_i - \hat{\vec{\mu}})^T + \sum_{i=1}^{n}(\vec{Y}_i - \hat{\vec{\mu}})(\vec{Y}_i - \hat{\vec{\mu}})^T\right],   (2.9)

where \hat{\vec{\mu}} = \frac{1}{2n}\left[\sum_{i=1}^{n}\vec{X}_i + \sum_{i=1}^{n}\vec{Y}_i\right]. We believe that this approach is not suitable in our application scenario because it requires the two sets of data to share the same statistical properties, which may not be practical for fMRI data, as hemodynamic responses differ considerably among different brain regions. For this reason, in our analysis here, we choose to analyze \{\vec{X}\} and \{\vec{Y}\} separately.
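As a minimal sketch of this de-correlation step (assuming the segments are stacked as the rows of a matrix; all names are illustrative), the whitening transform of (2.6) follows directly from the eigendecomposition of the sample covariance:

```python
import numpy as np

def whiten(segments):
    """Whitening transform of Eq. (2.6): x' = A^T x with A = phi Lambda^(-1/2),
    where Sigma = phi Lambda phi^(-1) is the eigendecomposition of the sample
    covariance of the d-dimensional segments (one segment per row)."""
    sigma = np.cov(segments, rowvar=False)        # d x d sample covariance
    eigvals, phi = np.linalg.eigh(sigma)          # symmetric: phi^T = phi^(-1)
    A = phi @ np.diag(eigvals ** -0.5)
    return segments @ A                           # row i becomes (A^T x_i)^T

rng = np.random.default_rng(1)
raw = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated
white = whiten(raw)
print(np.round(np.cov(white, rowvar=False), 2))   # approximately the identity
```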
2.3.2 Kernel-Based Estimation of Probability Density Function

Kernel-based estimation is an alternative framework to assumption-based parametric estimation. Originally, the kernel-based approach was introduced to address the problem of phase uncertainty or origin uncertainty [46]. The basic idea is to calculate the average of kernel functions K over each point that falls into a pre-specified kernel window. More specifically, given a set of data points \{\vec{\xi}_i \mid i \in [1, m]\}, the kernel estimate of the probability density function at any point x is given by:

\hat{f}(x) = \frac{1}{m}\sum_{i=1}^{m} K\left(\frac{x - \xi_i}{h}\right),   (2.10)

in which h is the bandwidth of the kernel function. For any d-dimensional data vector \vec{x} = [x_1, ..., x_j, ..., x_d], the estimator can be extended as:

\hat{f}(\vec{x}) = \frac{1}{m h_1 h_2 \cdots h_d}\sum_{i=1}^{m} K_h(\vec{x}, \vec{\xi}_i),   (2.11)

where

K_h(\vec{x}, \vec{\xi}_i) = \prod_{j=1}^{d} K\left(\frac{x_j - \xi_{ij}}{h_j}\right).   (2.12)

The kernel function K for a continuous variable x is often chosen as the Gaussian function [46]:

K(h, \xi, x) = \frac{1}{2\sqrt{\pi}}\, e^{-\frac{(x-\xi)^2}{4h^2}}.   (2.13)

2.3.3 Optimal Kernel Bandwidth Estimation

The bandwidth h of the kernel function has a significant influence on the estimation accuracy. Unlike the choice of k in the kNN estimator, here h can be chosen theoretically according to the Cross Validation Maximum Likelihood (CV-ML) criterion. More specifically, the bandwidth h is chosen to maximize the leave-one-out log likelihood function given by:

L = \sum_{i=1}^{m} \log \hat{f}_{-i}(\vec{\xi}_i),   (2.14)

in which the leave-one-out estimate \hat{f}_{-i}(\vec{\xi}_i) is defined as:

\hat{f}_{-i}(\vec{\xi}_i) = \frac{1}{(m-1) h_1 h_2 \cdots h_d}\sum_{j \neq i} K_h(\vec{\xi}_i, \vec{\xi}_j).   (2.15)

It can be shown that pdf estimation under this criterion yields a result that approaches the real density in the Kullback-Leibler entropy sense [46].

The maximization of the likelihood function L in (2.14) belongs to a well-studied category of optimization problems, and can be solved reliably using various algorithms [46, 47]. In this chapter, we choose the downhill simplex method, because the complexity of the log likelihood makes differentiation computationally infeasible, and classical second-order optimization methods like the Quasi-Newton algorithm are difficult to implement here. More details about the downhill simplex method can be found in [47]. The algorithm is implemented as follows. At the initial step, randomly construct a simplex of d + 1 vertices in the d-dimensional definition domain of the function L(h). Order the vertices according to the function values, i.e., L(h_1) ≥ L(h_2) ≥ ... ≥ L(h_{d+1}). Then, calculate the centroid point h_0 and iteratively execute the following steps: Reflection, Expansion, Contraction and Reduction.

• Reflection: A reflected point h_r is computed as h_0 + α(h_0 − h_{d+1}). If the reflected point is better than h_d but worse than h_1, then reconstruct the simplex by replacing h_{d+1} with h_r.

• Expansion: If h_r outperforms h_1, compute the expanded point h_e = h_0 + γ(h_0 − h_{d+1}). If h_e is better than h_r, construct the new simplex by replacing h_{d+1} with h_e; else, replace h_{d+1} with h_r.

• Contraction: If h_r is inferior to h_d, compute the contracted point h_c = h_0 + ρ(h_0 − h_{d+1}). Check if h_c is better than h_{d+1}. If yes, reconstruct the simplex by replacing h_{d+1} with h_c; else, execute the Reduction step.

• Reduction: Replace all but the best point with h_i = h_1 + σ(h_i − h_1).

The values of α, γ, ρ and σ are set to 1, 2, −1/2 and 1/2, respectively. The algorithm ends after a predefined number of iterations. It is guaranteed that when the function is locally smooth, an optimal solution can be reached [47].
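The sketch below puts Sections 2.3.2 and 2.3.3 together: the product Gaussian kernel of (2.12)-(2.13), the leave-one-out likelihood of (2.14)-(2.15), and SciPy's Nelder-Mead routine standing in for the downhill simplex optimizer. It is an illustration under those assumptions, not the dissertation's implementation; optimizing log-bandwidths is simply one convenient way to keep each h_j positive.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(h, xi, x):
    """Gaussian kernel of Eq. (2.13)."""
    return np.exp(-(x - xi) ** 2 / (4.0 * h ** 2)) / (2.0 * np.sqrt(np.pi))

def loo_log_likelihood(log_h, data):
    """Leave-one-out log likelihood of Eqs. (2.14)-(2.15)."""
    h = np.exp(log_h)                                      # one bandwidth per dim
    m = data.shape[0]
    diffs = data[:, None, :] - data[None, :, :]            # m x m x d differences
    k = np.prod(gaussian_kernel(h, 0.0, diffs), axis=2)    # product kernel (2.12)
    np.fill_diagonal(k, 0.0)                               # leave one out
    f_loo = k.sum(axis=1) / ((m - 1) * np.prod(h))         # Eq. (2.15)
    return np.sum(np.log(f_loo + 1e-300))                  # Eq. (2.14)

def fit_bandwidths(data):
    """Maximize the CV-ML criterion with a downhill simplex search."""
    res = minimize(lambda lh: -loo_log_likelihood(lh, data),
                   x0=np.zeros(data.shape[1]), method="Nelder-Mead")
    return np.exp(res.x)

rng = np.random.default_rng(2)
segments = rng.standard_normal((150, 3))                   # toy whitened segments
print(fit_bandwidths(segments))
```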
2.3.4 MI Estimation Through Monte Carlo Integration

The Monte Carlo integration method is used here to calculate the MI after the probability density functions have been obtained. For notational simplicity, let

i(\vec{x}, \vec{y}) = f_{XY}(\vec{x}, \vec{y})\log\frac{f_{XY}(\vec{x}, \vec{y})}{f_X(\vec{x}) f_Y(\vec{y})}.

Then, to calculate the MI I = \int_{R^d}\int_{R^d} i(\vec{x}, \vec{y})\, d\vec{x}\, d\vec{y}, the algorithm uniformly samples a finite space with a volume of V, and generates P samples \{(\vec{x}_i, \vec{y}_i), i \in [1, P]\}. The mutual information can then be estimated as:

I_P \approx \frac{V}{P}\sum_{i=1}^{P} i(\vec{x}_i, \vec{y}_i).   (2.16)

Since the definition domain of a Gaussian function is infinite, sampling the whole space is impossible. Therefore, we limit the sampling space to the interval [\mu - 3\sigma, \mu + 3\sigma], where \mu and \sigma denote the mean and standard deviation of the Gaussian function, respectively.

Convergence: It can be shown by the Law of Large Numbers (LLN) [48] that, as P goes to infinity, the approximation in (2.16) converges to the real value of I, i.e., \lim_{P \to \infty} I_P = I.
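A sketch of the Monte Carlo step in (2.16), written against generic density estimates: the three function arguments are placeholders for the kernel estimates of Sections 2.3.2-2.3.3, and a common scalar sampling box is assumed purely for brevity.

```python
import numpy as np
from scipy.stats import norm

def mc_mutual_information(f_xy, f_x, f_y, lo, hi, d, P=100_000, seed=3):
    """Monte Carlo integration of Eq. (2.16): draw P uniform samples from a
    box of volume V = (hi - lo)^(2d) and average the integrand i(x, y)."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(lo, hi, size=(P, 2 * d))
    x, y = samples[:, :d], samples[:, d:]
    pxy, px, py = f_xy(x, y), f_x(x), f_y(y)
    integrand = np.where(pxy > 0, pxy * np.log2(pxy / (px * py + 1e-300)), 0.0)
    return (hi - lo) ** (2 * d) * integrand.mean()   # (V / P) * sum_i i(x_i, y_i)

# sanity check with independent standard normals: the MI should be near zero
g = lambda a: norm.pdf(a).prod(axis=1)
print(mc_mutual_information(lambda x, y: g(x) * g(y), g, g, -3.0, 3.0, d=1))
```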
2.4 Numerical Results

In this section, we apply the proposed approach to resting state fMRI data collected from both Alzheimer's Disease patients and normal control subjects. Brain networks operate in a cohesive manner of connections between nodes. A progressive weakening trend of functional connectivities has been observed in the default mode network (DMN) in AD patients [42]. In the following, we evaluate the MI between two regions of the DMN, the posterior cingulate cortex (PCC) and the superior frontal gyrus (SFG).

In the data collection process, eleven patients with mild-to-moderate probable AD and twelve age- and education-matched healthy normal control subjects were recruited to participate in this study. The MRI experiment was conducted on a GE 3T Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. To study resting-state brain function, echo-planar images, starting from the most inferior regions of the brain, were acquired for 7 minutes with the following parameters: 38 contiguous 3-mm axial slices in an interleaved order, time of echo = 27.7 ms, time of repetition = 2500 ms, flip angle = 80°, field of view = 22 cm, matrix size = 64 × 64, ramp sampling, and with the first four data points discarded. Each volume of slices was acquired 164 times. Common pre-processing procedures on resting state fMRI data were carried out as detailed in [42]. Then, we applied the proposed approach to the pre-processed fMRI data.

Figures 2.1 and 2.2 show the calculated mutual information for connections between the PCC and the SFG. As expected, the connections show a decrease in AD patients compared to NC subjects. Figure 2.1 shows the probability density functions of the two groups. Because of the existence of outliers in each group, the two pdf curves cannot be separated completely. However, it is clear that at the group level, normal subjects show stronger connectivities than the AD patients.

Figure 2.1: Comparison of probability density functions between AD patients and NC subjects using MI analysis.

Figure 2.2: Box plots for AD patients and NC subjects using MI analysis.

Figure 2.2 shows the box plots for the two groups. It can be seen that the median shows a 35.6% decrease in AD patients compared to NC subjects. This result is consistent with previous clinical research findings. A significant difference in the MI between the PCC and the SFG was found between the two groups based on an independent two-sample t test (p = 0.04).

2.5 Summary

In this chapter, we considered the measurement of functional brain connectivity using mutual information. We proposed a novel approach for the estimation of MI, which is composed of three major components: de-correlation, kernel based estimation of the probability density function, and Monte Carlo integration for MI estimation. The analysis results obtained using the proposed method were consistent with clinical observations on the AD data sets.

Chapter 3

Causality Analysis of fMRI Data Based on the Directed Information Theory Framework

This chapter aims to conduct fMRI based causality analysis in brain connectivity by exploiting the directed information (DI) theory framework. Unlike the well-known Granger Causality (GC) analysis, which relies on the linear prediction technique, the directed information theory framework does not place any modeling constraints on the sequences to be evaluated and ensures estimation convergence. Moreover, it can be used to generate the Granger Causality graphs. In this chapter, first, we introduce the core concepts in the directed information framework. Second, we present how to conduct causality analysis using directed information measures between two time series. We provide the detailed procedure on how to calculate the DI for two finite time series. The two major steps involved here are optimal bin size selection for data digitization, and probability estimation. Finally, we demonstrate the applicability of DI based causality analysis using both simulated data and experimental fMRI data, and compare the results with those of the Granger Causality analysis. Our analysis indicates that GC analysis is effective in detecting linear or nearly linear causal relationships, but may have difficulty in capturing nonlinear causal relationships. On the other hand, DI based causality analysis is more effective in capturing both linear and non-linear causal relationships. Moreover, it is observed that brain connectivity among different regions generally involves dynamic two-way information transmissions between them. Our results show that when bidirectional information flow is present, DI is more effective than GC in quantifying the overall causal relationship.

3.1 Introduction

Causality analysis provides important information on how brain regions interact with each other to accomplish a cognitive task [6]. In general, causality analysis tries to determine whether the values of one time series X are useful in predicting the future values of another time series Y. Here, we first briefly revisit the work on causality analysis in the literature, including Granger Causality (GC), Bayesian Network, Dynamic Causal Modeling (DCM) and Transfer Entropy (TE). Then, we introduce the Directed Information (DI) framework, explain why we adopt it, and show how to apply it for causality analysis.

3.1.1 Some Representative Techniques on Causality Analysis

Granger Causality The first practical causality analysis framework was proposed by Granger in 1969 [7]. The fundamental idea is that if two signals X and Y form a causal relationship, then, instead of using the past values of Y alone, the information contained in the past (or lagged) values of X will help to predict Y. More specifically, the calculation of Granger Causality is based on autoregressive or linear prediction models. Suppose X^n = [X_1, X_2, ..., X_n] and Y^n = [Y_1, Y_2, ..., Y_n] are two time series.
The most commonly used method in Granger Causality analysis is to compare the prediction errors e_i and \tilde{e}_i in the following two models:

Y_i = \sum_{j=1}^{L} a_j Y_{i-j} + e_i,   (3.1)

Y_i = \sum_{j=1}^{L} [b_j Y_{i-j} + c_j X_{i-j}] + \tilde{e}_i.   (3.2)

Here, e_i is the error of predicting Y_i based only on the previous values of Y, (Y_{i-1}, ..., Y_{i-L}), and \tilde{e}_i is the error of predicting Y_i based on both the previous values of Y, (Y_{i-1}, ..., Y_{i-L}), and the previous values of X, (X_{i-1}, ..., X_{i-L}). In practical analysis, Granger Causality can be tested using a nested model comparison based on the F statistic [49]. If \tilde{e}_i is much smaller than e_i, that is, if the introduction of the previous values of X can improve the prediction accuracy, then we say there is a Granger causal relationship between X and Y.

Since the 1990s, there has been growing interest in the use of Granger Causality analysis to identify causal interactions in neuroscience [8]. An early exploitation of GC in neuroscience was carried out by Bernasconi et al. on electrophysiological data [50]. Their paper verified the applicability of GC for electrophysiological data, particularly EEG measurements. Goebel et al. presented an application of GC to fMRI data [51, 52]. They applied the GC approach to a dynamic sensorimotor mapping paradigm. Bressler et al. applied GC analysis to examine the blood oxygen level-dependent (BOLD) time series corresponding to the top-down control signals from the frontal and parietal cortex [53]. Hu et al. applied GC analyses to fMRI data to evaluate the causal relationships among specific brain regions, so as to understand the impact of amnesic mild cognitive impairment (aMCI) on brain connectivity [10]. Wen et al. carried out simulations on neural signals to examine GC at both the neural level (neural GC) and the fMRI level (fMRI GC) [54, 55]. David et al. applied GC (together with Dynamic Causal Modeling) to a combination of fMRI and EEG data [11]. Their experiments showed that, as the hemodynamics (i.e., the blood flow or circulation) vary from region to region, GC may not be applied directly to the fMRI signals. However, when the hemodynamic effects were explicitly removed, the GC test could perform effective causality analysis for linear relationships.

As a well-known technique, the validity and computational simplicity of Granger Causality have been widely recognized. However, it has also been noticed that GC relies heavily on the linear prediction method. When there exist instantaneous and/or strong nonlinear interactions between two regions, GC analysis may lead to invalid results [11]. To address this problem, several approaches to nonlinear Granger Causality have been proposed in the literature. For example, in [56], Bezruchko et al. proposed an autoregression model constructed in the form of a polynomial. More recently, Marinazzo et al. proposed a method to generalize GC to the nonlinear case using the kernel technique [57]. The copula approach has been applied for GC assessment in [9, 58]. A comprehensive discussion of nonlinear GC can be found in [59].
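For concreteness, the sketch below implements the nested-model comparison of (3.1)-(3.2) together with the F test mentioned above, on a toy pair of series in which y lags x by one sample; the routine and its names are illustrative rather than the dissertation's code.

```python
import numpy as np
from scipy.stats import f as f_dist

def granger_f_test(x, y, L=3):
    """Nested-model GC test of Eqs. (3.1)-(3.2): does the past of x help
    predict y beyond y's own past?  Returns the F statistic and p-value."""
    n = len(y)
    target = y[L:]
    past_y = np.column_stack([y[L - j:n - j] for j in range(1, L + 1)])
    past_x = np.column_stack([x[L - j:n - j] for j in range(1, L + 1)])
    # restricted model (3.1): y's own past only
    beta_r = np.linalg.lstsq(past_y, target, rcond=None)[0]
    rss_r = np.sum((target - past_y @ beta_r) ** 2)
    # full model (3.2): y's past plus x's past
    both = np.hstack([past_y, past_x])
    beta_f = np.linalg.lstsq(both, target, rcond=None)[0]
    rss_f = np.sum((target - both @ beta_f) ** 2)
    df1, df2 = L, len(target) - 2 * L
    F = ((rss_r - rss_f) / df1) / (rss_f / df2)
    return F, f_dist.sf(F, df1, df2)

rng = np.random.default_rng(4)
x = rng.standard_normal(500)
y = np.concatenate(([0.0], x[:-1])) + 0.5 * rng.standard_normal(500)  # y lags x
print(granger_f_test(x, y))   # large F, small p: x Granger-causes y
print(granger_f_test(y, x))   # no evidence in the reverse direction
```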
Since the 1990s, there has been growing interest in the use of Granger Causality analysis to identify causal interactions in neuroscience [8]. An early exploitation of GC in neuroscience was carried out by Bernasconi et al. on electrophysiological data [50]. Their paper verified the applicability of GC for electrophysiological data, particularly EEG measurements. Goebel et al. presented an application of GC to fMRI data [51, 52]. They applied the GC approach to a dynamic sensorimotor mapping paradigm. Bressler et al. applied GC analysis to examine the blood oxygen level-dependent (BOLD) time series corresponding to the top-down control signals from the frontal and parietal cortex [53]. Hu et al. applied GC analyses on fMRI data to evaluate the causal relationship among specific brain regions, so as to understand the impact of amnesic mild cognitive impairment (aMCI) on brain connectivity [10]. Wen et al. carried out simulations on neural signals to examine GC at both the neural level (neural GC) and the fMRI level (fMRI GC) [54, 55]. David et al. applied GC (together with Dynamic Causal Modeling) to a combination of fMRI and EEG data [11]. Their experiments showed that as the hemodynamics (i.e., the blood flow or the circulation) vary from region to region, GC may not be applied directly to the fMRI signals. However, when the hemodynamic effects were explicitly removed, the GC test can perform effective causality analysis for linear relationships.

As a well-known technique, the validity and computational simplicity of Granger Causality have been widely recognized. However, it has also been noticed that GC relies heavily on the linear prediction method. When there exist instantaneous and/or strong nonlinear interactions between two regions, GC analysis may lead to invalid results [11]. To address this problem, several approaches to nonlinear Granger Causality have been proposed in the literature. For example, in [56], Bezruchko et al. proposed an autoregression model constructed in the form of a polynomial. More recently, Marinazzo et al. proposed a method to generalize GC to the nonlinear case using the kernel technique [57]. The copula approach has been applied for GC assessment in [9, 58]. A comprehensive discussion on nonlinear GC can be found in [59].

Bayesian Network In [12], J. Pearl summarized the framework of Bayesian Networks for causal inference. The argument behind it is that if a causal relationship exists between two factors $X$ and $Y$, the introduction of factor $X$ may change the distribution of the other factor $Y$; that is, $P(Y|X) \neq P(Y)$. Since 2000, analyses based on Bayesian Networks have demonstrated successful applications [13, 14]. Luessi et al. [14] modified the Bayesian Network and applied it to fMRI data by incorporating the vector autoregressive model used in GC. Their result was consistent with that of the GC analysis. From a general perspective, the vector autoregressive model based Bayesian Network framework can be regarded as a variation of the Granger Causality analysis.

Dynamic Causal Modeling In 2003, Friston proposed the framework of Dynamic Causal Modeling (DCM) to describe the general interactions among a group of brain regions [15]. DCM assumes that the invisible neurostate $x$, the (external) input $u$, the parameter $\theta$ that characterizes the connection between two brain regions, and the independent noise $\omega$ form a complex dynamic system that can be described by the following equations:

$$\dot{x} = f(x, u, \theta) \quad \text{and} \quad y = L(\theta, h(x)) + \omega, \qquad (3.3)$$

where $h(x)$ represents a cascade of differential equations which connect the neurostate to changes in blood volume and deoxyhemoglobin content, and $L$ represents a non-linear output function which relates $\theta$ and $h(x)$ to the observed BOLD signal $y$ [16]. With the help of the EM algorithm, DCM has been attempted on both fMRI and EEG data [16]. Some concerns with this framework are [17]: (1) as the observation model in DCM is non-linear, estimating the latent variable that describes the neuronal activity can be quite difficult; (2) DCM is a confirmatory approach, for which the users have to start with different connectivity-describing models, and then rank them based on an approximation of the model evidences.

Transfer Entropy Another widely applied causal measurement in neuroscience is Transfer Entropy (TE). TE was introduced in 2000 by Schreiber [18]. It measures the decrease of entropy in one signal $Y$ after another signal $X$ has been observed:

$$T_{X \to Y} = H(Y_t \,|\, Y_{t-1:t-L}) - H(Y_t \,|\, Y_{t-1:t-L}, X_{t-1:t-L}), \qquad (3.4)$$

in which $H$ denotes the entropy operator, $Y_{t-1:t-L} = [Y_{t-L}, \ldots, Y_{t-1}]$, and $X_{t-1:t-L} = [X_{t-L}, \ldots, X_{t-1}]$. Similar to GC, TE measures how much additional information the past values of process $X$ contain about the future observations of $Y$, given that we already know the past values of $Y$. The quantity measured by TE is the amount of predictive information rather than the size of the causal effect or coupling strength. In [23], it was pointed out that TE can differentiate between interactions in the process of information storage and those in the process of information transfer. The first exploration of applying transfer entropy to causality description was conducted by Sporns et al. on the sensorimotor network in 2006 [19]. Vicente et al. [18] applied TE to magnetoencephalography (MEG) data and showed that TE was an effective metric for non-linear connectivity, especially for sensor-level MEG signals. Lizier et al. [21] developed a framework that combined multivariate mutual information and transfer entropy. They used TE to analyze fMRI time series to detect the directed flow of information between brain regions involved in a visuo-motor tracking task. As an information theoretic framework, a major advantage of Transfer Entropy is that it does not rely on any model assumptions about the signals. However, current algorithms for TE estimation have not been proved to be convergent [22]. Also, in [18, 23], it was shown that the amplitude of transfer entropy cannot accurately quantify the strength of influence between brain regions.
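As an illustration of (3.4), the following is a small plug-in sketch for discrete (already binned) sequences with lag $L = 1$, estimating all entropies from empirical counts; the helper names are hypothetical, and a practical TE estimator would also need bias correction for short sequences.

```python
# A plug-in sketch of the transfer entropy in (3.4) for discrete sequences, L = 1.
import numpy as np
from collections import Counter

def entropy_from_counts(counts):
    """Empirical entropy (bits) from a Counter of observed symbols/tuples."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def transfer_entropy(x, y):
    """T_{X->Y} = H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1})."""
    yt, y1, x1 = y[1:], y[:-1], x[:-1]
    # H(Y_t | Y_{t-1}) = H(Y_t, Y_{t-1}) - H(Y_{t-1})
    h_y_cond = entropy_from_counts(Counter(zip(yt, y1))) \
             - entropy_from_counts(Counter(y1))
    # H(Y_t | Y_{t-1}, X_{t-1}) = H(Y_t, Y_{t-1}, X_{t-1}) - H(Y_{t-1}, X_{t-1})
    h_yx_cond = entropy_from_counts(Counter(zip(yt, y1, x1))) \
              - entropy_from_counts(Counter(zip(y1, x1)))
    return h_y_cond - h_yx_cond
```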
3.1.2 Proposed Approach: DI Based Causality Analysis

In the discussions above, we revisited some representative methods for causality analysis. These methods are either limited to an existing model of the time series under investigation, or cannot guarantee convergence or validity in practical estimation. In an effort to overcome these weaknesses, we propose to adopt the directed information theory framework.

Directed Information Given two random sequences $X^n$ and $Y^n$, the directed information from $X^n$ to $Y^n$ is defined as a sum of conditional mutual information terms:

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \,|\, Y^{i-1}), \qquad (3.5)$$

where $X^i = [X_1, X_2, \ldots, X_i]$ and $Y^i = [Y_1, Y_2, \ldots, Y_i]$. First introduced by Massey to study communication channels with feedback [24], DI has been proved to be an effective tool for network analysis in communications [60] and neuroscience [25, 26]. As an information theoretic metric, DI shares some similarities with Transfer Entropy: neither relies on any model assumptions about the signals. Moreover, it was pointed out in [29] and [23] that, as time goes to infinity, DI may approximate the rate of transfer entropy.

The DI framework is adopted here for the following reasons: (i) It is a universal method. Unlike GC, which mainly relies on linear prediction theory, or linear modeling of the involved parameters, DI based causality analysis does not impose any modeling constraint on the sequences to be evaluated, and hence can be used to characterize more general relationships. (ii) It is well defined with a specific physical meaning. While the amplitude of TE cannot reflect the strength of dependence between brain regions, the amplitude of DI reflects the information flow from $X^n$ to $Y^n$, and hence has a clear physical meaning. (iii) It has been shown in [29] that the Granger Causality graphs can be obtained using the DI framework. As can be seen, directed information theory provides an adequate framework for the connectivity inference problems in neuroscience applications.

In the literature, there has been a limited number of references on applications of directed information in neuroscience [22, 28]. Quinn et al. applied DI to the study of neuron spike recordings by introducing a Markov process model for the signal. Liu et al. applied DI to EEG data and compared the results with those of GC. Their conclusion was that the DI based approach could be superior to GC in capturing instantaneous and nonlinear causal relationships in EEG data.

Chapter Overview In this chapter, first, we introduce the core concepts in the directed information framework. Second, we present how to conduct causality analysis using directed information measures between two time series. We provide the detailed procedure on how to calculate the DI for two finite time series. The two major steps involved here are optimal bin size selection for data digitization, and probability estimation. Finally, we demonstrate the effectiveness of directed information based causality analysis using both simulated data and experimental fMRI data, and compare the results with those of the Granger Causality analysis. For practical evaluation, we collected both stimulation fMRI data with a well defined block-design scene-object fMRI paradigm [61, 62] and resting-state fMRI data. Our analysis indicates that GC analysis is effective in detecting linear or nearly linear causal relationships, but has difficulty in capturing nonlinear causal relationships. On the other hand, DI based causality analysis can be used to capture both linear and non-linear causal relationships.
Moreover, it is observed that brain connectivity among different regions generally involves dynamic two-way information transmissions between them. Our results show that when bidirectional information flow is involved, DI is more effective than GC in quantifying the overall causal relationship.

3.2 Methods

Let uppercase letters ($X$, $Y$, ...) denote random variables, and lowercase letters ($x$, $y$, ...) the possible values they can acquire. For $n \in \mathbb{N}$, define $X^n = [X_1, X_2, \ldots, X_n]$, where $X_i$ is the $i$th sample. Each $X_i$ is a random variable taking values in the same finite alphabet $\Omega$, with cardinality $|\Omega|$. For any $x_i \in \Omega$, $P(x_i) = Prob\{X_i = x_i\}$ denotes the probability that $X_i$ takes the value $x_i$, and $P_{X_i|X^{i-1}}(x_i|x^{i-1}) = Prob\{X_i = x_i \,|\, X^{i-1} = [x_1, x_2, \ldots, x_{i-1}]\}$ denotes the conditional probability that the current sample $X_i$ is $x_i$, given that the previously observed sequence is $x^{i-1} = [x_1, x_2, \ldots, x_{i-1}]$. Unless otherwise stated, the log function $\log(\cdot)$ denotes the base-2 logarithm.

3.2.1 Core Concepts in the Directed Information Framework

Entropy A fundamental concept in information theory is entropy, which is a measure of uncertainty. For a random variable $X$, the entropy of $X$ is defined as:

$$H(X) = -\sum_{x_i \in \Omega} P(x_i) \log P(x_i). \qquad (3.6)$$

The entropy of a random variable $X$ represents the minimum average number of bits needed for lossless encoding of each symbol of $X$. For a random sequence $X^n$, the entropy can be calculated according to the chain rule:

$$H(X^n) = H(X_1) + H(X_2|X_1) + \ldots + H(X_n|X^{n-1}) \qquad (3.7)$$
$$= \sum_{i=1}^{n} H(X_i|X^{i-1}). \qquad (3.8)$$

Mutual Information Mutual information (MI) measures the decrease of uncertainty about one random variable after observing another one. The mutual information between two random variables $X$ and $Y$ is defined as:

$$I(X; Y) = H(Y) - H(Y|X) \qquad (3.9)$$
$$= \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}, \qquad (3.10)$$

where $H(Y)$ is the total uncertainty (or information) in $Y$, and $H(Y|X)$ is the uncertainty (or information) left in $Y$ after $X$ is observed. It is clear that mutual information is a symmetric measurement: for two random variables $X$ and $Y$, $I(X; Y) = I(Y; X)$.

Mutual information measures the dependence between two random variables. If $X = Y$, then $H(Y|X) = 0$ and $I(X; Y) = H(X) = H(Y)$. If $X$ and $Y$ are independent, then $I(X; Y) = 0$. Unlike the Pearson correlation coefficient, which only measures the linear dependence between two random variables, mutual information captures both linear and non-linear dependence. For this reason, in recent years, there has been growing interest in applying MI in neuroscience to measure the coupling strength among different brain regions or groups of neurons [63, 64]. For two random sequences $X^n$ and $Y^n$, the mutual information can also be calculated using a chain rule:

$$I(X^n; Y^n) = H(Y^n) - H(Y^n|X^n) \qquad (3.11)$$
$$= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}, X^n) \qquad (3.12)$$
$$= \sum_{i=1}^{n} I(X^n; Y_i|Y^{i-1}). \qquad (3.13)$$
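To make these definitions concrete, the toy computation below evaluates (3.6) and (3.9)–(3.10) for a small hypothetical joint pmf, and checks that the two forms of mutual information agree.

```python
# A toy computation of entropy and mutual information for a joint pmf P(x, y),
# per (3.6) and (3.9)-(3.10); all entropies are in bits.
import numpy as np

P = np.array([[0.3, 0.1],          # hypothetical joint pmf of (X, Y);
              [0.1, 0.5]])         # rows index x, columns index y
Px, Py = P.sum(axis=1), P.sum(axis=0)            # marginal pmfs

H_X = -np.sum(Px * np.log2(Px))                  # entropy H(X), eq. (3.6)
H_Y = -np.sum(Py * np.log2(Py))                  # entropy H(Y)
H_XY = -np.sum(P * np.log2(P))                   # joint entropy H(X, Y)
I_XY = H_X + H_Y - H_XY                          # mutual information I(X; Y)

# Equivalently, I(X; Y) = sum_{x,y} P(x,y) log [P(x,y) / (P(x)P(y))], eq. (3.10):
assert np.isclose(I_XY, np.sum(P * np.log2(P / np.outer(Px, Py))))
```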
Directed Information Since both correlation and mutual information are non-directional, in 1990, Massey refined Marko's work and proposed the concept of directed information (DI) [24], aiming to measure the directed information flow from one random sequence to another. Given $X^n$ and $Y^n$, the directed information from $X^n$ to $Y^n$ is defined as:

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}). \qquad (3.14)$$

Recall that the causally conditional entropy is defined as:

$$H(Y^n \| X^n) = \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X^i); \qquad (3.15)$$

it then follows from (3.14) and (3.15) that:

$$I(X^n \to Y^n) = H(Y^n) - H(Y^n \| X^n). \qquad (3.16)$$

On the other hand,

$$I(X^n; Y^n) = H(Y^n) - H(Y^n|X^n) \qquad (3.17)$$
$$= I(X^n \to Y^n) + I(Y^{n-1} \to X^n) \qquad (3.18)$$
$$= I(X^{n-1} \to Y^n) + I(Y^{n-1} \to X^n) + \sum_{i=1}^{n} I(X_i; Y_i|X^{i-1}, Y^{i-1}). \qquad (3.19)$$

In (3.19), the first term $I(X^{n-1} \to Y^n)$ specifies the directed information flow from $X^n$ to $Y^n$, the second term $I(Y^{n-1} \to X^n)$ specifies the reverse directed information flow from $Y^n$ to $X^n$, and the third term represents the conditional mutual information shared by $X^n$ and $Y^n$. DI reflects the directional and interactive influence between two random sequences, and has recently been applied to characterize the connectivity between different brain regions [22, 28].

Directed Information and Granger Causality Granger Causality (GC) [7, 22] has long been used to identify causal relations between two random series. The main idea behind Granger Causality analysis is that if one random process $X$ causally influences another random process $Y$, then knowledge of previous values of $X$ will help to decrease errors in predicting future values of $Y$. The calculation of Granger Causality is based on autoregressive or linear prediction models. As can be seen from (3.1), the traditional Granger Causality analysis relies on linear prediction models, and its nonlinear extensions generally still rely on "linear-in-the-parameter" modeling [29]. The directed information, on the other hand, imposes no requirement on models, and hence provides the freedom to characterize more general relationships. In [29], Amblard and Michel investigated the relationship between directed information and Granger Causality, and showed that Granger Causality graphs can be obtained using directed information. They further pointed out that directed information theory provides an adequate information theoretical framework for the connectivity inference problems in neuroscience applications.

3.2.2 Directed Information Calculation and Causality Analysis

Developing practical estimators for directed information measures has always been a challenging problem. Over the last two decades, a limited number of directed information estimators have been proposed. In [22], Quinn et al. utilized DI to infer causality based on neural spike recordings. Their estimator is built upon the assumption that the random sequences corresponding to spike recordings form stationary ergodic Markov processes, and it adopts the simple general linear model (GLM). As a result, the causally conditional entropy can be simplified as $E[g_{JK}(Y_{l-J}^{l}, X_{l-(K-1)}^{l})]$, in which $g$ is a log-probability function, and $J$ and $K$ denote the orders of the Markov processes. Although this method is not model free, its validity has been verified in discovering causal relations among groups of neurons. Verdu et al. proposed a K-nearest neighbor (KNN) based estimator for the KL distance [44]. This idea has been adopted in the work by Chai et al. to estimate entropy and mutual information [3]. Theoretically, directed information can be estimated after obtaining other information theoretic measures such as entropy and mutual information. However, the KNN estimator is based on the assumption that the samples in the random sequences are independent and identically distributed (i.i.d.). Also, this approach requires a large number of data points.

In this chapter, we calculate the directed information $I(X^n \to Y^n)$ by exploiting the method initiated by Weissman et al. [30]. This approach is universal, and not limited by any modeling assumptions on the random sequences. There are two parts in the estimation.
Part I This part has three steps.

1. Estimate $H(Y^n)$:

$$\hat{H}(Y^n) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{1}{P(y_{i+1}|y^{i})} \qquad (3.20)$$
$$\approx \frac{1}{n}\sum_{i=1}^{n} \sum_{x_{i+1},\, y_{i+1}} P(x_{i+1}, y_{i+1}|x^{i}, y^{i}) \log \frac{1}{P(y_{i+1}|y^{i})}. \qquad (3.21)$$

2. Estimate $H(Y^n \| X^n)$:

$$\hat{H}(Y^n \| X^n) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{1}{P(y_i|y^{i-1}, x^{i})} \qquad (3.22)$$
$$\approx \frac{1}{n}\sum_{i=1}^{n} \sum_{x_{i+1},\, y_{i+1}} P(x_{i+1}, y_{i+1}|x^{i}, y^{i}) \log \frac{P(x_{i+1}, y_{i+1}|x^{i}, y^{i})}{P_{X_{i+1}|X^{i}Y^{i}}(x_{i+1}|x^{i}, y^{i})}. \qquad (3.23)$$

3. It then follows that $I(X^n \to Y^n)$ can be estimated as $\hat{H}(Y^n) - \hat{H}(Y^n \| X^n)$.

Part II Regarding the validity of this estimation, it has been shown in the work by Weissman et al. [30] that as $n \to \infty$, $\hat{I}(X^n \to Y^n)$ converges to the true value of $I(X^n \to Y^n)$.

To measure the causal influence of one region on another, we resort to $D_n = I(X^n \to Y^n) - I(Y^n \to X^n)$. Using (3.19), we have:

$$D_n = I(X^n \to Y^n) - I(Y^n \to X^n) \qquad (3.24)$$
$$= [I(X^n; Y^n) - I(Y^{n-1} \to X^n)] - [I(X^n; Y^n) - I(X^{n-1} \to Y^n)] \qquad (3.25)$$
$$= I(X^{n-1} \to Y^n) - I(Y^{n-1} \to X^n). \qquad (3.26)$$

As shown in (3.24), $D_n$ is the difference of the two directed informations between $X^n$ and $Y^n$. If $D_n$ is positive, that is, $I(X^{n-1} \to Y^n) > I(Y^{n-1} \to X^n)$, then we say that $X^n$ shows more influence on $Y^n$ and can be interpreted as the causal driver during the connectivity; otherwise, we say $Y^n$ shows more influence on $X^n$.

To make the results more comparable, we will use $\gamma = D_n / I(X^n; Y^n)$ instead of $D_n$. Clearly, $\gamma \in [-1, 1]$. When $|\gamma|$ approaches 1, it can be said with high confidence that there does exist a causal influence between the two stochastic processes; when $|\gamma|$ is close to 0, it is more likely that no clear causal relationship exists, or that the samples in the random sequences are subject to strong noise. Therefore, in the simulations, a threshold based method is developed for interpreting the $\gamma$ metric; a small sketch follows below.

The causality analysis of two brain regions helps us to understand which region is more likely to be the causal driver during a particular connectivity. However, we would like to point out that brain connectivity between two different regions generally involves dynamic two-way information transmission between them, rather than a fixed one-way source-to-destination relationship.
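As a small illustration of this metric, the sketch below (with hypothetical names) turns estimates of the two directed informations and the mutual information into $D_n$, $\gamma$, and a threshold based interpretation; the threshold value 0.1 anticipates the calibration on simulated data in Section 3.4.

```python
# A sketch of the causality metric: D_n = I(X->Y) - I(Y->X), gamma = D_n / I(X;Y).
def causal_metric(di_xy, di_yx, mi, threshold=0.1):
    d_n = di_xy - di_yx            # D_n per (3.24)
    gamma = d_n / mi               # gamma in [-1, 1]
    if gamma > threshold:
        verdict = "X is more likely the causal driver"
    elif gamma < -threshold:
        verdict = "Y is more likely the causal driver"
    else:
        verdict = "no clear dominant influence"
    return gamma, verdict
```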
3.2.3 Practical Evaluation

Based on our discussions in Section 2.2, for the practical evaluation of directed information, the main point is how to accurately estimate the probabilities involved in (3.20) from discrete time data or observations. There are two major issues in probability estimation. First, how to choose the optimal bin size for digitization. Second, how to estimate the probability of a particular sequence after the random sequence has been mapped into a series of symbols.

3.2.3.0.1 Optimal Bin Size for Time Series Digitization The first problem in directed information estimation for fMRI signals is digitization. If the bin size is too large, it results in considerable approximation error and cannot reflect the true data distribution accurately. If the bin size is too small, the number of samples falling into each bin tends to be 0 or 1 due to the very limited data length; as a result, the probability estimation becomes inaccurate. In fact, when the bin size is too large or too small, the estimated DI will approach zero, due to the limited data length. Here we choose the bin size that minimizes the Integrated Mean Square Error (IMSE) between the estimated probability and its true value [65].

The random sequences in the fMRI signals take values in the real number field, while the directed information probability estimation can be carried out only over discrete, finite-size alphabets. Hence, mapping real-valued numbers to symbols within a finite alphabet is the first step for further processing. In the digitization process, we adopt the traditional histogram method to estimate the probability of the data points falling into each of the bins representing symbols in the alphabet. Suppose the true probability density function (pdf) of a random variable $X$ is $f(x)$, and the estimated pdf is $\hat{f}(x)$; the IMSE is defined as:

$$IMSE = \int E\{\hat{f}(x) - f(x)\}^2 dx. \qquad (3.27)$$

For a random sequence of length $n$, the optimal bin size that minimizes the IMSE is given by:

$$h_n^* = \left\{\frac{n}{6}\int_{-\infty}^{\infty} f'(x)^2 dx\right\}^{-1/3}. \qquad (3.28)$$

Assume the random sequence $x^n$ was sampled from a white stochastic process with distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

where $\mu$ is the mean and $\sigma$ the standard deviation of the signal. Then the optimal bin size for digitization can be obtained as:

$$h_n^* = \left\{\frac{n}{6}\int_{-\infty}^{\infty}\left[-(x-\mu)e^{-\frac{(x-\mu)^2}{2\sigma^2}}/(\sqrt{2\pi}\sigma^3)\right]^2 dx\right\}^{-1/3} \qquad (3.29)$$
$$= \left\{\frac{n}{6}\int_{0}^{\infty} u^2 e^{-u^2} du\,/(\pi\sigma^3)\right\}^{-1/3} \qquad \left(\text{let } u = \frac{x-\mu}{\sigma}\right) \qquad (3.30)$$
$$= \left\{\frac{n}{6}\cdot\frac{1}{4\sqrt{\pi}\sigma^3}\right\}^{-1/3} \qquad (3.31)$$
$$\approx 3.49\,\sigma n^{-1/3}. \qquad (3.32)$$

The result in (3.31) relies on the fact that $\int_0^{\infty} e^{-x^2} dx = \sqrt{\pi}/2$, which, via integration by parts, gives $\int_0^{\infty} u^2 e^{-u^2} du = \sqrt{\pi}/4$. In this chapter, the fMRI data sequence is regarded as a Gaussian random process. For digitization, the bin size is chosen according to (3.32).

3.2.3.0.2 Probability Estimation After digitization, the real-valued fMRI time courses become a sequence of symbols $x^n$ over an alphabet $\Omega$. Denote the alphabet as $\Omega = \{x \,|\, x \in \{0, 1, \ldots, M-1\}\}$, where $M = |\Omega|$, and let $N_0, N_1, \ldots, N_{M-1}$ represent the counts of each symbol in the alphabet, respectively. The next step is to estimate the sequence probabilities $P(x^i)$ and $P(x_i|x^{i-1})$, $i \in [1, n]$. Here we resort to the Krichevsky-Trofimov (KT) estimator [66]. The primary reason for using the KT estimator is that it is universal and does not put any specific modeling constraints on the random sequence. It has been shown in [30, 66] that although the KT estimator is not optimal, the bias it introduces is upper bounded.

The KT estimator first assigns 1 as the initial value of the sequence probability, and updates this value as the sequence goes on. In each step, the algorithm analyzes the current sequence and updates a list of count numbers for each symbol in the alphabet. Denote this list as $\{N_0, N_1, \ldots, N_{M-1}\}$. The algorithm goes as follows:

Initialize: $x^0 = \emptyset$, $\{N_0, N_1, \ldots, N_{M-1}\} = \{0, 0, \ldots, 0\}$, $P(\emptyset) = 1$
while $i \leq n$ do
    observe $X_i = j$, $j \in [0, M-1]$, and set $x^i \leftarrow \{x^{i-1}, j\}$;
    $P(x^i) \leftarrow P(x^{i-1}) \times \dfrac{N_j + 0.5}{N_0 + N_1 + \ldots + N_{M-1} + M/2}$;
    $N_j \leftarrow N_j + 1$
end while

After estimating the probability $P(x^n)$, the conditional probability $P_{X_{i+1}|X^i}(x_{i+1}|x^i)$ can be obtained as $P(x^{i+1})/P(x^i)$.
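The two steps above can be sketched as follows, assuming a roughly Gaussian input sequence; this is a simplified illustration with hypothetical function names, not the exact implementation used for the results below.

```python
# A sketch of (i) digitization with the Gaussian-optimal bin size from (3.32),
# and (ii) KT estimation of the sequence probabilities P(x^1), ..., P(x^n).
import numpy as np

def digitize(x):
    """Map a real-valued sequence to symbols 0..M-1 with bin size 3.49*sigma*n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    h = 3.49 * np.std(x) * len(x) ** (-1.0 / 3)        # bin size from (3.32)
    symbols = np.floor((x - x.min()) / h).astype(int)  # histogram bin index per sample
    return symbols, int(symbols.max()) + 1             # symbol sequence, alphabet size M

def kt_sequence_probs(symbols, M):
    """Krichevsky-Trofimov estimate of P(x^i) for every prefix of the sequence."""
    counts = np.zeros(M)
    p, probs = 1.0, []                                 # P(empty sequence) = 1
    for s in symbols:
        # KT update rule: multiply by (N_s + 1/2) / (sum of counts + M/2),
        # using the counts observed so far (before incrementing).
        p *= (counts[s] + 0.5) / (counts.sum() + M / 2)
        counts[s] += 1
        probs.append(p)
    return probs          # conditional probabilities follow as ratios of successive entries
```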
3.3 Materials

3.3.1 Data Acquisition

Fourteen right-handed healthy college students (7 males, 23.4 ± 4.2 years of age) from Michigan State University volunteered to participate in this study and signed consent forms approved by the Michigan State University Institutional Review Board. The experiment was conducted on a 3T GE Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. For each subject, fMRI datasets were collected first under a visual stimulation condition with a scene-object fMRI paradigm, and then under a resting-state condition. The parameters for the fMRI scans were: gradient-echo EPI; 36 contiguous 3-mm axial slices in an interleaved order; time of echo (TE) = 27.7 ms; time of repetition (TR) = 2500 ms; flip angle = 80°; field of view (FOV) = 22 cm; matrix size = 64 × 64; ramp sampling; and the first four data points discarded.

Under the visual stimulation fMRI condition, the volume of images was acquired 192 times (8 min) while a subject was presented with 12 blocks of visual stimulation after an initial 10 s resting period. In a predefined randomized order, scenery pictures were presented in 6 blocks and object pictures in the other 6 blocks. All pictures were unique. In each block, 10 pictures were presented continuously for 25 s (2.5 s per picture), followed by a 15 s baseline condition (a white screen with a black fixation cross at the center). The subject needed to press a button once with his/her right index finger when the screen switched from the baseline to the picture condition. Stimuli were displayed in color in full screen on a 1024 × 768 32-inch LCD monitor (Salvagione Design, Sausalito, CA) placed at the back of the magnet room. The LCD subtended 10.2° × 13.1° of visual angle. Under the resting-state fMRI (rs-fMRI) condition, the volume of images was acquired 164 times (6 min and 50 s) after the subject was instructed to relax, keep his/her eyes closed, and stay awake throughout the scan.

After the above functional data acquisition, high-resolution volumetric T1-weighted spoiled gradient-recalled (SPGR) images with cerebrospinal fluid suppressed were obtained to cover the whole brain with 120 1.5-mm sagittal slices, an 8° flip angle and a 24 cm FOV. These images were used to identify anatomical locations.

3.3.2 fMRI Data Pre-processing and Analysis

All stimulus fMRI data pre-processing and analysis for each subject were conducted with the AFNI software (Cox, 1996), as described in Henderson et al. [61]. Essentially, slice-timing correction and rigid-body motion correction were carried out. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. Multiple linear regressions (using the "3dDeconvolve" routine in AFNI) were applied on a voxel-wise basis to find the magnitude change when each picture condition was presented, followed by general linear tests to find the statistical significance between stimulus conditions.

The regions of interest (ROI) in this study were defined in the Talairach coordinate space [67]. Regions showing preferential activation to scenes over objects (voxel-based p-value < 10^-4) in the right and left parahippocampal gyri were defined as the right and left PPA [61]. The right and left V1 ROIs were defined as the regions activated by pictures (voxel-based p-value < 10^-10) within Brodmann area 17. Because there was a high level of activation at and around V1, a highly conservative p-value threshold was chosen to define relatively focal ROIs. The right and left SMC spherical ROIs with 6-mm radius were defined with centers at (R36, P22, S54) and (L38, P26, S50), respectively, in the Talairach coordinate space (R = Right, L = Left, P = Posterior, S = Superior). The SMC coordinate locations were defined by Witt et al. [68], and the ROIs were created as in Zhu et al. [69]. The time courses from the stimulation fMRI dataset that were already pre-processed as above were also detrended and had their baselines removed. The spatially averaged time course at each of the above ROIs was generated for the causality analyses discussed later.
The rs-fMRI data were also pre-processed in AFNI [70], as commonly applied in the field and as described in detail in Zhu et al. [69]. Essentially, slice-timing correction and rigid-body motion correction were carried out. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. The time courses were detrended and the baselines were removed. Brain global, cerebrospinal fluid and white-matter mean signals were modeled as nuisance variables and removed from the time courses. Finally, the time courses were band-pass filtered to the range of 0.009 Hz – 0.08 Hz. The spatially averaged time course at each of the above ROIs was generated for the causality analyses discussed later.

3.3.3 Simulated Data

The simulated data were synthesized from the fMRI data corresponding to the primary visual cortex (V1). Recall that the total number of samples in the time series at V1 is 192, as described earlier. Denote this sequence as $x^n = [x_1, x_2, \ldots, x_n]$, where $n = 192$. Here we use two sets of simulated data.

Set I For $i \in [1, 2, \ldots, 192]$, the first set of simulated data, $y_1^n = [y_{1,1}, y_{1,2}, \ldots, y_{1,n}]$, was obtained as:

$$y_{1,i} = 0.3x_i + 0.2x_{i-1}. \qquad (3.33)$$

It is clear that $X^n$ has a causal influence on $Y^n$. The true fMRI data and simulated data set I form a linear causal relationship.

Set II For further comparison, we introduced another set that has a nonlinear relationship with the true fMRI data. For $i \in [1, 2, \ldots, 192]$, the second set of simulated data, $y_2^n = [y_{2,1}, y_{2,2}, \ldots, y_{2,n}]$, was obtained as:

$$y_{2,i} = \begin{cases} 1 & \text{if } x_i \geq 0; \\ 0 & \text{if } x_i < 0. \end{cases} \qquad (3.34)$$

Clearly, the nonlinear relationship in the second set is difficult to capture with a linear autoregression model. It also introduces a significant change in the power level in comparison with the true fMRI data.

To make the data more realistic, we added white Gaussian noise to $y_1^n$ and $y_2^n$, where the noise is of zero mean and variance $\sigma_0^2$. In the simulation, the Signal-to-Noise Ratio (SNR), calculated as $20\log(\sigma_x/\sigma_0)$, was set in a range between 4 dB and 20 dB.

Before performing the fMRI causality analysis, we carried out both DI based causality analysis and GC analysis on $x^n$ and $y^n$ for method validation. The analysis based on the simulated data helps to set up a threshold for the metric $\gamma = [I(X^{n-1} \to Y^n) - I(Y^{n-1} \to X^n)]/I(X^n; Y^n)$. Note that the subtraction operation in $\gamma$ may help to reduce the noise effect in the fMRI data. In the Granger Causality analysis, the F-test was adopted to generate the p-values, and the maximum lag in the test was set to 2.
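For illustration, the two simulated sets could be generated along the following lines, assuming a V1 time course x of length 192; the function name and the random seed are hypothetical.

```python
# A sketch of generating simulated sets I and II from a V1 time course x,
# per (3.33)-(3.34), with white Gaussian noise at a prescribed SNR in dB.
import numpy as np

def simulate_sets(x, snr_db, rng=np.random.default_rng(0)):
    x = np.asarray(x, dtype=float)
    y1 = 0.3 * x + 0.2 * np.roll(x, 1)        # Set I: linear causal model (3.33)
    y1[0] = 0.3 * x[0]                        # no x_{i-1} for the first sample
    y2 = (x >= 0).astype(float)               # Set II: hard nonlinearity (3.34)
    # SNR = 20 log10(sigma_x / sigma_0)  =>  sigma_0 = sigma_x / 10^(SNR/20)
    sigma0 = np.std(x) / 10 ** (snr_db / 20)
    noise = rng.normal(0.0, sigma0, (2, len(x)))
    return y1 + noise[0], y2 + noise[1]
```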
3.4 Results

In this section, we demonstrate the effectiveness of DI based causality analysis using both simulated and acquired fMRI data, and compare the results with those of the GC analysis.

3.4.1 Causality between the fMRI data and the simulated data derived from it

In this subsection, we validate the DI based causality analysis approach using the fMRI data and the simulated data generated from it.

DI based Causality Analysis Fig. 3.1(a) and 3.1(b) present the DI based causality analysis results corresponding to the fMRI data and simulated data set I. Fig. 3.1(a) shows the comparison of the mutual information $I(x^n; y_1^n)$, the directed information $I(x^n \to y_1^n)$, and the reversed directed information $I(y_1^n \to x^n)$. In this example, SNR = 8 dB. Clearly, $I(x^n \to y_1^n) > I(y_1^n \to x^n)$. As can be seen, the estimated directed information shows a clear surplus in the direction $x^n \to y_1^n$.

[Figure 3.1: Directed Information based test results: mutual information, directed information and the γ metric. Here $x^n$ denotes the fMRI data, $y_1^n$ the simulation data set I, and $y_2^n$ the simulation data set II. (a) MI and DI between $x^n$ and $y_1^n$; (b) γ versus SNR corresponding to $x^n$ and $y_1^n$; (c) MI and DI between $x^n$ and $y_2^n$; (d) γ versus SNR corresponding to $x^n$ and $y_2^n$.]

This implies that there is a causal relationship between $x^n$ and $y_1^n$, and that $x^n$ is more likely to be the cause, as expected. Figure 3.1(b) shows the values of γ at different SNR levels. As can be seen, when the SNR is above 6 dB (i.e., the noise level is relatively low), we can observe a clear causal relationship in the direction $x^n \to y_1^n$. As the noise level gets higher, i.e., when SNR < 5 dB, the causal relationship becomes ambiguous.

The results corresponding to the fMRI data $x^n$ and simulated data set II, $y_2^n$, are shown in Fig. 3.1(c) and 3.1(d). As can be seen, the results are similar to those corresponding to $x^n$ and $y_1^n$. Again, the DI based causality analysis indicates that there is a causal relationship between $x^n$ and $y_2^n$, and that $x^n$ is more likely to be the cause, as expected.

It should be noted that the relationship between $y_1^n$ and $x^n$ is linear, while the relationship between $y_2^n$ and $x^n$ is nonlinear. It can be seen that the DI based causality analysis is effective in the nonlinear case as well. The analysis results are consistent with our a priori knowledge that there are causal relationships between $x^n$ and $y_1^n$, and between $x^n$ and $y_2^n$, with $x^n$ as the causal side in both cases. Based on our simulation results on $x^n$ and $y_1^n$, which is more similar to true fMRI data than $y_2^n$, we found that:

• γ ∈ [0.1, 1] implies that X has more causal influence on Y; accordingly, γ ∈ [-1, -0.1] implies that Y has more causal influence on X;

• γ ∈ [-0.1, 0.1] implies that there is no clear dominant influence between X and Y.

Granger Causality Analysis We then apply GC analysis to $x^n$ and $y_1^n$, and to $x^n$ and $y_2^n$; the results are shown in Fig. 3.2. Fig. 3.2(a) and Fig. 3.2(b) show the p-values of the GC analysis corresponding to $x^n \to y_1^n$ and $y_1^n \to x^n$, respectively.

[Figure 3.2: Inter-region GC test results based on fMRI data and two sets of simulation data generated from it. (a) Test results for the direction $x^n \to y_1^n$; (b) test results for the direction $y_1^n \to x^n$; (c) test results for the direction $x^n \to y_2^n$; (d) test results for the direction $y_2^n \to x^n$.]

As can be seen, there is a clear causal relationship in the direction $x^n \to y_1^n$; most medians in the boxes indicate a highly significant causal relationship (p < 0.0002, much smaller than the commonly accepted p-value of 0.01) [71]. These results lead to the expected conclusion that the two random sequences $x^n$ and $y_1^n$ are Granger Causally related.
However, for the test on $x^n$ and $y_2^n$, in which the causal relationship is completely nonlinear, the Granger Causality test was not able to capture the causal relationship $x^n \to y_2^n$. In both Fig. 3.2(c) and Fig. 3.2(d), most medians in the boxes indicate that there is no significant causal relationship between these two time sequences (0.2 < p < 0.6), leading to the unexpected conclusion that $x^n$ and $y_2^n$ are not causally related.

3.4.2 Causality Analysis Based on the Experimental fMRI Data

In this section, we apply both DI based causality analysis and GC analysis to the experimental fMRI data. We collected both stimulation based fMRI data with a well-defined block-design scene-object fMRI paradigm, as discussed earlier [61, 62], and resting-state fMRI data. Recall that in the scene-object paradigm, subjects viewed blocks of scenery and object pictures. They were asked to press a button once with the right index finger when they saw a block of pictures. We test the robustness of our causality analysis techniques against some expected outcomes: under the stimulation fMRI paradigm, the primary visual cortex (V1) and nearby regions are activated first, followed by activation in the parahippocampal place area (PPA) for higher level scene processing. Some, but relatively small, activations in the left sensorimotor cortex (SMC) are also expected following the V1 activations. Overall, sequential neuronal activity is not expected between the right and left homologous regions above. Under the resting-state condition, neuronal activity is not expected to occur in a sequential manner among the above regions.

DI based Causality Analysis Here we examine the potential causal relationships between left V1 and left PPA, and between left V1 and left SMC, under both the resting state and visual stimulation conditions. The directed information based γ values are shown in Figure 3.3.

[Figure 3.3: Inter-region γ values of directed information based causality analysis. (a) V1 and PPA; (b) V1 and SMC.]

In the resting state condition, the medians of the γ values are within the [-0.1, 0.1] region. The left V1 does not exhibit a dominating causal influence over the other regions, including left PPA and left SMC. However, under the stimulation paradigm, the γ values for left V1 → left PPA and left V1 → left SMC increase significantly. In other words, under the stimulation, left V1 shows stronger influence over left PPA, as well as left SMC, as expected.

Figure 3.4 shows the γ values between the right and left homologous brain regions in both the resting state and the stimulus-based state. As can be seen, the median values are well below 0.1. That is, the directed information based causality analysis indicates that there is no dominating influence between the left and right homologous brain regions.

[Figure 3.4: Inner-region (left-right) γ values of the directed information based causality analysis. (a) V1; (b) PPA; (c) SMC.]

Granger Causality Analysis As in the DI based analysis, we carry out parallel GC analyses between brain regions (Figure 3.5).

[Figure 3.5: Inter-region Granger Causality test results. (a) V1 to PPA; (b) V1 to SMC; (c) PPA to V1; (d) SMC to V1.]
For the left V1–left PPA pair (Figure 3.5 (a) and (c)), the GC analyses indicate that there is a dominating influence of left V1 over left PPA under both the resting-state (median p = 0.004) and stimulation (median p = 0.0005) conditions. However, a causal relationship between left V1 and left PPA is not expected in the resting state condition. With the Granger Causality analysis, it is difficult to distinguish between the resting state and the stimulus-based state, as the p-values in both states are small enough to indicate a causal relationship. For the left V1–left SMC pair (Figure 3.5 (b) and (d)), the Granger Causality analysis seems to have reversed the expected causal relationship, indicating that SMC is more likely to be the cause.

Figure 3.6 shows the results of the GC analysis for the right and left homologous brain regions, including V1, PPA and SMC. The results indicate that the information flow between each pair of homologous regions was very unbalanced, and varied significantly in most cases. This contradicts the expected non-sequential activation between them, as it is believed that the right and left homologous regions should not show significant sequential activation.

[Figure 3.6: Inner-region (left-right) Granger Causality test results. (a) V1; (b) PPA; (c) SMC.]

3.4.3 Impact of Hemodynamics on DI Based Causality Analysis

In this subsection, we evaluate the impact of hemodynamics (i.e., the blood flow or the circulation) on the performance of DI based causality analysis. We take a representative model for the hemodynamic response function, $h(t) = t^{8.6}e^{-t/0.547}$ [72]. Let $x(t)$ be a sawtooth waveform (to mimic the situation under periodic stimulation), and $y(t) = 0.3x(t) + 0.2x(t-1)$. Clearly, $x(t)$ is the causal side. Define:

$$\hat{x}(t) = h_x(t) * x(t), \qquad (3.35)$$
$$\hat{y}(t) = h_y(t) * y(t). \qquad (3.36)$$

Recall that the Dirac delta function, defined as $\delta(t) = 0$ for all $t \neq 0$ with $\int_{-\infty}^{\infty}\delta(t)dt = 1$, is generally used to model the impulse response of an ideal communication channel. We conduct DI based causal analysis for $\hat{x}$ and $\hat{y}$ under the following scenarios (a sketch of this simulation setup is given after the list):

• Group 1:
  – case 1.1: $h_x(t) = \delta(t)$, $h_y(t) = \delta(t)$;
  – case 1.2: $h_x(t) = h(t)$, $h_y(t) = h(t)$;
  – case 1.3: $h_x(t) = \delta(t)$, $h_y(t) = 3\delta(t)$;

• Group 2:
  – case 2.1: $h_x(t) = \delta(t)$, $h_y(t) = \delta(t)$;
  – case 2.2: $h_x(t) = 0.3h(t) + 0.2h(t-1)$, $h_y(t) = 0.3h(t) + 0.2h(t-1)$;
  – case 2.3: $h_x(t) = h(t)$, $h_y(t) = 0.3h(t) + 0.2h(t-1)$;
  – case 2.4: $h_x(t) = h(t)$, $h_y(t) = 0.7h(t) + 0.4h(t-1)$;

• Group 3:
  – case 3.1: $h_x(t) = \delta(t)$, $h_y(t) = \delta(t)$;
  – case 3.2: $h_x(t) = h(t)$, $h_y(t) = h(2t)$.
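A sketch of the simulation setup is given below, assuming a 0.1 s sampling step and a 0.1 Hz sawtooth; the specific grid choices are assumptions for illustration only.

```python
# A sketch of the hemodynamic simulation setup: build x(t) and
# y(t) = 0.3 x(t) + 0.2 x(t - 1), then convolve with h(t) = t^8.6 exp(-t/0.547).
import numpy as np
from scipy import signal

dt = 0.1                                         # assumed sampling step (s)
t = np.arange(0.0, 60.0, dt)
x = signal.sawtooth(2 * np.pi * 0.1 * t)         # periodic stimulation proxy
y = 0.3 * x + 0.2 * np.roll(x, int(1.0 / dt))    # 1 s lag; wraparound is harmless
                                                 # for a periodic input
th = np.arange(dt, 30.0, dt)                     # support of the response
h = th ** 8.6 * np.exp(-th / 0.547)              # hemodynamic response model
h /= h.sum()                                     # normalize to unit gain

x_hat = np.convolve(x, h)[: len(t)]              # (3.35) with h_x = h
y_hat = np.convolve(y, h)[: len(t)]              # (3.36) with h_y = h (case 1.2)
```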
The results are shown in Figure 3.7. We look at Group 1 first. It can be observed that when the hemodynamic response functions $h_x(t)$ and $h_y(t)$ are identical, the causal relationship between $\hat{x}(t)$ and $\hat{y}(t)$ is the same as that between $x(t)$ and $y(t)$; that is, $\hat{x}(t)$ is the causal side. However, in case 1.3, when the power level of $\hat{y}(t)$ is much higher than that of $\hat{x}(t)$, the causal relationship is either reversed or becomes ambiguous. In Group 2, we consider the multipath channel model. For case 2.1 and case 2.2, we can see clearly that $\hat{x}(t)$ is the causal side; but in case 2.3, where the power level of $\hat{y}(t)$ is higher than that of $\hat{x}(t)$, the causal relationship is again reversed or becomes ambiguous. In Group 3, we consider the case where the hemodynamic response $h_y(t)$ changes much faster than $h_x(t)$. From Figure 3.7(c), it can be seen that this does not hamper the DI based method: we can see clearly that $\hat{x}(t)$ is the causal side.

[Figure 3.7: DI under different hemodynamic response functions. (a) Group 1; (b) Group 2; (c) Group 3.]

For comparison purposes, we examine the impact of hemodynamics on GC analysis as well. In most cases, GC cannot distinguish which side is the causal side. Due to space limits, only the results for case 2.4 are shown here; please refer to Figure 3.8. As can be seen, based on the p-values, both $x(t)$ and $y(t)$ are identified as the causal side by GC. Our analysis indicates that GC is more sensitive to hemodynamic effects.

[Figure 3.8: Granger Causality test results for case 2.4.]

3.4.4 Summary of Results

In the simulations, we first performed DI based causality analysis between the fMRI data and the simulated data, and compared the results with those of the GC analysis. Two sets of simulated data were generated from the fMRI data: Set I was obtained by convolving the fMRI data with a simple causal model; Set II was obtained by mapping the positive points in the fMRI data to ones, and the negative points to zeros. The simulation results showed that the Granger Causality analysis could identify the cause clearly between the fMRI data and data set I, but failed in the test corresponding to data set II, due to the severe nonlinearity of the data. The directed information based approach, on the other hand, can identify the cause accurately in both cases as long as the SNR is above 5 dB. As can be seen, GC analysis is effective in detecting linear or nearly linear causal relationships, but has difficulty in capturing nonlinear causal relationships. The underlying reason is that GC analysis relies heavily on linear prediction theory, or on linear modeling of the involved parameters. The directed information based causality analysis, on the other hand, does not place any modeling constraints on the sequences to be evaluated, and hence can be used to capture both linear and non-linear causal relationships.

We then applied both the DI based analysis and the GC analysis to examine the causal relationships in the V1–PPA and V1–SMC pairs, as well as between the right and left homologous brain regions, including V1, PPA and SMC. From the DI based analysis, we observed that: (i) In the resting state, there is no dominant cause for either the V1–PPA or the V1–SMC pair. (ii) In the stimulation based state, V1 turns out to be the cause in the V1–SMC pair; for the V1–PPA pair, although not as strongly as in the V1–SMC pair, V1 is more likely to be the causal part. (iii) In both the resting state and the stimulus-based state, there is no dominant cause observed between the right and left homologous brain regions. For the GC analysis, it can be seen that: (i) The results for the V1–PPA pair are consistent with those of the DI analysis.
(ii) For the V1–SMC pair, the results corresponding to the resting state are also consistent with the DI analysis; however, in the stimulation based state, the GC analysis shows that SMC is more likely to be the cause, which contradicts the expected sequential brain activation (V1 to SMC). In this paradigm, the activity in SMC is weak, which could be a reason why the GC analysis could not detect the sequential activity. (iii) The information flows between each pair of homologous regions were not balanced in most cases, which contradicts the expected non-sequential activation between them.

Finally, we evaluated the impact of hemodynamic effects on the DI based causality analysis method using simulated data. We observed that: (i) even if the hemodynamic response function of the driving side changes more slowly than that of the other side, the proposed DI method can still identify the causal side accurately; (ii) however, when the power level of the driving side is much lower than that of the other side, the causal relationship may be reversed or become ambiguous. This is because, in the digitization process, a higher power level maps to higher entropy, and in BOLD imaging, a higher fMRI amplitude implies a more significant activity level. We will investigate this further in the future. Our results indicate that DI is an effective technique to quantify the overall causal relationship. It is also observed that brain connectivity between two different regions generally involves dynamic two-way information transmission between them, rather than a fixed one-way source-to-destination relationship.

3.5 Summary

In this chapter, we presented the directed information framework and showed how to apply it to fMRI causality analysis. We provided the detailed procedure on how to calculate the DI for two finite time series. The two major steps involved are optimal bin size selection for data digitization, and probability estimation. We applied the DI based causality analysis to both simulated data and experimental fMRI data, and compared the results with those of the Granger Causality analysis. Our results indicated that GC analysis is effective in detecting linear or nearly linear causal relationships, but has difficulty in capturing nonlinear causal relationships. On the other hand, DI based causality analysis is effective in capturing both linear and non-linear causal relationships. Moreover, it was observed that brain connectivity among different regions generally involves dynamic two-way information transmissions between them. Our results showed that when bidirectional information flow is present, DI is more effective than GC in quantifying the overall causal relationship. We would also like to point out that with the DI based approach, the performance improves as the data size increases, because the probability estimation gets more accurate as more samples become available. For future work, we will continue our research on functional and effective brain connectivity by combining conventional information theory, the directed information framework, and network-level information theory.

Chapter 4

Discrete DCM and Its Relationship with Directed Information and Granger Causality

In this chapter, we explore the discrete Dynamic Causal Modeling (DDCM) framework and its relationship with Directed Information (DI) and Granger Causality (GC). First, we demonstrate the relationship between DDCM and the continuous DCM.
Second, we prove the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. This equivalence between DDCM and DI also provides an effective method for DI estimation. Moreover, it is shown that when the hemodynamic system is invertible, the DI-based causal relationship between the neurostates of two brain regions is the same as that between the observed BOLD signals. Finally, we illustrate the similarities and differences between DDCM and GC: although they share a similar mathematical form, the causality measures they utilize are completely different. The theoretical results are demonstrated using fMRI data obtained under both the resting state and a stimulus based state. Our numerical analysis is consistent with that reported in previous studies.

4.1 Introduction

Causality analysis aims to find the relationship between causes and effects. It provides insightful information on how brain regions interact with each other during a cognitive task [6]. In general, causality analysis tries to determine whether the values of one time series are useful in predicting the future values of another time series. Since the 1990s, a number of frameworks have been applied to fMRI based causality analysis. Among them, Granger Causality (GC), Directed Information (DI), and Dynamic Causal Modeling (DCM) are three representative approaches. In this chapter, we will revisit these three causality analysis frameworks and discuss the relationships between them, especially the relationship between DCM and DI.

Granger Causality The first practical causality analysis framework was proposed by Granger in 1969 [7]. The main idea is: if two signals $X_1$ and $X_2$ form a causal relationship, then, instead of using the past values of $X_2$ alone, the information contained in the past values of $X_1$ will help predict $X_2$. More specifically, the calculation of Granger Causality is based on linear prediction models. Suppose $X_1^n = [X_1(1), X_1(2), \ldots, X_1(n)]$ and $X_2^n = [X_2(1), X_2(2), \ldots, X_2(n)]$ are two time series observed from two brain regions, respectively. Granger Causality compares the prediction errors $e_r$ and $\tilde{e}_r$ in the following equations:

$$X_2(k+1) = \sum_{l=0}^{L-1} a_l X_2(k-l) + e_r, \qquad (4.1)$$

$$X_2(k+1) = \sum_{l=0}^{L-1} \left[ b_l X_1(k-l) + c_l X_2(k-l) \right] + \tilde{e}_r, \qquad (4.2)$$

for $k = 1, 2, \ldots, n$. Here, $e_r$ is the error of predicting $X_2$ based only on the previous values of $X_2$, and $\tilde{e}_r$ is the error of predicting $X_2$ based on the previous values of both $X_2$ and $X_1$. If $\tilde{e}_r$ is much smaller than $e_r$, that is, if the introduction of the previous values of $X_1$ improves the prediction accuracy, then we say there is a Granger causal relationship between $X_1$ and $X_2$.

In the literature, there has been growing interest in applying GC to identify causal interactions in the brain [8–11]. For example, in [10], Hu et al. applied GC analysis on fMRI data to evaluate the causal relationship among specific brain regions, so as to understand the impact of amnesic mild cognitive impairment (aMCI) on brain connectivity. In [11], David et al. applied GC (together with Dynamic Causal Modeling) to a combination of fMRI and EEG data. Their experiments showed that as the hemodynamics (i.e., the blood flow or the circulation) vary from region to region, GC may not be applied directly to the fMRI signals. However, when the hemodynamic effects were explicitly removed, the GC test can perform effective causality analysis for linear relationships.
As a widely accepted technique, the validity and computational simplicity of Granger Causality have been well appreciated. However, it has also been noticed that GC relies heavily on the linear prediction method. When there exist instantaneous and/or strong nonlinear interactions between two regions, GC analysis may lead to invalid results [11, 27]. To address this problem, several approaches to nonlinear Granger Causality have been proposed. For example, in [56], Bezruchko et al. proposed an auto-regression model constructed in the form of a polynomial. More recently, Marinazzo et al. proposed a method to generalize GC to the nonlinear case using the kernel technique [57]. The copula approach has been applied for GC assessment in [9, 58]. A comprehensive discussion on nonlinear GC can be found in [59].

Directed Information Directed Information is an information theoretic metric, which was first introduced by Massey when studying communication channels with feedback [24]. It measures the directed information flow from one time series $X$ to another time series $Y$, denoted as $I(X \to Y)$. If $I(X \to Y) > I(Y \to X)$, then we say $X$ has more influence on $Y$, or $X$ is the causal side in the connectivity. DI is a universal method. Unlike GC, which mainly relies on linear prediction theory, or linear modeling of the involved parameters, DI-based causality analysis does not impose any modeling constraint on the sequences to be evaluated and, hence, can be used to characterize more general relationships. This advantage of DI has been reported in recent advances in causality analysis [22, 25, 26]. In [27], it was pointed out that GC analysis is effective in detecting linear or nearly linear causal relationships, but may have difficulty in capturing nonlinear causal relationships; DI-based causality analysis, on the other hand, is more effective in capturing both linear and nonlinear causal relationships. In [28], Liu et al. applied DI to EEG data and compared the results with those of GC. Their conclusion was that the DI based approach could be superior to GC in capturing instantaneous and nonlinear causal relationships in EEG data. Moreover, in [29], it was shown that the Granger Causality graphs of stochastic processes can be generated from the DI framework, and the authors indicated that DI theory provides an adequate framework for the connectivity inference problems in neuroscience applications. A comprehensive investigation of DI can be found in [30].

Dynamic Causal Modeling In 2003, Friston proposed the framework of Dynamic Causal Modeling to describe the general interactions among a group of brain regions [15]. DCM assumes that the invisible neurostate $X$, the (external) input $U$, the observed BOLD signal $Y$, the parameter $\theta$ that characterizes the connectivities between different brain regions, and the independent noise $\Omega$ form a dynamic system that can be described by the following equations:

$$\dot{X} = f(X, U, \theta) \quad \text{and} \quad Y = \Lambda(X) + \Omega, \qquad (4.3)$$

where $\Lambda$ represents a cascade of differential equations which map the neurostate $X$ to the observed BOLD signal $Y$. Relying on the EM algorithm, DCM has been implemented on both fMRI and EEG data [16]. In practical applications, due to its computational complexity, DCM is usually used as a confirmatory approach. That is, the users need to put forward different connectivity models and then compare them based on their likelihoods evaluated under DCM [17].
Compared with GC, DCM provides a more comprehensive characterization of the dynamic interactions between multiple regions. In [73], Friston et al. pointed out that GC and DCM are complementary to each other: GC models the causal dependency among the observed responses, while DCM models the causal interactions among the hidden neurostates. On the other hand, in [74], Friston provided an example showing that DCM and GC may generate different results for the same dataset. The underlying reason is that DCM takes into account both the external input and the biological variations of the hemodynamic response, which are not involved in GC.

To this end, it can be seen that the relationships between GC and DCM, and between GC and DI, have been investigated in the literature. While GC is efficient in detecting linear causal relationships, both DI and DCM can be used to characterize more general causal relationships. A missing link here is: what is the relationship between DCM and DI? In this chapter, we aim to fill this gap and explore the connection between DCM and DI. First, we revisit the discrete DCM (DDCM), and demonstrate the relationship between DDCM and the conventional continuous time DCM. Second, we show that under certain conditions, DDCM and DI are equivalent in characterizing the causal relationship between two brain regions. Recall that, traditionally, the accuracy of DI estimation rests on the accuracy of probability or statistics estimation, and hence requires the data length to be sufficiently long. This equivalence between DDCM and DI, in fact, also provides a simple but effective method for DI estimation under limited data length. More specifically, the major contributions of this chapter can be summarized as follows:

1. We demonstrate the relationship between DDCM and the continuous time DCM, and confirm the validity of DDCM. In DCM, the neural dynamics within the brain are described using a differential equation. Conventionally, DDCM is obtained from DCM in two steps: first, sampling the continuous DCM and approximating the differential equation with a difference equation; second, modeling the hemodynamic response as an LTI system and characterizing it with a convolution. In this chapter, rather than using approximation, we prove that when the input to the neural dynamic system is a constant, DDCM can be strictly derived from DCM in the noise-free case. Our result further demonstrates the validity and accuracy of the DDCM model.

2. We reveal the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. More specifically, assuming that the dynamical neural system is causal, that the neurostate and the noise at each region are normally distributed, and that the external input is a constant, we show that DDCM and DI are equivalent in characterizing the causal relationship between two brain regions. We also show that when the hemodynamic system is invertible, the DI-based causal relationship between the neurostates of two brain regions is the same as that between the observed BOLD signals. Finally, this equivalence between DDCM and DI provides a simple method for DI estimation.

3. We illustrate the similarities and differences between DDCM and GC. Although GC and DDCM share a similar mathematical form, DDCM determines the causality based on the values of the connectivity coefficients, while GC determines the causality by comparing the prediction errors.
In addition, both the external input and the biological variations of the hemodynamic response are taken into account in DDCM, but are not involved in GC.

4. We validate the theoretical results with fMRI data obtained under both the resting state and a stimulus based state. The numerical results show that both DDCM and DI can capture the causal relationships between the primary visual cortex (V1) and the parahippocampal place area (PPA), and between V1 and the sensorimotor cortex (SMC). As expected, in the stimulus based state, V1 shows significant causal influence over PPA and SMC; in the resting state, no clear causal relationship can be observed. Our results are consistent with those reported in previous study [27].

The rest of this chapter is organized as follows. In Section II, we revisit DDCM and demonstrate its relationship with the continuous time DCM. In Section III, we prove the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. The similarities and differences between DDCM and GC are illustrated in Section IV. We present the numerical results in Section V, and conclude in Section VI.

4.2 Discrete Dynamic Causal Modeling

In continuous time DCM, the invisible neurostates of $d$ brain regions are denoted by a vector $X = [X_1, X_2, \ldots, X_d]^t$, where each $X_i$, $i = 1, 2, \ldots, d$, represents the neurostate of the $i$th region. The basic idea of DCM is that the neurostate $X$, the external input $U$, the connectivity matrices $\tilde{A}$ and $\tilde{B}$ that describe the connections among brain regions, the observed BOLD signal $Y$, and the independent noise can be formulated as a complex dynamic system, characterized as:

$$\dot{X}(t) = \tilde{A}X(t) + \tilde{B}U(t) + \Omega_1(t), \qquad (4.4)$$
$$Y(t) = \tilde{\Lambda}(X(t)) + \Omega_2(t), \qquad (4.5)$$

where $\Omega_1(t)$ and $\Omega_2(t)$ are the state noise and observation noise, and $\tilde{\Lambda}$ represents the mapping from the neurostate $X(t)$ to the observed BOLD signal $Y(t)$. The functional connectivity in DCM is mainly characterized by the matrix $\tilde{A}$ [8]. For instance, in a model with two brain regions, that is, $d = 2$ and $X = [X_1, X_2]^t$, the connectivity matrix $\tilde{A}$ will be:

$$\tilde{A} = \begin{bmatrix} \tilde{A}_{11} & \tilde{A}_{12} \\ \tilde{A}_{21} & \tilde{A}_{22} \end{bmatrix}. \qquad (4.6)$$

Here, for $i, j = 1, 2$, $\tilde{A}_{ii}$ measures the influence of the past values of $X_i$ on its own future values, and $\tilde{A}_{ji}$ measures the influence of the past values of $X_i$ on the future values of $X_j$. The absolute values of $\tilde{A}_{12}$ and $\tilde{A}_{21}$ describe the causal relationship between the two regions: when $|\tilde{A}_{12}| > |\tilde{A}_{21}|$, $X_2$ has imposed more influence on $X_1$; and when $|\tilde{A}_{21}| > |\tilde{A}_{12}|$, $X_1$ has imposed more influence on $X_2$ [17, 75].

It can be seen from equations (4.4) and (4.5) that the continuous time DCM characterizes the dynamic neural system using two continuous-time equations. However, parameter estimation in continuous time equations faces considerable challenges in practical applications. To overcome this difficulty, there have been efforts to simplify DCM into a more tractable form, such as the switching linear dynamical system (SLDS) [75] and the multivariate dynamical system (MDS) [17]. In both approaches, the continuous time equations are discretized, and the mapping between the neurostate and the BOLD signal is approximated as an LTI system, characterized using a convolution.
That is, the discrete DCM (DDCM) model can be obtained as:

$$X(k+1) = AX(k) + BU(k) + \Omega_1(k), \quad (4.7)$$
$$Y(k) = \sum_{m=0}^{M} \Lambda(m) X(k-m) + \Omega_2(k), \quad (4.8)$$

where $A$ is the connectivity matrix, $\{\Lambda(m), m = 0, 1, \cdots, M\}$ denotes the convolution coefficients corresponding to the hemodynamic response, and $\Omega_1(k)$ and $\Omega_2(k)$ denote the noise terms independent of the brain state and the input.

Consider the case of two regions, region 1 and region 2, where equation (4.7) can be rewritten as:

$$\begin{bmatrix} X_1(k+1) \\ X_2(k+1) \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}\begin{bmatrix} X_1(k) \\ X_2(k) \end{bmatrix} + \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}U(k) + \begin{bmatrix} \Omega_{11}(k) \\ \Omega_{12}(k) \end{bmatrix}. \quad (4.9)$$

Similar to the continuous time DCM, the coefficients $A_{12}$ and $A_{21}$ actually measure the causal relationship between region 1 and region 2. More specifically, if $|A_{21}| > |A_{12}|$, then $X_1$ is more likely to be the causal side, and vice versa. The same analysis holds when multiple brain regions are under investigation [17, 75].

As can be seen, equation (4.7) is obtained by approximating the continuous time differential equation in DCM with a difference equation. Here, we will show that, in the noise-free case, when the external input is a constant, DDCM can be directly derived from DCM. For this purpose, we first prove the following lemma.

Lemma 1. For a differential equation $\dot{X}(t) = \tilde{A}X(t) + \tilde{B}U(t)$, where $\tilde{A}$ and $\tilde{B}$ are deterministic matrices, the solution to the equation is:

$$X(t) = e^{\tilde{A}t}X(0) + \int_0^t e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau. \quad (4.10)$$

Proof. Taking the derivative on both sides of equation (4.10), the left hand side of the equation would be $\dot{X}(t)$. Given that $U(t)$ is causal, the right hand side would be:

$$\begin{aligned}
&\frac{d}{dt}\Big[e^{\tilde{A}t}X(0) + \int_0^{\infty} u(t-\tau)e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau\Big] \\
&= \tilde{A}e^{\tilde{A}t}X(0) + \int_0^{\infty} u(t-\tau)\frac{d}{dt}\big[e^{\tilde{A}(t-\tau)}\big]\tilde{B}U(\tau)\,d\tau + \int_0^{\infty} \frac{d}{dt}\big[u(t-\tau)\big]e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau \\
&= \tilde{A}e^{\tilde{A}t}X(0) + \int_0^{\infty} u(t-\tau)\tilde{A}e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau + \int_0^{\infty} \delta(t-\tau)e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau \\
&= \tilde{A}e^{\tilde{A}t}X(0) + \tilde{A}\int_0^{t} e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau + e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\Big|_{\tau=t} \\
&= \tilde{A}\Big[e^{\tilde{A}t}X(0) + \int_0^{t} e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau\Big] + \tilde{B}U(t) \\
&= \tilde{A}X(t) + \tilde{B}U(t), \quad (4.11)
\end{aligned}$$

in which $u(t)$ is the unit step function, and $\delta(t)$ the Dirac delta function. That means $X(t) = e^{\tilde{A}t}X(0) + \int_0^t e^{\tilde{A}(t-\tau)}\tilde{B}U(\tau)\,d\tau$ is the solution to the equation $\dot{X}(t) = \tilde{A}X(t) + \tilde{B}U(t)$.

To illustrate the relationship between DCM and DDCM, assume there is no background noise, and the external input is a constant, i.e., $U(t) = U$. In this case, sampling the neurostate in equation (4.10) at $t = kT$, where $k = 0, 1, 2, \ldots$ and $T$ is the sampling period, we get:

$$X(kT) = e^{\tilde{A}kT}X(0) + \int_0^{kT} e^{\tilde{A}(kT-\tau)}\tilde{B}U\,d\tau. \quad (4.12)$$

Accordingly, the neurostate at time $t = (k+1)T$ is

$$X((k+1)T) = e^{\tilde{A}(k+1)T}X(0) + \int_0^{(k+1)T} e^{\tilde{A}((k+1)T-\tau)}\tilde{B}U\,d\tau. \quad (4.13)$$

Subtracting $e^{\tilde{A}T}X(kT)$ from $X((k+1)T)$, we have:

$$\begin{aligned}
X((k+1)T) - e^{\tilde{A}T}X(kT) &= \int_0^{(k+1)T} e^{\tilde{A}((k+1)T-\tau)}\tilde{B}U\,d\tau - \int_0^{kT} e^{\tilde{A}((k+1)T-\tau)}\tilde{B}U\,d\tau \\
&= \int_{kT}^{(k+1)T} e^{\tilde{A}((k+1)T-\tau)}\tilde{B}U\,d\tau \\
&= \tilde{A}^{-1}\big(e^{\tilde{A}T} - I\big)\tilde{B}U. \quad (4.14)
\end{aligned}$$

Let $A = e^{\tilde{A}T}$ and $B = \tilde{A}^{-1}(e^{\tilde{A}T} - I)\tilde{B}$; we can then rewrite equation (4.14) as:

$$X((k+1)T) = AX(kT) + BU. \quad (4.15)$$

When the sampling period $T$ is fixed, we can simplify equation (4.15) as:

$$X(k+1) = AX(k) + BU. \quad (4.16)$$

This confirms the validity of the DDCM model, and the relationship between DCM and DDCM.

In the following sections, we will investigate the relationship of DDCM with the DI-based causality analysis framework and the Granger Causality analysis.
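The derivation above reduces to two matrix computations. The following Python sketch (a minimal illustration, not part of the original dissertation; the example matrices and sampling period are hypothetical) maps the continuous time parameters $(\tilde{A}, \tilde{B})$ to the DDCM parameters $(A, B)$ of equation (4.16), assuming $\tilde{A}$ is invertible, a constant input, and the noise-free case:

```python
import numpy as np
from scipy.linalg import expm

def discretize_dcm(A_tilde, B_tilde, T):
    """Exact DCM-to-DDCM mapping of equations (4.15)-(4.16):
    A = e^{A_tilde * T},  B = A_tilde^{-1} (e^{A_tilde * T} - I) B_tilde.
    Valid for a constant input U and no background noise."""
    A = expm(A_tilde * T)
    I = np.eye(A_tilde.shape[0])
    B = np.linalg.solve(A_tilde, (A - I) @ B_tilde)  # A_tilde^{-1}(A - I)B_tilde
    return A, B

# Hypothetical two-region example, sampled at one TR (T = 2.5 s):
A_tilde = np.array([[-1.0, 0.1],
                    [0.6, -1.0]])   # region 1 drives region 2 (|A21| > |A12|)
B_tilde = np.array([[0.5], [0.0]])  # the input enters through region 1 only
A, B = discretize_dcm(A_tilde, B_tilde, T=2.5)
```

Note that for a small sampling period $T$, $A \approx I + \tilde{A}T$, so the ordering of the off-diagonal magnitudes of $\tilde{A}$ is inherited by $A$ to first order.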
4.3 The Relationship between DDCM and Directed Information

In this section, we show that, under certain assumptions, DI and DDCM are equivalent in characterizing the causal relationship between different brain regions.

4.3.1 Directed Information based Causality Analysis

Directed Information is a causality analysis framework based on information theory. It was first introduced by Massey to study the capacity of a communication channel with feedback [24]. The amplitude of DI has a clear physical meaning: it reflects the information flow from one time series $X_1^n$ to another, $X_2^n$. In [29], it was pointed out that directed information theory provides an effective and adequate framework for connectivity inference problems in neuroscience applications, as Granger Causality graphs can be derived using DI.

The directed information from one time series $X_1^n$ to another $X_2^n$ is calculated as [24]:

$$I(X_1^n \to X_2^n) = \sum_{k=1}^{n}\big[h(X_2(k) \mid X_2^{k-1}) - h(X_2(k) \mid X_2^{k-1}, X_1^k)\big], \quad (4.17)$$

where $X_i^k = [X_i(1), X_i(2), \ldots, X_i(k)]$, $i = 1, 2$, and $h$ denotes the differential entropy operator. If $I(X_1^n \to X_2^n)$ is greater than $I(X_2^n \to X_1^n)$, we say $X_1^n$ has more causal influence over $X_2^n$; otherwise, $X_2^n$ has more causal influence over $X_1^n$.

Practical evaluation of directed information measures has been a challenging problem. Over the last two decades, a limited number of directed information estimators have been proposed [3, 22, 30]. In the following subsections, we will derive the theoretical form of DI under certain assumptions, and investigate its relationship with DDCM.

4.3.2 DDCM and Directed Information

When deriving the relationship between DDCM and DI, we impose the following assumptions to make the problem more tractable: (i) The dynamic neural system under investigation is a causal system, which means that for each brain region, the current value of the neurostate depends only on the previous values of the neurostates of the region and its related regions. (ii) For each region, both the neurostate and the background noise are normally distributed, and the variances are the same in related brain regions. More specifically, for each $k = 1, 2, \cdots, n$, the variances corresponding to the neurostate and the background noise are $\sigma_x^2$ and $\sigma_0^2$, respectively. (iii) The external input $U$ is a constant. This assumption is reasonable when the changing rate of the external input is much slower than that of the neurostates.

In the following analysis, let uppercase letters ($X$, $Y$, ...) denote random variables, and lowercase letters ($x$, $y$, ...) the possible values they can take. In particular, $x_1(k)$ and $x_2(k)$ denote the possible values $X_1(k)$ and $X_2(k)$ can take, and $\omega_{11}(k)$ and $\omega_{12}(k)$ denote the possible values $\Omega_{11}(k)$ and $\Omega_{12}(k)$ can take. Given a time series $X^n = [X(1), X(2), \ldots, X(n)]$, $n \in \mathbb{N}$, for any $x(k)$, $k = 1, 2, \ldots, n$, $P(x(k))$ denotes the probability for $X(k)$ to take the value $x(k)$, and $P(x(k) \mid x^{k-1})$ the conditional probability that the current sample $X(k)$ is $x(k)$, given that the previously observed sequence is $x^{k-1} = [x(1), x(2), \ldots, x(k-1)]$.

Following equation (4.9), the conditional probability $P(x_2(k) \mid x_2^{k-1}, x_1^k)$ can be written as:

$$\begin{aligned}
P(x_2(k) \mid x_2^{k-1}, x_1^k) &= P(A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1) \mid x_2^{k-1}, x_1^k) \\
&= P(A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1) \mid x_2^{k-1}, x_1^{k-1}) \\
&= P(\omega_{12}(k-1)). \quad (4.18)
\end{aligned}$$
This implies that, for each $k = 1, 2, \cdots, n$, the conditional probability density function of the neurostate $X_2(k)$ given $X_2^{k-1}$ and $X_1^k$ is Gaussian with variance $\sigma_0^2$.

It is well known that given a Gaussian random variable $\Xi$ with variance $\sigma_\xi^2$, the corresponding differential entropy $h(\Xi)$ can be calculated as:

$$h(\Xi) = \frac{1}{2}\log 2\pi e\sigma_\xi^2. \quad (4.19)$$

Therefore, based on equations (4.18) and (4.19), the differential entropy corresponding to the neurostate $X_2(k)$ given $X_2^{k-1}$ and $X_1^k$ can be calculated as:

$$h(X_2(k) \mid X_2^{k-1}, X_1^k) = \frac{1}{2}\log 2\pi e\sigma_0^2. \quad (4.20)$$

Similarly, the conditional probability $P(x_2(k) \mid x_2^{k-1})$ can be simplified as:

$$\begin{aligned}
P(x_2(k) \mid x_2^{k-1}) &= P(A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1) \mid x_2^{k-1}) \\
&= P(A_{21}x_1(k-1) + \omega_{12}(k-1)). \quad (4.21)
\end{aligned}$$

As a result, the corresponding differential entropy will be:

$$h(X_2(k) \mid X_2^{k-1}) = \frac{1}{2}\log 2\pi e(A_{21}^2\sigma_x^2 + \sigma_0^2), \quad (4.22)$$

where $\sigma_x^2$ is the variance of the neurostate, which is assumed to have no significant changes among related regions and within the observation frame. Based on equations (4.18) to (4.22), the directed information can then be obtained as:

$$\begin{aligned}
I(X_1^n \to X_2^n) &= \sum_{k=1}^{n}\Big[\frac{1}{2}\log 2\pi e(A_{21}^2\sigma_x^2 + \sigma_0^2) - \frac{1}{2}\log 2\pi e\sigma_0^2\Big] \\
&= \frac{n}{2}\log\Big(1 + A_{21}^2\frac{\sigma_x^2}{\sigma_0^2}\Big) \\
&= \frac{n}{2}\log(1 + cA_{21}^2), \quad (4.23)
\end{aligned}$$

where $c = \sigma_x^2/\sigma_0^2$ is the ratio of the power of the neural activities to the noise power. Similarly, we can prove that $I(X_2^n \to X_1^n) = \frac{n}{2}\log(1 + cA_{12}^2)$.

It can be seen from equation (4.23) that after the parameters in DDCM have been obtained, the directed information from one region to another can be calculated accordingly. That is, equation (4.23) provides an effective method for the estimation of DI. Note that when $c > 0$, $\log(1 + cx^2)$ is a monotonically increasing function of $|x|$. Based on the discussions above, we can obtain the following proposition:

Proposition 4.1. If $|A_{21}| > |A_{12}|$, then $I(X_1^n \to X_2^n) > I(X_2^n \to X_1^n)$, that is, region 1 is more likely to be the causal side; otherwise, we will have $I(X_2^n \to X_1^n) > I(X_1^n \to X_2^n)$, and region 2 is more likely to be the causal side.

Proposition 4.1 is in accordance with the previous analysis for DDCM: the causality analysis can be carried out based on the absolute values of the coefficients of the connectivity matrix. This means that, for a causal dynamic neural system with a constant external input, when the neurostate and the background noise are normally distributed, DI and DDCM are equivalent in characterizing the causal relationship between two brain regions.

On the other hand, in practice, we can only observe the BOLD signal $Y^n$ rather than the neurostate $X^n$. That is, given two brain regions, region 1 and region 2, the calculation of DI can only be carried out on the observations $Y_1^n$ and $Y_2^n$ rather than the neurostates $X_1^n$ and $X_2^n$. The natural questions are: will the causal relationship obtained from the neurostates of two brain regions be maintained in the observed BOLD signals? And is DDCM still equivalent to DI in characterizing the causal relationship between two brain regions when only the observed BOLD signals are available? In the following, we will prove that, in the observation noise-free case, as long as the hemodynamic system is invertible, the DI calculated using the estimated neurostates is equal to the DI calculated using the observed signals.
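The closed form in equation (4.23) makes DI estimation a one-line computation once the DDCM coefficients are available. The sketch below is an illustration under assumptions (i)-(iii) above; the coefficient values are hypothetical:

```python
import numpy as np

def di_from_ddcm(A12, A21, c, n):
    """Closed-form DI of equation (4.23) and its reverse counterpart:
    I(X1 -> X2) = (n/2) log(1 + c*A21^2),  I(X2 -> X1) = (n/2) log(1 + c*A12^2),
    where c = sigma_x^2 / sigma_0^2 is the neural-activity-to-noise power ratio."""
    di_12 = 0.5 * n * np.log(1.0 + c * A21 ** 2)
    di_21 = 0.5 * n * np.log(1.0 + c * A12 ** 2)
    return di_12, di_21

# Hypothetical coefficients with |A21| > |A12|: region 1 is the likely causal side.
di_12, di_21 = di_from_ddcm(A12=0.1, A21=0.6, c=4.0, n=164)
assert di_12 > di_21   # consistent with Proposition 4.1
```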
Following equation (4.8), for each $k = 1, 2, \cdots, n$, the values of the observed BOLD signals of region $l$, $l = 1, 2$, can be written as

$$y_l(k) = \sum_{m=0}^{M}\Lambda_l(m)x_l(k-m) + \Omega_{2l}(k). \quad (4.24)$$

When there is no observation noise, equation (4.24) can be rewritten in matrix form:

$$\begin{bmatrix} y_l(1) \\ y_l(2) \\ y_l(3) \\ \vdots \\ y_l(k) \end{bmatrix} = \begin{bmatrix} \Lambda_l(0) & 0 & 0 & \cdots & 0 \\ \Lambda_l(1) & \Lambda_l(0) & 0 & \cdots & 0 \\ \Lambda_l(2) & \Lambda_l(1) & \Lambda_l(0) & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \Lambda_l(0) \end{bmatrix}\begin{bmatrix} x_l(1) \\ x_l(2) \\ x_l(3) \\ \vdots \\ x_l(k) \end{bmatrix}. \quad (4.25)$$

That is, we have

$$\mathbf{y}_l^k = \Lambda_{l,k}\mathbf{x}_l^k, \quad k = 1, 2, \cdots, n, \; l = 1, 2, \quad (4.26)$$

where $\mathbf{y}_l^k = [y_l(1), y_l(2), \cdots, y_l(k)]^T$, $\mathbf{x}_l^k = [x_l(1), x_l(2), \cdots, x_l(k)]^T$, and $\Lambda_{l,k}$ is the $k \times k$ matrix shown in equation (4.25). It can be seen from equation (4.26) that the determinant of the matrix $\Lambda_{l,k}$ equals $|\Lambda_l(0)|^k$. Therefore, if $\Lambda_l(0) \neq 0$, then $\Lambda_{l,k}$ is invertible. Hence for $k = 1, 2, \cdots, n$ and $l = 1, 2$, the values of the neurostate $\mathbf{x}_l^k$ can be recovered from the observed BOLD signal $\mathbf{y}_l^k$ through $\mathbf{x}_l^k = \Lambda_{l,k}^{-1}\mathbf{y}_l^k$.

As a result, the conditional probability $P(y_2(k) \mid y_2^{k-1}, y_1^k)$ can be written as:

$$\begin{aligned}
P(y_2(k) \mid y_2^{k-1}, y_1^k) &= P(y_2(k) \mid \Lambda_{2,k-1}x_2^{k-1}, \Lambda_{1,k}x_1^k) \\
&= P(y_2(k) \mid x_2^{k-1}, x_1^k) \\
&= P\Big(\sum_{m=0}^{M}\Lambda_2(m)x_2(k-m) \,\Big|\, x_2^{k-1}, x_1^k\Big) \\
&= P(\Lambda_2(0)x_2(k) \mid x_2^{k-1}, x_1^k) \\
&= P(\Lambda_2(0)[A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1)] \mid x_2^{k-1}, x_1^k) \\
&= P(\Lambda_2(0)[A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1)] \mid x_2^{k-1}, x_1^{k-1}) \\
&= P(\Lambda_2(0)\,\omega_{12}(k-1)). \quad (4.27)
\end{aligned}$$

This implies that, for each $k = 1, 2, \cdots, n$, the conditional probability density function of $Y_2(k)$ given $Y_2^{k-1}$ and $Y_1^k$ is Gaussian with variance $\Lambda_2^2(0)\sigma_0^2$. Therefore, following equation (4.19), the differential entropy corresponding to $Y_2(k)$ given $Y_2^{k-1}$ and $Y_1^k$ can be calculated as:

$$h(Y_2(k) \mid Y_2^{k-1}, Y_1^k) = \frac{1}{2}\log 2\pi e\Lambda_2^2(0)\sigma_0^2. \quad (4.28)$$

Similarly, the conditional probability $P(y_2(k) \mid y_2^{k-1})$ can be simplified as:

$$\begin{aligned}
P(y_2(k) \mid y_2^{k-1}) &= P(\Lambda_2(0)[A_{21}x_1(k-1) + A_{22}x_2(k-1) + B_2U + \omega_{12}(k-1)] \mid \Lambda_{2,k-1}x_2^{k-1}) \\
&= P(\Lambda_2(0)[A_{21}x_1(k-1) + \omega_{12}(k-1)]). \quad (4.29)
\end{aligned}$$

As a result, the corresponding differential entropy will be

$$h(Y_2(k) \mid Y_2^{k-1}) = \frac{1}{2}\log 2\pi e\Lambda_2^2(0)(A_{21}^2\sigma_x^2 + \sigma_0^2), \quad (4.30)$$

where $\sigma_x^2$ is the variance of the neurostate, which is assumed to have no significant differences among related regions. Based on equations (4.28) and (4.30), we have:

$$\begin{aligned}
I(Y_1^n \to Y_2^n) &= \sum_{k=1}^{n}\Big[\frac{1}{2}\log 2\pi e\Lambda_2^2(0)(A_{21}^2\sigma_x^2 + \sigma_0^2) - \frac{1}{2}\log 2\pi e\Lambda_2^2(0)\sigma_0^2\Big] \\
&= \frac{n}{2}\log\Big(1 + A_{21}^2\frac{\sigma_x^2}{\sigma_0^2}\Big) \\
&= \frac{n}{2}\log(1 + cA_{21}^2), \quad (4.31)
\end{aligned}$$

where $c = \sigma_x^2/\sigma_0^2$ is the ratio of the power of the neural activities to the noise power. Similarly, we can prove that $I(Y_2^n \to Y_1^n) = \frac{n}{2}\log(1 + cA_{12}^2)$.

Comparing equation (4.31) with equation (4.23), we can see that in the noise-free case, as long as the hemodynamic system is invertible, the DI obtained from the observed BOLD signals equals the DI obtained from the neurostates, that is, $I(Y_1^n \to Y_2^n) = I(X_1^n \to X_2^n)$ and $I(Y_2^n \to Y_1^n) = I(X_2^n \to X_1^n)$.

Proposition 4.2. If $|A_{21}| > |A_{12}|$, then $I(Y_1^n \to Y_2^n) > I(Y_2^n \to Y_1^n)$, that is, region 1 is more likely to be the causal side; otherwise, we will have $I(Y_2^n \to Y_1^n) > I(Y_1^n \to Y_2^n)$, and region 2 is more likely to be the causal side.

Proposition 4.2 implies that in the noise-free case, as long as the hemodynamic system is invertible, the DI-based causal relationship between the neurostates of two brain regions is the same as that between the observed BOLD signals, and DDCM and DI are still equivalent in characterizing the causal relationship between brain regions.
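Because $\Lambda_{l,k}$ in equation (4.25) is lower triangular with $\Lambda_l(0)$ on the diagonal, recovering the neurostates from noise-free BOLD observations is a forward substitution. A minimal sketch, assuming the convolution coefficients are known and $\Lambda_l(0) \neq 0$ (the kernel values here are hypothetical):

```python
import numpy as np
from scipy.linalg import toeplitz, solve_triangular

def recover_neurostate(y, lam):
    """Solve y^k = Lambda_{l,k} x^k (equation (4.26)) for the neurostate x^k.
    lam = [Lambda(0), ..., Lambda(M)] is the hemodynamic convolution kernel."""
    k = len(y)
    first_col = np.zeros(k)
    first_col[:min(len(lam), k)] = lam[:k]      # first column of Lambda_{l,k}
    L = toeplitz(first_col, np.zeros(k))        # lower-triangular Toeplitz matrix
    return solve_triangular(L, y, lower=True)   # forward substitution

# Hypothetical kernel; sanity check: convolve a state, then invert.
lam = np.array([1.0, 0.7, 0.3])
x = np.random.randn(50)
y = toeplitz(np.r_[lam, np.zeros(47)], np.zeros(50)) @ x
assert np.allclose(recover_neurostate(y, lam), x)
```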
4.3.3 Discussions

In the previous subsection, we proved that, under certain assumptions, DI and DDCM are equivalent to each other in characterizing the causal relationship between two brain regions. However, this equivalence may not generalize to the multi-region DDCM. The reason is that when more than two regions are involved in DDCM, the causal relationship between two regions should take their interactions with the other regions into account.

When multiple brain regions are under investigation, equation (4.7) becomes

$$\begin{bmatrix} X_1(k+1) \\ \vdots \\ X_N(k+1) \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{N1} & \cdots & A_{NN} \end{bmatrix}\begin{bmatrix} X_1(k) \\ \vdots \\ X_N(k) \end{bmatrix} + \begin{bmatrix} B_1 \\ \vdots \\ B_N \end{bmatrix}U + \begin{bmatrix} \Omega_{11}(k) \\ \vdots \\ \Omega_{1N}(k) \end{bmatrix}, \quad (4.32)$$

where $N$ is the number of regions. For any two different regions, region $i$ and region $j$, $i, j = 1, 2, \cdots, N$, under the same assumptions as in Section III.B, the conditional probability $P(x_j(k) \mid x_j^{k-1}, x_i^k)$ can be written as:

$$\begin{aligned}
P(x_j(k) \mid x_j^{k-1}, x_i^k) &= P\Big(\sum_{l=1}^{N}A_{jl}x_l(k-1) + B_jU + \omega_{1j}(k-1) \,\Big|\, x_j^{k-1}, x_i^k\Big) \\
&= P\Big(\sum_{l\neq i,j}A_{jl}x_l(k-1) + \omega_{1j}(k-1)\Big). \quad (4.33)
\end{aligned}$$

Similarly, the conditional probability $P(x_j(k) \mid x_j^{k-1})$ can be simplified as:

$$\begin{aligned}
P(x_j(k) \mid x_j^{k-1}) &= P\Big(\sum_{l=1}^{N}A_{jl}x_l(k-1) + B_jU + \omega_{1j}(k-1) \,\Big|\, x_j^{k-1}\Big) \\
&= P\Big(\sum_{l\neq j}A_{jl}x_l(k-1) + \omega_{1j}(k-1)\Big). \quad (4.34)
\end{aligned}$$

It can be seen from equations (4.33) and (4.34) that for each $k = 1, 2, \cdots, n$, the terms within the sum $\sum_{l\neq i,j}A_{jl}x_l(k-1)$ may be correlated with each other. As a result, DI cannot be derived with the same approach shown in equations (4.18) to (4.22). Therefore, unlike the two-region case, when multiple regions are involved, whether the causal relationships characterized by DDCM and DI are equivalent is still an open problem.

4.4 The Relationship between DDCM and Granger Causality

Granger Causality (GC) is the first practical causality analysis framework for time series. It can be applied directly to any given time series to detect the coupling among brain regions. The main idea is that if two signals $X_1$ and $X_2$ form a causal relationship, then, instead of using the past values of $X_2$ alone, the information contained in the past values of $X_1$ will help to predict $X_2$.

More specifically, the calculation of Granger Causality is based on linear prediction models. Suppose $X_1^n = [X_1(1), X_1(2), \ldots, X_1(n)]$ and $X_2^n = [X_2(1), X_2(2), \ldots, X_2(n)]$ are two time series observed from two brain regions. Granger Causality compares the prediction errors $e_r$ and $\tilde{e}_r$ in the following equations:

$$X_2(k+1) = \sum_{l=0}^{L-1}a_lX_2(k-l) + e_r, \quad (4.35)$$
$$X_2(k+1) = \sum_{l=0}^{L-1}\big[b_lX_1(k-l) + c_lX_2(k-l)\big] + \tilde{e}_r, \quad (4.36)$$

for $k = 1, 2, \ldots, n$. Here, $e_r$ is the error of predicting $X_2$ based only on the previous values of $X_2$, and $\tilde{e}_r$ is the error of predicting $X_2$ based on both the previous values of $X_2$ and the previous values of $X_1$. If $\tilde{e}_r$ is much smaller than $e_r$, that is, if the introduction of the previous values of $X_1$ improves the prediction accuracy, then we say there is a Granger causal relationship between $X_1$ and $X_2$. In practical analysis, GC can be tested using a nested model comparison based on the F statistic [49].

Since the 1990s, there has been growing interest in applying Granger Causality analysis to identify causal interactions in neuroscience [8]. GC has been successfully applied to fMRI data, EEG measurements, as well as neural level signals [9].
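In practice, the comparison of $e_r$ and $\tilde{e}_r$ in equations (4.35) and (4.36) reduces to fitting two least-squares autoregressions. The following sketch is a minimal illustration, not the implementation used in this dissertation; it returns the log ratio of the residual variances of the restricted and full models:

```python
import numpy as np

def granger_causality(x1, x2, L=3):
    """Compare the prediction errors of equations (4.35)-(4.36) by least squares.
    Returns log(var(e_r)/var(e_r_tilde)); values well above zero indicate that
    the past of x1 helps predict x2, i.e., a Granger causal influence of x1 on x2."""
    n = len(x2)
    # Regressor rows: the L most recent past values at each prediction time.
    past2 = np.array([x2[k - L:k][::-1] for k in range(L, n)])
    past1 = np.array([x1[k - L:k][::-1] for k in range(L, n)])
    target = x2[L:]
    # Restricted model: past of x2 only; full model: past of x2 and of x1.
    Xr, Xf = past2, np.hstack([past2, past1])
    er = target - Xr @ np.linalg.lstsq(Xr, target, rcond=None)[0]
    ef = target - Xf @ np.linalg.lstsq(Xf, target, rcond=None)[0]
    return np.log(np.var(er) / np.var(ef))
```

An F test on the two residual sums of squares, as in the nested model comparison of [49], can then decide whether the variance reduction is statistically significant.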
In these pioneering works, the validity and computational simplicity of Granger Causality have been widely recognized.

To show the differences between GC and DDCM, rewrite equations (4.35) and (4.36) as follows:

$$\begin{bmatrix} X_2(k+1) \\ X_2(k+1) \end{bmatrix} = \sum_{l=0}^{L-1}\begin{bmatrix} 0 & a_l \\ b_l & c_l \end{bmatrix}\begin{bmatrix} X_1(k-l) \\ X_2(k-l) \end{bmatrix} + \begin{bmatrix} e_r \\ \tilde{e}_r \end{bmatrix}. \quad (4.37)$$

Comparing equation (4.37) with equations (4.7) and (4.9), we can see that, as in the state equation of DDCM, GC applies an auto-regression model to the observed sequences. However, there are several major differences between GC and DDCM. First, GC uses the regression error terms rather than the coefficient matrices to determine the causal relationship between the two sequences. This means that the causality measures used in GC and DDCM are completely different. Second, in [74], Friston provided an example to show that DCM and GC could generate different results given the same dataset. The underlying argument is that both the external input $U$ and the biological variations of the hemodynamic response are taken into account in DDCM but are not involved in GC. Finally, it has also been noticed in [11] that because GC relies on the linear prediction method, when there exist instantaneous and/or strong nonlinear interactions between two regions, GC analysis may lead to invalid results.

4.5 Numerical Analysis

In this section, we describe how to validate the equivalence of DDCM and DI between two regions using experimental fMRI data obtained under both the resting state and a stimulus based state.

4.5.1 Data Acquisition

Fourteen right-handed healthy college students (7 males, 23.4 ± 4.2 years of age) from Michigan State University volunteered to participate in this study and signed consent forms approved by the Michigan State University Institutional Review Board. The experiment was conducted on a 3T GE Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. For each subject, fMRI datasets were collected first on a visual stimulation condition with a scene-object fMRI paradigm and then on a resting-state condition. The parameters for the fMRI scan were: gradient-echo EPI, 36 contiguous 3-mm axial slices in an interleaved order, time of echo (TE) = 27.7 ms, time of repetition (TR) = 2500 ms, flip angle = 80°, field of view (FOV) = 22 cm, matrix size = 64 × 64, ramp sampling, and with the first four data points discarded.

On the visual stimulation fMRI condition, each volume of images was acquired 192 times (8 min) while each subject was presented with 12 blocks of visual stimulation after an initial 10 s "resting" period. In a predefined randomized order, the scenery pictures were presented in 6 blocks and the object pictures were presented in the other 6 blocks. All pictures were unique. In each block, 10 pictures were presented continuously for 25 s (2.5 s for each picture), followed by a 15 s baseline condition (a white screen with a black fixation cross at the center). The subject needed to press his/her right index finger once when the screen switched from the baseline to the picture condition. Stimuli were displayed in color in full screen on a 1024 × 768 32-inch LCD monitor (Salvagione Design, Sausalito, CA) placed at the back of the magnet room. The LCD subtended 10.2° × 13.1° of visual angle. On the resting-state fMRI (rs-fMRI) condition, each volume of images was acquired 164 times (6 min and 50 s) after the subject was informed to relax, keep his/her eyes closed, and stay awake throughout the scan.
After the above functional data acquisition, high-resolution volumetric T1-weighted spoiled gradient-recalled (SPGR) images with cerebrospinal fluid suppressed were obtained to cover the whole brain with 120 1.5-mm sagittal slices, 8° flip angle and 24 cm FOV. These images were used to identify anatomical locations.

4.5.2 fMRI Data Pre-processing and Analysis

All stimulus fMRI data pre-processing and analysis for each subject were conducted with the AFNI software (Cox, 1996) as described in Henderson et al. [61]. Essentially, slice-timing correction and rigid-body motion correction were conducted. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. Multiple linear regressions (using the "3dDeconvolve" routine in AFNI) were applied on a voxel-wise basis to find the magnitude change when each picture condition was presented, followed by general linear tests to find the statistical significances between stimulus conditions.

The regions of interest (ROIs) in this study were defined in the Talairach coordinate space [67]. Regions showing preferential activation to scenes over objects (voxel-based $p$-value $< 10^{-4}$) in the right and left parahippocampal gyri were defined as the right and left PPA [61]. The right and left V1 ROIs were defined as the regions activated by pictures (voxel-based $p$-value $< 10^{-10}$) within Brodmann area 17. Because there was a high level of activation at and around V1, a highly conservative $p$-value threshold was chosen to define relatively focal ROIs. The right and left SMC spherical ROIs with 6 mm radius were defined with the centers at (R36, P22, S54) and (L38, P26, S50), respectively, in the Talairach coordinate space (R = Right, L = Left, P = Posterior, S = Superior). The SMC coordinate locations were defined by Witt et al. [68] and the ROIs were created as in Zhu et al. [69]. The time courses from the stimulation fMRI dataset that were already pre-processed as above were also detrended and had their baselines removed. The spatially averaged time course at each of the above ROIs was generated for the causality analyses discussed later.

The rs-fMRI pre-processing was also performed in AFNI [70], as commonly applied in the field and as described in detail in Zhu et al. [69]. Essentially, slice-timing correction and rigid-body motion correction were carried out. Spatial blurring with a full width at half maximum of 4 mm was applied to reduce random noise. The time courses were detrended and the baselines were removed. Brain global, cerebrospinal fluid and white-matter mean signals were modeled as nuisance variables and removed from the time courses. Finally, the time courses were band-pass filtered to the range of 0.009 Hz – 0.08 Hz. The spatially averaged time course at each of the above ROIs was generated for the causality analyses discussed later.

4.5.3 Result

In this subsection, we validate the equivalence of DDCM and DI between two regions using experimental fMRI data obtained under both the resting state and the stimulus based state. Recall that in the scene-object paradigm, subjects viewed blocks of scenery and object pictures. They were asked to press a button with the right index finger when they saw a block of pictures. We test the robustness of our causality analysis techniques against some expected outcomes: under the stimulation fMRI paradigm, the primary visual cortex (V1) and nearby regions are activated first, followed by activation in the parahippocampal place area (PPA) for higher level scene processing.
Some relatively small activation in the left sensorimotor cortex (SMC) is also expected following the V1 activations. Under the resting-state condition, neuronal activity is not expected to occur in a sequential manner among the above regions.

[Figure 4.1: Estimation results of DDCM with the experimental fMRI data. The four panels show the connection coefficients $|A_{12}|$ and $|A_{21}|$ between V1 and PPA, and between V1 and SMC, under the resting and stimulus based states.]

The estimation results for V1 and PPA under both the resting and stimulus based states are shown in Figures 4.1(a) and 4.1(b). It can be seen that under the resting state, V1 does not exhibit a dominating influence over PPA. However, under the stimulus based state, $|A_{21}|$ is increased considerably compared to $|A_{12}|$. In other words, V1 shows stronger influence over PPA, as expected. Figures 4.1(c) and 4.1(d) show a similar pattern for the regions V1 and SMC.

In our previous study [27], we carried out DI based causality analysis on the same dataset. The corresponding result is shown in Figure 4.2, where the causal relationship between two brain regions is measured with the metric $\gamma$. Given two time series $X_1^n$ and $X_2^n$, $\gamma$ is calculated as

$$\gamma = \frac{I(X_1^n \to X_2^n) - I(X_2^n \to X_1^n)}{I(X_1^n; X_2^n)}, \quad (4.38)$$

where $I(X_1^n; X_2^n)$ is the mutual information between the two time series. Clearly, $\gamma \in [-1, 1]$. When $|\gamma|$ approaches 1, it can be said with high confidence that there does exist a causal influence between the two brain regions; when $|\gamma|$ is close to 0, it is more likely that no clear causal relationship exists, or that the samples in the random sequences are subject to strong noise.

[Figure 4.2: Results of directed information based causality analysis. The $\gamma$ values are shown for the resting and stimulus conditions.]

It can be seen that in the resting state, the medians of the $\gamma$ values are within the range $[-0.1, 0.1]$: V1 does not exhibit a dominating causal influence over PPA and SMC. However, under the stimulus based state, the $\gamma$ values for V1 → PPA and V1 → SMC increase significantly. In other words, under the stimulus based state, V1 shows stronger influence over PPA, as well as SMC.

From Figure 4.1 and Figure 4.2, we can see that the causality analysis result obtained with DDCM is consistent with our previous analysis using directed information in [27].

4.5.4 Summary of Result

In this section, we applied DDCM to the experimental fMRI data to examine the causal relationships in the V1–PPA and V1–SMC region pairs. We observed that under the resting state there was no dominant causal influence for either pair, while under the stimulus based state, V1 turned out to exhibit more influence over PPA and SMC. The result is consistent with the expectations and with our previous result using DI.
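Given DI and MI estimates, the $\gamma$ metric of equation (4.38) is a direct computation. A small sketch follows; the numeric values below are hypothetical, chosen only to mimic the stimulus based pattern in Figure 4.2:

```python
def gamma_metric(di_12, di_21, mi):
    """Normalized causality metric of equation (4.38); gamma lies in [-1, 1]."""
    return (di_12 - di_21) / mi

# Hypothetical DI and MI values for the V1 -> PPA pair under stimulation:
gamma = gamma_metric(di_12=1.8, di_21=0.4, mi=2.6)  # approx 0.54: V1 drives PPA
```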
4.6 Summary

This chapter investigated the discrete time DCM (DDCM) and its relationship with Directed Information (DI) and Granger Causality (GC). First, we demonstrated the relationship between DDCM and the continuous time DCM. Rather than using approximation, we proved that when the input to the neural dynamic system is a constant, DDCM can be strictly derived from DCM in the noise-free case. This result further validates the DDCM model.

Second, based on information theory, we revealed the conditional equivalence between DDCM and DI in characterizing the causal relationship between two brain regions. More specifically, assuming that the dynamic neural system is causal, the neurostate and the noise at each region are normally distributed, and the external input is a constant, we showed that DDCM and DI are equivalent in characterizing the causal relationships between two brain regions. This equivalence between DDCM and DI provides a simple method for DI estimation. We also showed that when the hemodynamic system is invertible, the DI-based causal relationship between the neurostates of two brain regions is the same as that between the observed BOLD signals. However, it should be pointed out that the conditional equivalence between DDCM and DI needs further investigation when multiple regions are involved.

Finally, we illustrated the similarities and differences between DDCM and GC. Although they share a similar mathematical form, the causality measures they utilize are completely different. Note that GC detects the causal relationship between the observed signals, while DCM detects the causal connections of the hidden neurostates. The conditional equivalence between GC and DCM remains an interesting problem. The theoretical techniques are demonstrated using fMRI data obtained under both the resting state and a stimulus based state. Our numerical analysis is consistent with the results reported in previous studies.

Chapter 5

Classification of Alzheimer's Disease, Mild Cognitive Impairment and Normal Control Subjects Using Resting-State fMRI based Network Connectivity Analysis

This chapter proposes a robust method for Alzheimer's Disease (AD), mild cognitive impairment (MCI) and normal control (NC) subject classification under size limited fMRI data samples, by exploiting brain network connectivity pattern analysis. First, we select regions of interest (ROIs) within the default mode network (DMN) to formulate a sub-network. We calculate the Pearson correlation coefficients between all possible ROI pairs in the sub-network and use them to form a feature vector for each subject. Second, we propose a regularized linear discriminant analysis (LDA) approach, where we take shrinkage based regularization procedures to reduce the noise effect (including both biological variability and measurement errors) due to the limited sample size. The feature vectors are then projected onto a one-dimensional axis using the proposed regularized LDA, where the differences between AD, MCI and NC subjects are maximized. Based on the Central Limit Theorem, we show that when used for fMRI based brain analysis, LDA is equivalent to the optimal maximum likelihood based classification method. Finally, a decision tree based multi-class AdaBoost classifier, which is robust to the noise effect, is applied to the projected one-dimensional vectors to carry out the classification task. Numerical analysis demonstrates that the combination of regularized LDA and the AdaBoost classifier can increase the classification accuracy significantly. Our analysis confirms the previous findings that the hippocampus and the isthmus of the cingulate cortex are closely involved in the development of AD and MCI.

5.1 Introduction

Alzheimer's Disease (AD) is the most common form of dementia, and causes problems with memory, thinking and behavior. It is a degenerative brain disorder, characterized by progressive deterioration of nerve cells, eventually leading to cell death.
Mild Cognitive Impairment (MCI) is a condition in which people show a slight, but noticeable and measurable, decline in cognitive capabilities, beyond what is considered normal for their age. Older people with MCI may or may not progress to AD, though they have a higher risk of doing so. Accurate distinction of AD and MCI from normal control (NC) subjects is critical for early diagnosis and treatment of brain disorders.

Traditional AD and MCI diagnosis methods are generally based on positron emission tomography (PET) and cerebrospinal fluid (CSF) [76]. In recent years, there has also been an increasing interest in noninvasive diagnosis methods based on electroencephalography (EEG) [77], structural magnetic resonance imaging (MRI) [78], and functional magnetic resonance imaging (fMRI) [36, 79].

In the literature, the majority of existing noninvasive classification approaches rely on MRI and EEG. In [80], Pritchard et al. analyzed spectral-band measures of EEG data acquired from AD patients and NC subjects. They found that classifiers based on multivariate discriminant analysis [81] and the nearest neighbor approach could typically achieve a two-category (AD and NC) classification accuracy of 80% when applied to EEG data. At the same time, they showed that the accuracy could be improved if nonlinear EEG measures were added. In [82], Magnin et al. applied the support vector machine (SVM) classifier to whole-brain anatomical MRI data acquired from AD patients and NC subjects. They formulated feature vectors for classification using gray matter information extracted from T1-weighted MR images of AD, MCI and NC subjects, and achieved a two-category classification accuracy of 94.5%. In [83], Korolev et al. combined data from clinical biomarkers, MRI signals, and plasma biomarkers and developed a classifier to predict whether an MCI patient would develop Alzheimer's disease over a three-year period. Their prediction accuracy was 80%.

More recently, fMRI, which maps brain activities to metabolic changes (such as the blood-oxygen-level dependent (BOLD) contrast) in cerebral blood flow, has also been used to classify AD, MCI and NC subjects [36, 79]. The underlying argument is that cerebral blood flow and neuronal activation are coupled. That is, when a particular brain region becomes active, blood flow to that region also increases. Compared with EEG, fMRI data can display active brain areas more directly, and has much better spatial resolution throughout the brain. Unlike structural MRI, which mainly reflects the anatomical information of brain tissues and structure, fMRI focuses more on functional brain activities; it can provide a more direct measurement of how different brain regions are involved in particular brain activities, and hence more insight into the changes of functional brain connectivity during the evolution of MCI and AD.

In [36], Wang et al. extracted two intrinsically anti-correlated networks using resting state fMRI data from 14 AD patients and 14 NC subjects, and applied a Pseudo-Fisher Linear Discriminant Analysis (pFLDA) to the high dimensional feature vectors. Their two-category classification accuracy was 83%. In [79], Chen et al. applied the same technique to larger datasets. Similarly, the accuracy of two-category classification of AD patients and NC subjects was 82%.

While structural MRI has been widely applied to the clinical diagnosis of brain disorders, fMRI has mainly been used for research purposes.
As a result, the size of available fMRI data samples is generally quite limited, which has become a major bottleneck in fMRI based AD, MCI and NC classification. The underlying reason is that, when the sample size is small, most existing classifiers suffer from severe noise effects, due to both biological variability and measurement noise.

Motivated by this observation, in this chapter, we develop a reliable method for AD, MCI and NC classification that is robust with respect to size limited fMRI data samples, by exploiting brain network connectivity pattern analysis. More specifically, we take multiple regions of interest (ROIs) in the brain, formulate a sub-network, and then analyze the network connectivity pattern by evaluating the correlation between all ROI pairs within the sub-network. The underlying argument is that, due to variability in the brain connectivity of each individual, the connectivity between two brain regions alone may not be sufficient to distinguish NC subjects from patients with cognitive impairments; brain network connectivity pattern analysis, which looks for subtle changes in the pattern of connectivity among multiple or all regions in the sub-network, may reveal more in-depth information.

The proposed classification scheme can be described as follows. First, we select an ROI sub-network and formulate the feature vectors by calculating the Pearson correlation coefficients between all pairs of ROIs. In this chapter, we formulate the ROI sub-network by selecting regions within the default mode network (DMN), which denotes the network of brain regions that are active when the brain is at the resting state [84]. Prior resting-state fMRI studies have demonstrated that the DMN is affected by AD [42, 85–88]. More specifically, we select the right and left hippocampi and isthmus of the cingulate cortices (ICCs) (4 regions) as our ROI sub-network. This is because both the hippocampus and the ICC are part of the DMN, and can be well defined anatomically through the FreeSurfer software [88], even in brains with abnormal anatomy [42]. It has also been demonstrated [42] that the functional connection between the hippocampus and the ICC is reduced in AD.

Second, we propose a regularized linear discriminant analysis (LDA) approach, where we take shrinkage based regularization procedures to reduce the noise effect (including both biological variability and measurement errors) due to the limited sample size. The feature vectors are then projected onto a one-dimensional axis using the proposed regularized LDA, where the differences between AD, MCI and NC subjects are maximized. Based on the Central Limit Theorem, we show that when used for fMRI based brain functioning classification, LDA is equivalent to the optimal maximum likelihood based classification method. Finally, a decision tree based multi-class AdaBoost classifier, which is robust to the noise effect, is applied to the projected one-dimensional vectors to carry out the classification task.

The major results of this chapter can be summarized as:

1. We propose a regularized LDA approach, which aims to reduce the noise effect by using two shrinkage methods. The first shrinkage method moves the estimated mean of each class towards the overall mean, and the second one shifts the estimated covariance matrix of each class towards the identity matrix.
Numerical analysis shows that, in comparison with the original LDA approach [41], the regularized LDA can reduce the noise effect and increase the classification accuracy significantly.

2. We investigate the relationship between LDA-based and Maximum Likelihood (ML) based classification or decision making methods. Recall that LDA aims to separate two or more classes by projecting them onto a subspace or direction where the different classes show the most significant differences [41]. In this chapter, we prove that when the original data are normally distributed, LDA is equivalent to maximizing the log-likelihood function of the projected data. Note that there are millions of neurons within one fMRI voxel; according to the Central Limit Theorem, the overall fMRI signal corresponding to each voxel approximately follows a normal distribution. This implies that when used for fMRI based brain functioning classification, LDA is equivalent to the optimal ML based classification method.

3. We conduct the connectivity pattern classification of AD, MCI and NC subjects by applying the regularized LDA and AdaBoost classifier based approach. First, we calculate the Pearson correlation coefficients between all possible pairs of the ROIs within the group to formulate the feature vectors. Second, the feature vectors are projected onto a one-dimensional axis using the proposed regularized LDA, where the differences between AD, MCI and NC subjects are maximized. Finally, we construct the decision trees based on the projected feature vectors and carry out the classification using the multi-class AdaBoost classifier. In this chapter, we choose to utilize the AdaBoost classifier instead of the naive Bayesian classifier, since it has been observed consistently in the literature that the AdaBoost classifier can achieve significantly higher classification accuracy than the naive Bayesian classifier when the sample size is very limited [89].

Our numerical results demonstrate that: (i) the LDA-Bayesian classifier can achieve a three-category (AD, MCI and NC) classification accuracy of 44%; (ii) the LDA-AdaBoost classifier can increase the accuracy to 69%; (iii) when AdaBoost is combined with the regularized LDA, the accuracy can be further increased to 75%. As expected, it is also observed that, compared with AD and NC subjects, it is more difficult for the classifier to identify MCI subjects. The classification accuracies for AD and NC are as high as 80% and 83%, respectively, while the accuracy for MCI is only 63%. Our analysis also confirms the previous findings that the hippocampus and the isthmus of the cingulate cortex are closely involved in the development of AD and MCI.

The rest of this chapter is organized as follows. In Section II, we present the proposed regularized LDA approach, and explore the relationship between LDA based and Maximum Likelihood based classification methods. In Section III, we describe the ROI sub-network formulation, and elaborate on how to perform AD, MCI and NC classification through connectivity pattern analysis. In Section IV, we present the numerical results, and we conclude in Section V.

5.2 Regularized Linear Discriminant Analysis

In this section, first, we revisit the Linear Discriminant Analysis method. Second, we integrate two shrinkage methods with the original LDA to formulate the regularized LDA. Finally, we investigate the relationship between LDA and the ML estimation method.
5.2.1 Linear Discriminant Analysis

Linear Discriminant Analysis aims to separate two or more classes by projecting them onto a subspace or direction where the different classes show the most significant differences [41]. Here, we illustrate the basic idea of LDA using the three-class case.

Suppose we have a set of $d$-dimensional vector samples $X = \{x_1, x_2, \ldots, x_n\}$, where $n_1$ of them are from the first class, denoted as $C_1$, $n_2$ of them are from the second class, denoted as $C_2$, and the remaining $n_3 = n - n_1 - n_2$ of them are from the third class, denoted as $C_3$. For $i = 1, 2, 3$, the mean and scatter matrix (i.e., the scaled covariance matrix) of each of the three classes are defined as:

$$\mu_i = \frac{1}{n_i}\sum_{x\in C_i}x, \quad (5.1)$$
$$S_i = \sum_{x\in C_i}(x - \mu_i)(x - \mu_i)^t. \quad (5.2)$$

Consider the projection of the vectors in $X$ to a new $d$-dimensional space:

$$\mathbf{y} = Wx, \quad x \in X, \quad (5.3)$$

where $W$ is a $d \times d$ matrix to be determined by the LDA algorithm. In this chapter, we only utilize the first dimension $y$ of the projected vector $\mathbf{y}$, where the differences among the three classes are maximized. As a result, equation (5.3) can be rewritten as:

$$y = w^t x, \quad (5.4)$$

where $w^t$ is the first row of the matrix $W$. For $i = 1, 2, 3$, let

$$\tilde{C}_i = \{y = w^t x \mid x \in C_i\}. \quad (5.5)$$

Define $\mu = \frac{1}{n}\sum_{i=1}^{n}x_i$ as the overall mean, $S_W = \sum_{i=1}^{3}S_i$ as the within-class scatter matrix, and the between-class scatter matrix $S_B$ as:

$$S_B = \sum_{i=1}^{3}n_i(\mu_i - \mu)(\mu_i - \mu)^t. \quad (5.6)$$

LDA seeks a transform vector $w$ that maximizes the following objective function:

$$J(w) = \frac{w^t S_B w}{w^t S_W w}. \quad (5.7)$$

It can be proved [41, 81] that to maximize equation (5.7), $w$ should satisfy

$$S_W^{-1}S_B w = \lambda w, \quad (5.8)$$

for some constant $\lambda$. Performing an eigenvalue decomposition of the matrix $S_W^{-1}S_B$, LDA then chooses the eigenvector corresponding to the largest eigenvalue of $S_W^{-1}S_B$ as $w$. As will be shown in Section III, various classifiers, such as the Bayesian classifier and the AdaBoost classifier, can then be applied to the projected vectors $\{y_i = w^t x_i\}_{i=1}^{n}$ for further classification.

5.2.2 Regularized LDA

The original LDA algorithm has been widely applied in supervised learning problems [81]. However, as mentioned earlier, when the total number of subjects is small, the estimated statistics suffer considerably from the noise effect caused by both biological variability and measurement error, leading to low classification accuracy. For our fMRI based AD, MCI and NC classification, due to the very limited sample size, LDA together with the Bayesian classifier could only achieve an accuracy under 50%. To reduce the noise effect, we propose to regularize the original LDA using two shrinkage methods.

Shrinkage of the Mean. The first shrinkage method, originally proposed by Tibshirani et al. [90] for gene expression profiling, adjusts the estimated mean vectors. In our case, let $C = \bigcup_{i=1}^{3}C_i$ be the whole sample set. Recall that for any $x \in C$, $x = [x(1), \cdots, x(d)]^t$. For $i = 1, 2, 3$ and $k = 1, \cdots, d$, define $\mu_{i,k} = \frac{1}{n_i}\sum_{x\in C_i}x(k)$ and $\mu_k = \frac{1}{n}\sum_{x\in C}x(k)$, where $n_i = |C_i|$ and $n = n_1 + n_2 + n_3$. Let $\mu_i = [\mu_{i,1}, \mu_{i,2}, \ldots, \mu_{i,d}]^t$ and $\mu = [\mu_1, \mu_2, \ldots, \mu_d]^t$. The algorithm first calculates the following variances:

$$\xi_k^2 = \frac{1}{n-3}\sum_{i=1}^{3}\sum_{x\in C_i}[x(k) - \mu_{i,k}]^2, \quad k = 1, \cdots, d. \quad (5.9)$$

Then, for $i = 1, 2, 3$ and $k = 1, \cdots, d$, the scaled distance of $\mu_{i,k}$ to the centroid $\mu_k$ is calculated as:

$$d_{i,k} = \frac{\mu_{i,k} - \mu_k}{m_i\xi_k}, \quad (5.10)$$

where $m_i = 1/n_i + 1/n$.
After that, the distance is shrunken as follows:

$$d_{i,k} \leftarrow \operatorname{sign}(d_{i,k})\max(0, |d_{i,k}| - \delta), \quad (5.11)$$

where $\delta$ is a positive step size determined through cross-validation [81]. Now, based on equation (5.10), the shrinkage is achieved as follows:

$$\mu_{i,k} \leftarrow \mu_k + m_i\xi_k d_{i,k}. \quad (5.12)$$

As can be seen from (5.11) and (5.12), each dimension of $\mu_i$ has been shrunken towards the overall mean. This shrinkage method is essentially based on the t test between the mean of each class and the overall mean at every dimension. Recall that the t score of two sets of random variables $\{z_1\}$ and $\{z_2\}$, which have the same standard deviation $s_z$, is defined as:

$$t = \frac{\mu_{z_1} - \mu_{z_2}}{s_z\sqrt{\frac{1}{n_{z_1}} + \frac{1}{n_{z_2}}}}, \quad (5.13)$$

where $\mu_{z_1} = E\{z_1\}$, $\mu_{z_2} = E\{z_2\}$, and $n_{z_1}$ and $n_{z_2}$ are the sample sizes [91]. In this shrinkage method, for $i = 1, 2, 3$ and $k = 1, \cdots, d$, the t score between each $\mu_{i,k}$ and $\mu_k$ pair is defined as a distance in (5.10). If the distance $d_{i,k}$ is small, i.e., if $|d_{i,k}| < \delta$, then most likely it is caused by the noise effect. In this case, the shrinkage method forces $d_{i,k}$ to be zero, and therefore reduces the noise effect.

Shrinkage of the Covariance Matrix. The second shrinkage method, proposed by Friedman et al. [92], regularizes the estimate of the covariance matrix $S_i$ for each class as:

$$S_i \leftarrow (1 - \epsilon)S_i + \epsilon I, \quad i = 1, 2, 3, \quad (5.14)$$

where $I$ is the identity matrix, and $\epsilon$ is a positive number determined through cross-validation. The basic idea of this shrinkage method is that, when the sample size is small, the estimated covariance matrix $S_i$, $i = 1, 2, 3$, generally becomes non-invertible. By adding a small perturbation to the slightly scaled covariance matrix, the adjusted or shrunken $S_i$ will generally become invertible, as expected.

After the regularized LDA transform, the feature vectors are projected onto a set of real valued numbers. A selected classifier can then be applied to the transformed data for further classification.
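To make the procedure concrete, the following sketch (a minimal illustration, assuming the shrinkage parameters $\delta$ and $\epsilon$ have already been chosen by cross-validation, and that class labels are integer ids) implements the regularized LDA of this section: shrunken class means, shrunken scatter matrices, and the projection of equation (5.4):

```python
import numpy as np

def regularized_lda(X, labels, delta=0.5, eps=0.1):
    """Regularized LDA: mean shrinkage (5.11)-(5.12), covariance shrinkage (5.14),
    then the dominant eigenvector of S_W^{-1} S_B as the projection w (5.8).
    X: (n, d) feature matrix; labels: length-n array of integer class ids."""
    classes = np.unique(labels)
    n, d = X.shape
    mu = X.mean(axis=0)
    # Pooled per-dimension variances, equation (5.9).
    xi2 = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum(axis=0)
              for c in classes) / (n - len(classes))
    xi = np.sqrt(xi2)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        m_i = 1.0 / len(Xc) + 1.0 / n
        dist = (Xc.mean(axis=0) - mu) / (m_i * xi)                  # (5.10)
        dist = np.sign(dist) * np.maximum(0, np.abs(dist) - delta)  # (5.11)
        mu_c = mu + m_i * xi * dist                                 # (5.12)
        Sc = (Xc - mu_c).T @ (Xc - mu_c)                            # scatter (5.2)
        S_W += (1 - eps) * Sc + eps * np.eye(d)                     # (5.14)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)             # (5.6)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return X @ w   # the one-dimensional projections y_i

# Usage: y = regularized_lda(features, subject_labels).
```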
5.2.3 LDA and the ML Estimation

In this subsection, we demonstrate that when the original data from all classes are normally distributed, LDA is equivalent to the ML method. For $i = 1, 2, 3$, assume each vector $x_j$ in class $C_i$ has the same probability density function (pdf):

$$f_X(x; \mu_i, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^d|\Sigma_i|}}\exp\Big\{-\frac{1}{2}(x - \mu_i)^t\Sigma_i^{-1}(x - \mu_i)\Big\}. \quad (5.15)$$

Consider a general linear transform defined by:

$$\mathbf{y} = Wx, \quad (5.16)$$

where $W$ is a $d \times d$ matrix. For the transformed data, the probability density function becomes:

$$f_Y(\mathbf{y}; \tilde{\mu}_i, \tilde{\Sigma}_i) = \frac{1}{\sqrt{(2\pi)^d|\tilde{\Sigma}_i|}}\exp\Big\{-\frac{1}{2}(\mathbf{y} - \tilde{\mu}_i)^t\tilde{\Sigma}_i^{-1}(\mathbf{y} - \tilde{\mu}_i)\Big\}, \quad (5.17)$$

where, for $i = 1, 2, 3$, $\tilde{\mu}_i = W\mu_i$ and $\tilde{\Sigma}_i = W\Sigma_iW^t$.

Recall that in LDA, we try to find $W$ such that the difference among the different classes is maximized in the transformed space. Without loss of generality, we assume that the major difference lies in the first dimension of the transformed vector $\mathbf{y}$ only, and the remaining $d-1$ dimensions make little contribution. Under this assumption, $\tilde{\mu}_i$ and $\tilde{\Sigma}_i$ can be decomposed as:

$$\tilde{\mu}_i = \begin{bmatrix} \tilde{\mu}_i^1 \\ \tilde{\mu}^{d-1} \end{bmatrix}, \quad \tilde{\Sigma}_i = \begin{bmatrix} \tilde{\Sigma}_i^1 & 0 \\ 0 & \tilde{\Sigma}^{d-1} \end{bmatrix}, \quad (5.18)$$

since for each $i$, $\tilde{\mu}_i^{d-1} \approx \tilde{\mu}^{d-1}$ and $\tilde{\Sigma}_i^{d-1} \approx \tilde{\Sigma}^{d-1}$. Accordingly, the matrix $W$ can also be decomposed as

$$W = \begin{bmatrix} W^1 \\ W^{d-1} \end{bmatrix}. \quad (5.19)$$

In this case, we have $\tilde{\mu}_i^1 = W^1\mu_i$, $\tilde{\mu}_i^{d-1} = W^{d-1}\mu_i$, $\tilde{\Sigma}_i^1 = W^1\Sigma_iW^{1t}$ and $\tilde{\Sigma}_i^{d-1} = W^{d-1}\Sigma_iW^{(d-1)t}$. For fairness, in LDA based classification, the sample size of the three classes is assumed to be the same, i.e., $n_1 = n_2 = n_3 = n/3$. With the probability density function given in (5.17), the log-likelihood function of the transformed data can be written as [93]:

$$\begin{aligned}
L(W) &= \sum_{i=1}^{3}\sum_{y\in\tilde{C}_i}\log f_Y(y; \tilde{\mu}_i, \tilde{\Sigma}_i) \\
&= n\log|W| - \frac{n}{2}\log(2\pi)^d - \sum_{i=1}^{3}\frac{n_i}{2}\log|\tilde{\Sigma}_i^1| - \frac{1}{2}\sum_{i=1}^{3}\sum_{x\in C_i}(W^1x - \tilde{\mu}_i^1)^t(\tilde{\Sigma}_i^1)^{-1}(W^1x - \tilde{\mu}_i^1) \\
&\quad - \frac{n}{2}\log|\tilde{\Sigma}^{d-1}| - \frac{1}{2}\sum_{x\in C}(W^{d-1}x - \tilde{\mu}^{d-1})^t(\tilde{\Sigma}^{d-1})^{-1}(W^{d-1}x - \tilde{\mu}^{d-1}). \quad (5.20)
\end{aligned}$$

To find the optimal $W$ that maximizes $L(W)$, set $\frac{\partial L(W)}{\partial\tilde{\Sigma}_i^1} = 0$ and $\frac{\partial L(W)}{\partial\tilde{\Sigma}^{d-1}} = 0$; we get:

$$\tilde{\Sigma}_i^1 = W^1S_WW^{1t}, \quad (5.21)$$
$$\tilde{\Sigma}^{d-1} = W^{d-1}S_BW^{(d-1)t}. \quad (5.22)$$

Substituting (5.21) and (5.22) into (5.20) and removing the constant terms, the optimization of $L(W)$ is equivalent to optimizing the following function:

$$L_{eq}(W) = n\log|W| - \frac{n}{2}\log|W^1S_WW^{1t}| - \frac{n}{2}\log|W^{d-1}S_BW^{(d-1)t}|. \quad (5.23)$$

The optimal choice of $W$ will satisfy the differential equation:

$$\frac{dL_{eq}(W)}{dW} = 0. \quad (5.24)$$

It was shown in [94–96] that the partial differential equations are satisfied when $W$ is composed of the eigenvectors of the matrix $S_W^{-1}S_B$. If we only keep the eigenvector corresponding to the largest eigenvalue of $S_W^{-1}S_B$, then we obtain the LDA algorithm presented in Section 5.2.1. As can be seen, LDA is equivalent to the ML method.

5.3 Classification of AD, MCI and NC Subjects based on Connectivity Pattern Analysis

In this section, we formulate the ROI sub-network, and perform AD, MCI and NC subject classification through connectivity pattern analysis, by exploiting the proposed regularized LDA.

5.3.1 ROI Sub-Network Formulation and Connectivity Pattern Analysis

The default mode network (DMN) is one of the most well studied networks at the resting state [84]. Prior resting-state fMRI studies have demonstrated that the DMN is affected by AD [42, 85–88]. Both the hippocampus and the ICC are part of the DMN, and can be well defined anatomically through the FreeSurfer software [88], even in brains with abnormal anatomy [42]. The paper by Zhu et al. [42] specifically demonstrated that the functional connection between the hippocampus and the ICC was reduced in AD. Motivated by the observations above, in this chapter, we select the right and left hippocampi and ICCs (4 regions) as our ROI sub-network.

Our connectivity pattern analysis is carried out following the procedure below. First, we calculate the Pearson correlation coefficients between all possible pairs of the ROIs within the group to formulate the feature vectors; a sketch of this step is given below. As we now have 4 regions in the ROI sub-network, for each subject $i$, we can obtain a $d$-dimensional ($d = 6$) vector $x_i$, consisting of the Pearson correlation coefficients for each pair of ROIs. When we have $n$ subjects, we get the feature vector set $\{x_1, \cdots, x_n\}$. Second, using the proposed regularized LDA, we map $\{x_1, \cdots, x_n\}$ to a one-dimensional subspace or axis, where the differences between AD, MCI and NC subjects are maximized, and denote the projected vectors as $\{y_1, \cdots, y_n\}$. Finally, we construct the decision trees based on $\{y_1, \cdots, y_n\}$ and carry out the classification using the multi-class AdaBoost classifier. In the following subsections, we will provide more details on decision tree construction and multi-class classification using AdaBoost.
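A minimal sketch of the feature extraction step, assuming each subject is represented by its four ROI-averaged BOLD time courses (the array shapes are illustrative; the upper-triangular correlations give the $d = 6$ features):

```python
import numpy as np

def connectivity_features(roi_timecourses):
    """Form a subject's feature vector from the Pearson correlation coefficients
    of all ROI pairs. roi_timecourses: (4, T) array, one averaged BOLD time
    course per ROI (right/left hippocampus, right/left ICC)."""
    R = np.corrcoef(roi_timecourses)     # 4 x 4 correlation matrix
    iu = np.triu_indices_from(R, k=1)    # the 6 distinct ROI pairs
    return R[iu]                         # d = 6 feature vector

# Stacking one vector per subject yields the (n, 6) matrix that is fed to the
# regularized LDA of Section 5.2.
```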
Here, the number of regions, K, and the boundaries for all the regions are chosen by the decision tree algorithm to minimize the Gini impurity coefficient IG [97]. More specifically, assuming yi ∈ ci , where ci ∈ {C˜1 , C˜2 , C˜3 }. Here C˜1 , C˜2 , C˜3 denote the projected data set corresponding to AD, MCI and NC subjects, respectively. For k = 1, 2, ..., K, without loss of generality, suppose km data samples {(yk1 , ck1 ), (yk2 , ck2 ), ..., (ykm , ckm )} are assigned to node k, where yk ∈ {yi }, ck ∈ {C˜1 , C˜2 , C˜3 }, kl = 1, · · · , km . The Gini l l 108 impurity coefficient of node k is calculated as: km IG (k) = ki =1 where fki = fki (1 − fki ), (5.25) number of occurance of ck within node k i . km For any given input y to be classified, if y falls within the boundaries of node k, then it will be assigned to node k, and paired with the majority class inside this node. Note that in our case, yi , i = 1, · · · , n, are all real-valued numbers. That is, {yi } ∈ R. In this case, the boundary between two neighboring regions is reduced to a point, and hence each reigon corresponds to an interval on the R axis. The decision tree is a weak classifier. In most applications, it needs to be incorporated with an ensemble classifier to achieve higher accuracy. Some representative ensemble algorithms include Bagging and Boosting [98]. Note that the naive Bayesian classifier is subject to severe overfitting problems when the sample size is limited, leading to low classification accuracy. In the following, we will apply the AdaBoost algorithm to construct the ensemble classifier, due to its robustness under noise effect [81, 89]. 5.3.3 The Multi-Class AdaBoost Classifier The multi-class AdaBoost classifier is built upon an ensemble of weak decision tree classifiers [89]. Given a set of labeled data {(y1 , c1 ), (y2 , c2 ), ..., (yn , cn )}, where ci ∈ {C˜1 , C˜2 , C˜3 }, the algorithm first starts with an empty ensemble and T decision trees, as constructed above. Each sample yi in the date set is given an initial weight wi = 1/n, where i = 1, 2, · · · , n. Then for t = 1, 2, · · · , T , the algorithm will iteratively implement the following procedures: 1. Weighted Classification Error Calculation Apply a weak decision tree classifier t to 109 the samples and calculate the weighted classification error. More specifically, let 1 if ci = c˜i , 0 if ci = c˜i , I(ci , c˜i ) = (5.26) where c˜i is the assigned class for sample yi , and ci is the true class yi belongs to. Then the weighted classification error et would be n et = wi I(ci , c˜i ). (5.27) i=1 2. Voting Weight Assignment Based on the weighted classification error et , the algorithm will assign a voting weight αt for the weak decision tree classifier t as follows: αt = ln 1 − et + ln2, et (5.28) and then add classifier t into the ensemble. 3. Weight Update Before the next iteration, the weight of each data sample yi is updated as follows: wi ← wi eαt I(ci ,c˜i ) , wi ← wi n i=1 . (5.29) (5.30) wi The procedure in (5.30) ensures that the weights {wi }, i = 1, 2, · · · , n, form a probability distribution with n i=1 wi = 1. As can be seen, after the update, those samples which have been incorrectly classified in current iteration will have higher weights in the next iteration. 4. Final Classification After T iterations, there will be T decision trees in the ensemble. 110 The final classification is a weighted majority votes of each of those classifiers. 
As will be shown in the next section, the combination of the regularized LDA and AdaBoost can achieve much higher accuracy in AD, MCI and NC classification than the conventional approach based on the original LDA and the Bayesian classifier.

5.4 Numerical Analysis

5.4.1 fMRI Data Acquisition

In our data collection process, 10 patients with mild-to-moderate probable Alzheimer's Disease, 11 patients with MCI, and 12 age- and education-matched healthy NC subjects were recruited to participate in this study. The fMRI experiment was conducted on a GE 3T Signa HDx MR scanner (GE Healthcare, Waukesha, WI) with an 8-channel head coil. To study resting-state brain function, echo-planar images, starting from the most inferior regions of the brain, were acquired for 7 minutes with the following parameters: 38 contiguous 3-mm axial slices in an interleaved order, time of echo = 27.7 ms, time of repetition = 2500 ms, flip angle = 80°, field of view = 22 cm, matrix size = 64 × 64, ramp sampling, and with the first four data points discarded. Each volume of slices was acquired 164 times. Common pre-processing procedures on the resting state fMRI data were carried out as detailed in [42].

5.4.2 Performance Comparison of Different Classification Algorithms

In this subsection, we present the classification performance of the proposed method and compare it with existing methods. Since the size of the data samples is small, the performance of the classifier is evaluated by Leave-One-Out (LOO) cross-validation. As described earlier, the ROIs used are the hippocampus and ICC from both hemispheres of the brain.

[Figure 5.1: Comparison of 3-category (AD, MCI, NC) classification results, showing the accuracy of LDA with the Bayesian classifier, LDA with AdaBoost, and the regularized LDA with AdaBoost.]

Figure 5.1 shows the performance of the three classifiers. In the first one, a naive Bayesian classifier is employed after the original LDA transform. As can be seen, its final accuracy is only 44%. As explained in Section II, the main reason for such an unsatisfactory performance is that, when the number of data samples is small, the estimation of the class means and covariance matrices in LDA suffers from a severe noise effect, leading to overfitting. In the second one, the original LDA is combined with the AdaBoost classifier. As can be seen, the accuracy is increased to 69% by AdaBoost. The third one is what we proposed: the regularized LDA combined with the AdaBoost classifier. The shrinkage operations in the regularized LDA reduce the noise effect in the estimation, and further improve the accuracy to 75%.

[Figure 5.2: Regularized LDA with the AdaBoost classifier: classification accuracy for the AD, MCI and NC categories.]

Figure 5.2 shows the classification results for all three classes of subjects using the regularized LDA and the AdaBoost classifier. As expected, compared with NC subjects and AD patients, it is more difficult for the classifier to identify MCI patients, and the classification accuracy for MCI patients is much lower than that for AD and NC subjects.

5.5 Summary

This chapter proposed a reliable method for AD, MCI and NC subject classification that is highly robust under size limited fMRI data samples, by exploiting brain network connectivity pattern analysis.
5.5 Summary

This chapter proposed a reliable method for AD, MCI and NC subject classification that is highly robust under size-limited fMRI data samples, by exploiting brain network connectivity pattern analysis. To this end, we first selected the right and left hippocampi and the isthmus of the cingulate cortices (ICCs) as our ROI sub-network, calculated the Pearson correlation coefficients between all possible ROI pairs, and used them to form a feature vector for each subject (a short illustrative sketch is given at the end of this section). Second, the feature vectors were projected onto a one-dimensional axis using the proposed regularized LDA approach, where the differences between AD, MCI and NC subjects were maximized. Shrinkage-based regularization procedures were taken to reduce the noise effect due to the limited sample size. Finally, a decision tree based multi-class AdaBoost classifier, which is robust to noise, was applied to the projected one-dimensional vectors to perform the classification.

Both the theoretical and numerical analysis demonstrated that: (i) the regularization methods and the AdaBoost classifier can increase the classification accuracy significantly; (ii) brain network connectivity analysis, which evaluates the changes in the pattern of connectivity among multiple or all regions in the sub-network, can reveal in-depth information about brain connectivity and result in relatively accurate classification of AD, MCI and NC subjects, especially when the sample size is very limited; and (iii) our analysis confirms previous findings that the hippocampus and the isthmus of the cingulate cortex are closely involved in the development of AD and MCI. Potentially, the proposed framework can be applied to other classification problems as well, especially under limited sample size.
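As a concrete illustration of the feature-construction step recapped above, the following minimal NumPy sketch (illustrative names, not dissertation code) forms the six-dimensional correlation feature vector for one subject; it assumes the mean BOLD time series of the four ROIs have already been extracted and preprocessed.

import numpy as np

def connectivity_features(roi_ts):
    # roi_ts: (4, T) array holding the mean preprocessed BOLD time series
    # of the four ROIs: left/right hippocampus and left/right ICC.
    r = np.corrcoef(roi_ts)               # 4 x 4 Pearson correlation matrix
    iu = np.triu_indices_from(r, k=1)     # strictly upper-triangular indices
    return r[iu]                          # 6 pairwise coefficients = feature vector

Stacking these vectors over all subjects yields the feature matrix that is fed to the regularized LDA projection.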
Chapter 6

Conclusions and Future Work

6.1 Conclusions

Brain functional connectivity analysis has been shown to be an essential method for revealing the organization of brain networks, identifying functional hubs and discovering biological patterns. In this research, we developed innovative modeling and analysis methodologies by exploiting advanced techniques in network communication and information theory, and explored how the proposed techniques can help us understand cognitive processes and identify brain disorders.

First, in Chapter 2, we considered the measurement of non-directional connectivity between brain regions using mutual information. We proposed a novel approach for the estimation of MI, composed of three major components: de-correlation, kernel-based estimation of the probability density function, and Monte Carlo integration. The proposed MI estimator was applied to experimental fMRI data obtained from Alzheimer's disease patients and normal subjects. The numerical results were consistent with clinical observations.

Next, in Chapter 3, we presented the directed information framework and showed how to apply it to fMRI causality analysis. We provided the detailed procedure for calculating the DI between two finite time series; the two major steps involved are optimal bin size selection for data digitization, and probability estimation. We applied the DI based causality analysis to both simulated data and experimental fMRI data, and compared the results with those of the Granger Causality analysis. Our results indicated that GC analysis is effective in detecting linear or nearly linear causal relationships, but has difficulty capturing nonlinear causal relationships. On the other hand, DI based causality analysis is effective in capturing both linear and non-linear causal relationships. Moreover, it was observed that brain connectivity among different regions generally involves dynamic two-way information transmission between them, and our results showed that when bidirectional information flow is present, DI is more effective than GC in quantifying the overall causal relationship.

Then, in Chapter 4, we investigated the discrete time DCM (DDCM) and its relationships with directed information and Granger Causality. First, we demonstrated the relationship between DDCM and the continuous time DCM. Rather than relying on approximations, we proved that when the input to the neural dynamic system is a constant, DDCM can be strictly derived from DCM in the noise-free case. This result further validates the DDCM model. Second, based on information theory, we revealed the conditional equivalence between DDCM and DI in characterizing the causal relationships between two brain regions. More specifically, assuming that the dynamic neural system is causal, that the neurostate and the noise at each region are normally distributed, and that the external input is a constant, we showed that DDCM and DI are equivalent in characterizing the causal relationships between two brain regions. This equivalence provides a simple method for DI estimation. We also showed that when the hemodynamic system is invertible, the DI-based causal relationship between the neurostates of two brain regions is the same as that between the observed BOLD signals. However, it should be pointed out that the conditional equivalence between DDCM and DI needs further investigation when multiple regions are involved. Finally, we illustrated the similarities and differences between DDCM and GC. Although they share a similar mathematical form, the causality measures they utilize are completely different: GC detects the causal relationship between the observed signals, while DCM detects the causal connections of the hidden neurostates. The conditional equivalence between GC and DCM remains an interesting problem. The theoretical techniques were demonstrated using fMRI data obtained under both the resting state and a stimulus-based state. Our numerical analysis was consistent with the results reported in previous studies.

Finally, in Chapter 5, we proposed a reliable method for AD, MCI and NC subject classification that is highly robust under size-limited fMRI data samples, by exploiting brain network connectivity pattern analysis. To this end, we first selected the right and left hippocampi and the isthmus of the cingulate cortices (ICCs) as our ROI sub-network, calculated the Pearson correlation coefficients between all possible ROI pairs, and used them to form a feature vector for each subject. Second, the feature vectors were projected onto a one-dimensional axis using the proposed regularized LDA approach, where the differences between AD, MCI and NC subjects were maximized. Shrinkage-based regularization procedures were taken to reduce the noise effect due to the limited sample size. Finally, a decision tree based multi-class AdaBoost classifier, which is robust to noise, was applied to the projected one-dimensional vectors to perform the classification.
Both the theoretical and numerical analysis demonstrated that: (i) the regularization methods and the AdaBoost classifier can increase the classification accuracy significantly; (ii) brain network connectivity analysis, which evaluates the changes in the pattern of connectivity among multiple or all regions in the sub-network, can reveal in-depth information about brain connectivity and result in relatively accurate classification of AD, MCI and NC subjects, especially when the sample size is very limited; and (iii) our analysis confirms previous findings that the hippocampus and the isthmus of the cingulate cortex are closely involved in the development of AD and MCI. Potentially, the proposed framework can be applied to other classification problems as well, especially under limited sample size.

6.2 Future Work

We propose the following directions for future research.

Brain network connectivity pattern analysis – from Pearson correlation to mutual information

• In Chapters 2 and 5, interesting results were drawn based on mutual information and network connectivity pattern analysis. Further research can be conducted on MI (including multi-voxel MI) based network connectivity pattern analysis. It is important to develop new classification algorithms based on the mutual information characteristics of a subset of the subjects, compare the classification results with those from the Pearson-based approach, and explore the possibility of developing MI as a new biomarker for AD diagnosis, especially at the early stage.

• In addition to the existing data set, future research can also be conducted on the Alzheimer's Disease Neuroimaging Initiative (ADNI) (http://adni.loni.usc.edu/) data set, where more than 200 resting-state fMRI datasets have been collected from aged normal subjects, and from MCI and AD patients. Some subjects have multiple time points, and the dataset size continues to grow. The ADNI dataset is freely available to researchers.

Understanding causality from a network perspective

• For causality analysis, traditionally, only two regions are considered at a time. Note that the dependence relationship between two time series may not be preserved when additional observations are taken into account. Further research is needed to study the causality between two brain regions while taking additional regions into consideration, and to investigate the relationship between the extended discrete DCM model and multivariate DI theory.

Stability analysis of the brain network

• Brain network stability is essentially an uncharted area. While the stability of functional brain activity measures has been studied by examining the variations of various measures along the temporal and spatial domains [99], and the stability of brain network topology has been studied in [100], existing work on brain network stability remains very limited. It is important for future research to explore how to characterize brain stability quantitatively, model the information processing at the region level, and evaluate network communication pattern changes in different groups of subjects.

BIBLIOGRAPHY

[1] A. Tsai, J. W. Fisher, C. Wible, W. M. Wells, J. Kim, and A. Willsky, “Analysis of functional MRI data using mutual information,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI’99, ser. Lecture Notes in Computer Science, C. Taylor and A. Colchester, Eds. Springer Berlin Heidelberg, 1999, vol. 1679, pp. 473–480.
[2] V. Michel, C. Damon, and B. Thirion, “Mutual information-based feature selection enhances fMRI brain activity classification,” in 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro (ISBI 2008), May 2008, pp. 592–595.

[3] B. Chai et al., “Exploring functional connectivities of the human brain using multivariate information analysis,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 270–278.

[4] L. M. Parkes, J. V. Schwarzbach, A. A. Bouts, P. Pullens, C. M. Kerskens, D. G. Norris et al., “Quantifying the spatial resolution of the gradient echo and spin echo BOLD response at 3 Tesla,” Magnetic Resonance in Medicine, vol. 54, no. 6, pp. 1465–1472, 2005.

[5] V. Gómez-Verdejo, M. Martínez-Ramón, J. Florensa-Vila, and A. Oliviero, “Analysis of fMRI time series with mutual information,” Medical Image Analysis, vol. 16, no. 2, pp. 451–458, 2012.

[6] A. Roebroeck et al., “Causal time series analysis of functional magnetic resonance imaging data,” in JMLR: Workshop and Conference Proceedings 12, 2011, pp. 65–94.

[7] C. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, 1969.

[8] K. J. Friston, “Functional and effective connectivity: a review,” Brain Connectivity, vol. 1, no. 1, pp. 13–36, 2011.

[9] M. Hu and H. Liang, “A copula approach to assessing Granger causality,” NeuroImage, vol. 100, pp. 125–134, 2014.

[10] P. Liang, Z. Li, G. Deshpande, Z. Wang, X. Hu, and K. Li, “Altered causal connectivity of resting state brain networks in amnesic MCI,” PLoS ONE, vol. 9, no. 3, p. e88476, 2014.

[11] O. David et al., “Identifying neural drivers with functional MRI: an electrophysiological validation,” PLoS Biology, vol. 6, pp. 2683–2697, Dec. 2008.

[12] J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed. New York: Cambridge University Press, Sep. 2009.

[13] E. E. Schadt et al., “An integrative genomics approach to infer causal associations between gene expression and disease,” Nature Genetics, vol. 37, pp. 710–717, July 2005.

[14] M. Luessi et al., “Variational Bayesian causal connectivity analysis for fMRI,” Frontiers in Neuroinformatics, vol. 8, pp. 1–16, May 2014.

[15] K. Friston et al., “Dynamic causal modeling,” NeuroImage, vol. 19, pp. 1273–1302, 2003.

[16] K. Stephan et al., “Dynamic causal models of neural system dynamics: current state and future extensions,” J. Biosci., vol. 32, pp. 129–144, Jan. 2007.

[17] S. Ryali et al., “Multivariate dynamical systems models for estimating causal interactions in fMRI,” NeuroImage, vol. 54, pp. 807–823, 2011.

[18] R. Vicente et al., “Transfer entropy – a model-free measure of effective connectivity for the neurosciences,” J. Comput. Neurosci., vol. 30, pp. 45–67, 2011.

[19] M. Lungarella and O. Sporns, “Mapping information flow in sensorimotor networks,” PLoS Computational Biology, vol. 2, pp. 1301–1312, Oct. 2006.

[20] R. Vicente et al., “Transfer entropy – a model-free measure of effective connectivity for the neurosciences,” Journal of Computational Neuroscience, vol. 30, pp. 45–67, Aug. 2010.

[21] J. T. Lizier et al., “Multivariate information-theoretic measures reveal directed information structure and task relevant changes in fMRI connectivity,” Journal of Computational Neuroscience, vol. 30, pp. 85–107, Aug. 2010.
[22] C. Quinn et al., “Estimating the directed information to infer causal relationships in ensemble neural spike train recordings,” J. Comput. Neurosci., vol. 30, pp. 17–44, Feb. 2011.

[23] M. Wibral et al., Directed Information Measures in Neuroscience. Springer, 2014.

[24] J. Massey, “Causality, feedback, and directed information,” in Int. Symp. Inf. Theory Appl., Honolulu, HI, Nov. 1990, pp. 303–305.

[25] H. H. Permuter et al., “Interpretations of directed information in portfolio theory, data compression, and hypothesis testing,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 3248–3259, Jun. 2009.

[26] N. Soltani and A. Goldsmith, “Inferring neural connectivity via measured delay in directed information estimates,” in IEEE Int. Symp. on Inf. Theory, July 2013, pp. 2503–2507.

[27] Z. Wang, A. Alahmadi, D. Zhu, and T. Li, “Causality analysis of fMRI data based on the directed information theory framework,” IEEE Transactions on Biomedical Engineering, vol. PP, no. 99, pp. 1–1, 2015.

[28] Y. Liu and S. Aviyente, “Quantification of effective connectivity in the brain using a measure of directed information,” Computational and Mathematical Methods in Medicine, vol. 2012, 2012.

[29] P.-O. Amblard and O. J. Michel, “On directed information theory and Granger Causality graphs,” J. Comput. Neurosci., vol. 30, pp. 7–16, Feb. 2010.

[30] J. Jiao et al., “Universal estimation of directed information,” IEEE Trans. Inf. Theory, 2014.

[31] A. Razi and K. J. Friston, “The connected brain: Causality, models, and intrinsic dynamics,” IEEE Signal Processing Magazine, vol. 33, no. 3, pp. 14–35, 2016.

[32] R. M. Hutchison, T. Womelsdorf, E. A. Allen, P. A. Bandettini, V. D. Calhoun, M. Corbetta, S. Della Penna, J. H. Duyn, G. H. Glover, J. Gonzalez-Castillo et al., “Dynamic functional connectivity: promise, issues, and interpretations,” NeuroImage, vol. 80, pp. 360–378, 2013.

[33] G. K. Cooray, B. Sengupta, P. Douglas, M. Englund, R. Wickstrom, and K. Friston, “Characterising seizures in anti-NMDA-receptor encephalitis with dynamic causal modelling,” NeuroImage, vol. 118, pp. 508–519, 2015.

[34] G. K. Cooray, B. Sengupta, P. K. Douglas, and K. Friston, “Dynamic causal modelling of electrographic seizure activity using Bayesian belief updating,” NeuroImage, vol. 125, pp. 1142–1154, 2016.

[35] D. S. Bassett, M. Yang, N. F. Wymbs, and S. T. Grafton, “Learning-induced autonomy of sensorimotor systems,” Nature Neuroscience, vol. 18, no. 5, pp. 744–751, 2015.

[36] K. Wang et al., “Discriminative analysis of early Alzheimer's disease based on two intrinsically anti-correlated networks with resting-state fMRI,” Medical Image Computing and Computer-Assisted Intervention – MICCAI 2006, pp. 340–347, 2006.

[37] N. U. Dosenbach, B. Nardos, A. L. Cohen, D. A. Fair, J. D. Power, J. A. Church, S. M. Nelson, G. S. Wig, A. C. Vogel, C. N. Lessov-Schlaggar et al., “Prediction of individual brain maturity using fMRI,” Science, vol. 329, no. 5997, pp. 1358–1361, 2010.

[38] G. Bellucci, S. Chernyak, M. Hoffman, G. Deshpande, O. Dal Monte, K. M. Knutson, J. Grafman, and F. Krueger, “Effective connectivity of brain regions underlying third-party punishment: functional MRI and Granger causality evidence,” Social Neuroscience, vol. 12, no. 2, pp. 124–134, 2017.
[39] A. Duggento, G. Valenza, L. Passamonti, M. Guerrisi, R. Barbieri, and N. Toschi, “Reconstructing multivariate causal structure between functional brain networks through a Laguerre-Volterra based Granger causality approach,” in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2016, pp. 5477–5480.

[40] X. Hu, S. Hu, J. Zhang, W. Kong, and Y. Cao, “A fatal drawback of the widely used Granger causality in neuroscience,” in 2016 Sixth International Conference on Information Science and Technology (ICIST). IEEE, 2016, pp. 61–65.

[41] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[42] D. C. Zhu et al., “Alzheimer's disease and amnestic mild cognitive impairment weaken connections within the default-mode network: a multi-modal imaging study,” Journal of Alzheimer's Disease, vol. 34, no. 4, pp. 969–984, 2013.

[43] B. Afshin-Pour, H. Soltanian-Zadeh, G.-A. Hossein-Zadeh, C. L. Grady, and S. C. Strother, “A mutual information-based metric for evaluation of fMRI data-processing approaches,” Human Brain Mapping, vol. 32, pp. 699–715, 2011.

[44] Q. Wang et al., “Divergence estimation for multidimensional densities via k-nearest-neighbor distances,” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2392–2405, May 2009.

[45] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, Miscellaneous Clustering Methods, in Cluster Analysis, 5th ed. Chichester, UK: Wiley, 2011.

[46] J. S. Racine, “Nonparametric econometrics: A primer,” Foundations and Trends in Econometrics, vol. 3, no. 1, pp. 1–88, 2008.

[47] T. Weise, Global Optimization Algorithms – Theory and Application. Germany: it-weise.de (self-published), 2009. [Online]. Available: http://www.it-weise.de/projects/book.pdf

[48] C. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. Springer, 2005.

[49] J. Geweke, “Measurement of linear dependence and feedback between multiple time series,” Journal of the American Statistical Association, vol. 77, no. 378, pp. 304–313, 1982.

[50] C. Bernasconi and P. König, “On the directionality of cortical interactions studied by structural analysis of electrophysiological recordings,” Biological Cybernetics, vol. 81, pp. 199–210, 1999.

[51] R. Goebel et al., “Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger Causality mapping,” Magnetic Resonance Imaging, vol. 21, pp. 1251–1261, 2003.

[52] A. Roebroeck et al., “Mapping directed influence over the brain using Granger Causality and fMRI,” NeuroImage, vol. 25, pp. 230–242, 2005.

[53] S. L. Bressler et al., “Top-down control of human visual cortex by frontal and parietal cortex in anticipatory visual spatial attention,” The Journal of Neuroscience, vol. 28, pp. 10056–10061, 2008.

[54] X. Wen et al., “Is Granger Causality a viable technique for analyzing fMRI data?” PLoS ONE, vol. 8, pp. 1–11, July 2013.

[55] ——, “Causal interactions in attention networks predict behavioral performance,” The Journal of Neuroscience, vol. 32, pp. 1284–1292, Jan. 2012.

[56] B. P. Bezruchko et al., “Modeling nonlinear oscillatory systems and diagnostics of coupling between them using chaotic time series analysis: applications in neurophysiology,” Physics-Uspekhi, vol. 51, no. 3, pp. 304–310, 2008.

[57] D. Marinazzo et al., “Kernel method for nonlinear Granger Causality,” Physical Review Letters, vol. 100, no. 14, p. 144103, 2008.
[58] S. Iyengar, J. Dauwels, P. Varshney, and A. Cichocki, “Quantifying EEG synchrony using copulas,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010, pp. 505–508.

[59] D. Marinazzo, W. Liao, H. Chen, and S. Stramaglia, “Nonlinear connectivity by Granger Causality,” NeuroImage, vol. 58, no. 2, pp. 330–338, 2011.

[60] G. Kramer, “Capacity results for the discrete memoryless network,” IEEE Trans. Inf. Theory, vol. 49, no. 1, pp. 4–21, 2003.

[61] J. Henderson et al., “Functions of parahippocampal place area and retrosplenial cortex in real-world scene analysis: An fMRI study,” Visual Cognition, vol. 19, no. 7, pp. 910–927, 2011.

[62] R. Epstein and N. Kanwisher, “A cortical representation of the local visual environment,” Nature, vol. 392, no. 6676, pp. 598–601, 1998.

[63] J. Jeong et al., “Mutual information analysis of the EEG in patients with Alzheimer's disease,” Clinical Neurophysiology, vol. 112, pp. 827–835, 2001.

[64] L. Bettencourt et al., “Functional structure of cortical neuronal networks grown in vitro,” Physical Review E, vol. 75, no. 2, p. 021915, 2007.

[65] D. Scott, “On optimal and data-based histograms,” Biometrika, vol. 66, no. 3, pp. 605–610, 1979.

[66] F. Willems et al., “The context-tree weighting method: Basic properties,” IEEE Trans. Inf. Theory, vol. 41, no. 3, May 1995.

[67] J. Talairach and P. Tournoux, Co-planar Stereotaxic Atlas of the Human Brain: 3-Dimensional Proportional System: An Approach to Cerebral Imaging. New York: Georg Thieme Verlag, 1988.

[68] S. T. Witt et al., “Functional neuroimaging correlates of finger-tapping task variations: an ALE meta-analysis,” NeuroImage, vol. 42, no. 1, pp. 343–356, 2008.

[69] D. C. Zhu and S. Majumdar, “Integration of resting-state fMRI and diffusion-weighted MRI connectivity analyses of the human brain: limitations and improvement,” J. Neuroimaging, vol. 24, no. 2, pp. 176–186, 2014.

[70] R. W. Cox, “AFNI: software for analysis and visualization of functional magnetic resonance neuroimages,” Comput. Biomed. Res., vol. 29, no. 3, pp. 162–173, 1996.

[71] S. Smith, “Overview of fMRI analysis,” The British Journal of Radiology, vol. 77, pp. S167–S175, 2004.

[72] K. Schreiber and B. Krekelberg, “The statistical analysis of multi-voxel patterns in functional imaging,” PLoS ONE, vol. 8, no. 7, p. e69328, 2013.

[73] K. Friston, R. Moran, and A. K. Seth, “Analysing connectivity with Granger causality and dynamic causal modelling,” Current Opinion in Neurobiology, vol. 23, no. 2, pp. 172–178, 2013.

[74] K. Friston, “Dynamic causal modeling and Granger causality. Comments on: The identification of interacting networks in the brain using fMRI: Model selection, causality and deconvolution,” NeuroImage, vol. 58, no. 2, pp. 303–305, 2011.

[75] J. F. Smith et al., “Identification and validation of effective connectivity networks in functional magnetic resonance imaging using switching linear dynamic systems,” NeuroImage, vol. 52, pp. 1027–1040, 2010.

[76] E. Westman et al., “Combining MRI and CSF measures for classification of Alzheimer's disease and prediction of mild cognitive impairment conversion,” NeuroImage, vol. 62, no. 1, pp. 229–238, 2012.

[77] C. Lehmann et al., “Application and comparison of classification algorithms for recognition of Alzheimer's disease in electrical brain activity (EEG),” Journal of Neuroscience Methods, vol. 161, no. 2, pp. 342–350, 2007.
[78] R. Cuingnet et al., “Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database,” NeuroImage, vol. 56, no. 2, pp. 766–781, 2011.

[79] G. Chen et al., “Classification of Alzheimer disease, mild cognitive impairment, and normal cognitive status with large-scale network analysis based on resting-state functional MR imaging,” Radiology, vol. 259, no. 1, pp. 213–221, 2011.

[80] W. S. Pritchard et al., “EEG-based, neural-net predictive classification of Alzheimer's disease versus control subjects is augmented by non-linear EEG measures,” Electroencephalography and Clinical Neurophysiology, vol. 91, no. 2, pp. 118–130, 1994.

[81] R. O. Duda et al., Pattern Classification. John Wiley & Sons, 2012.

[82] B. Magnin et al., “Support vector machine-based classification of Alzheimer's disease from whole-brain anatomical MRI,” Neuroradiology, vol. 51, no. 2, pp. 73–83, 2009.

[83] I. O. Korolev et al., “Predicting progression from mild cognitive impairment to Alzheimer's dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification,” PLoS ONE, vol. 11, no. 2, pp. 1–25, 2016.

[84] B. Yeo et al., “The organization of the human cerebral cortex estimated by intrinsic functional connectivity,” J. Neurophysiol., vol. 106, pp. 1125–1165, June 2011.

[85] M. A. Binnewijzend et al., “Resting-state fMRI changes in Alzheimer's disease and mild cognitive impairment,” Neurobiology of Aging, vol. 33, no. 9, pp. 2018–2028, 2012.

[86] M. D. Greicius et al., “Default-mode network activity distinguishes Alzheimer's disease from healthy aging: evidence from functional MRI,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 13, pp. 4637–4642, 2004.

[87] H.-Y. Zhang et al., “Resting brain connectivity: changes during the progress of Alzheimer disease,” Radiology, vol. 256, no. 2, pp. 598–606, 2010.

[88] B. Fischl et al., “Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain,” Neuron, vol. 33, no. 3, pp. 341–355, 2002.

[89] J. Zhu et al., “Multi-class AdaBoost,” Statistics and Its Interface, vol. 2, no. 3, pp. 349–360, 2009.

[90] R. Tibshirani et al., “Diagnosis of multiple cancer types by shrunken centroids of gene expression,” Proceedings of the National Academy of Sciences, vol. 99, no. 10, pp. 6567–6572, 2002.

[91] J. Rice, Mathematical Statistics and Data Analysis. Cengage Learning, 2006.

[92] J. H. Friedman, “Regularized discriminant analysis,” Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.

[93] I. J. Myung, “Tutorial on maximum likelihood estimation,” Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90–100, 2003.

[94] H. Zhou et al., “On projections of Gaussian distributions using maximum likelihood criteria,” in Information Theory and Applications Workshop, 2009. IEEE, 2009, pp. 431–438.

[95] S. R. Searle, Matrix Algebra Useful for Statistics. New York: Wiley, 1982.

[96] N. Kumar and A. G. Andreou, “Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp. 283–297, 1998.

[97] L. Breiman et al., Classification and Regression Trees. CRC Press, 1984.

[98] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

[99] K. A. Garrison, D. Scheinost, E. S. Finn, X. Shen, and R. T. Constable, “The (in)stability of functional brain network measures across thresholds,” NeuroImage, vol. 118, pp. 651–661, 2015.
[100] J. K. Rzucidlo, P. L. Roseman, P. J. Laurienti, and D. Dagenbach, “Stability of whole brain and regional network topology within and between resting and cognitive states,” PLoS ONE, vol. 8, no. 8, p. e70275, 2013.