DIRECTED INFORMATION FOR COMPLEX NETWORK ANALYSIS FROM MULTIVARIATE TIME SERIES

by

Ying Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Electrical Engineering

2012

ABSTRACT

DIRECTED INFORMATION FOR COMPLEX NETWORK ANALYSIS FROM MULTIVARIATE TIME SERIES

by

Ying Liu

Complex networks, ranging from gene regulatory networks in biology to social networks in sociology, have received growing attention from the scientific community. The analysis of complex networks employs techniques from graph theory, machine learning and signal processing. In recent years, complex network analysis tools have been applied to neuroscience and neuroimaging studies to gain a better understanding of the human brain. In this thesis, we focus on inferring and analyzing the complex functional brain networks underlying multichannel electroencephalogram (EEG) recordings. Understanding this complex network requires the development of a measure to quantify the relationship between multivariate time series, algorithms to reconstruct the network based on the pairwise relationships, and identification of functional modules within the network. Functional and effective connectivity are two widely studied approaches to quantify the connectivity between two recordings. Unlike functional connectivity, which only quantifies the statistical dependencies between two processes by measures such as cross correlation, phase synchrony, and mutual information (MI), effective connectivity quantifies the influence one node exerts on another node. The directed information (DI) measure is one of the approaches recently proposed to capture the causal relationships between two time series. Two major challenges remain with the application of DI to multivariate data: the computational complexity of computing DI with increasing signal length, and the accuracy of estimation from limited realizations of the data. Expressions that can simplify the computation of the original definition of DI while still quantifying the causality relationship are needed. In addition, the advantage of DI over conventional causality measures such as Granger causality has not been fully investigated. In this thesis, we propose time-lagged directed information and modified directed information to address the issue of computational complexity, and compare the performance of this model-free measure with model-based measures (e.g. Granger causality) for different realistic signal models. Once the pairwise DI between two random processes is computed, another problem is to infer the underlying structure of the complex network with minimal false positive detection. We propose to use the conditional directed information (CDI) proposed by Kramer to address this issue, and introduce the time-lagged conditional directed information and modified conditional directed information to lower the computational complexity of CDI. Three network inference algorithms are presented to infer directed acyclic networks, which can quantify the causality and also detect the indirect couplings simultaneously from multivariate data. One last challenge in the study of complex networks, specifically in neuroscience applications, is to identify the functional modules from multichannel, multiple-subject recordings.
Most research on community detection in this area so far has focused on finding the association matrix based on functional connectivity, instead of effective connectivity, thus not capturing the causality in the network. In addition, in order to find a modular structure that best describes all of the subjects in a group, a group analysis strategy is needed. In this thesis, we propose a multi-subject hierarchical community detection algorithm suitable for a group of weighted and asymmetric (directed) networks representing effective connectivity, and apply the algorithm to multichannel electroencephalogram (EEG) data.

Copyright by
Ying Liu
2012

ACKNOWLEDGEMENTS

First of all, I would like to express my sincere gratitude to my advisor, Professor Selin Aviyente, for her guidance, encouragement, and support in every stage of my graduate study. Her knowledge, kindness, patience, passion, and vision have affected me deeply and will provide me with lifetime benefits. I am also grateful to my dissertation committee members, Professor Hayder Radha, Professor Ramakrishna Mukkamala, and Professor Pang-Ning Tan, for their valuable comments and suggestions on the thesis draft, as well as for the experience of being a student of these three outstanding teachers. I would also like to thank the many faculty members of MSU who taught the courses I took. The coursework has greatly enriched my knowledge and provided the background and foundations for my thesis research. My PhD study could never have been completed without the help of my fellow graduate students at MSU. I would like to express my special thanks to my colleagues at our lab, Ali Yener Mutlu, Marcos Bolanos, and Suhaily Cardona, for their suggestions, help, and all the happy and tough times we have been through. I also want to thank Xi Lu, Lei Zhang, Shenglan Gao, Guanqun Zhang, Di Tang, Meng Cai, Ting Sun, Mingwu Gao, Jiankun Liu, Qiong Huo, Xiaochen Tang and Yuemin Jin for their kind help during my four years at MSU; they enriched both my study and my life. Finally, I would like to express my gratitude to my family. Their endless love and support always encourage me to deal with obstacles in every aspect of my life. In particular, I want to express my deepest gratitude to my dear husband, Jiayin, for his enduring love, encouragement, patience, and understanding.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1  INTRODUCTION
  1.1 Measures to quantify the causality
    1.1.1 Notations
    1.1.2 Granger causality based measures
    1.1.3 Information theoretic causality measures
    1.1.4 Directed information versus other causality measures
  1.2 EEG data
  1.3 Overview of the contributions
CHAPTER 2  QUANTIFICATION OF EFFECTIVE CONNECTIVITY BY DIRECTED INFORMATION
  2.1 Introduction
  2.2 Modified directed information
    2.2.1 Problems with the implementation of directed information
    2.2.2 Time-lagged directed information
    2.2.3 Modified directed information
    2.2.4 Modified directed information versus transfer entropy
    2.2.5 Order selection
    2.2.6 Normalization and significance testing
    2.2.7 Performance of modified directed information
  2.3 Application of DI to bivariate signal models
    2.3.1 Simulated signal models
    2.3.2 Biological data
  2.4 Estimation of directed information
    2.4.1 Estimation based on entropy estimation
    2.4.2 Estimation based on mutual information and multi-information
    2.4.3 Performance comparison of estimators
  2.5 Conclusions
CHAPTER 3  DIRECTED NETWORK INFERENCE BASED ON DIRECTED INFORMATION
  3.1 Introduction
  3.2 Conditional directed information
    3.2.1 Background
    3.2.2 Motivational example
    3.2.3 Computation of conditional directed information
    3.2.4 Modified conditional directed information
  3.3 Network inference algorithms
    3.3.1 Algorithm based on directed information and conditional directed information
    3.3.2 Algorithm based on time lag
    3.3.3 Algorithm based on modified time-lagged directed information and conditional directed information
    3.3.4 Validation
  3.4 Results
    3.4.1 Synthetic data: Linear network
    3.4.2 Synthetic data: Nonlinear network
  3.5 Discussions
    3.5.1 Problems with current algorithms
    3.5.2 Comparison with existing algorithms
  3.6 Conclusions
CHAPTER 4  COMMUNITY DETECTION FOR DIRECTIONAL NEURAL NETWORKS
  4.1 Introduction
  4.2 Background
    4.2.1 Modularity
    4.2.2 Overview of community detection methods
    4.2.3 Group analysis approaches
  4.3 Algorithm for community detection
    4.3.1 Algorithm for community detection in weighted directed networks
    4.3.2 Algorithm for community detection for multiple subjects
  4.4 Results
    4.4.1 Directed vs. undirected networks
    4.4.2 Group analysis on synthetic data
    4.4.3 Group analysis on EEG data
  4.5 Conclusions
CHAPTER 5  SUMMARY AND FUTURE WORK
  5.1 Summary
  5.2 Future work
APPENDIX A  CONDITIONAL DIRECTED INFORMATION IN TWO GENERAL TRIVARIATE MODELS
APPENDIX B  COMPUTATION OF TIME-LAGGED CONDITIONAL DIRECTED INFORMATION
BIBLIOGRAPHY

LIST OF TABLES

2.1 Performance comparison for single order linear model
2.2 Performance comparison for multi-order linear model
2.3 Performance comparison for nonlinear model
2.4 Electrode pairs in the region of interest with significant DI values
2.5 Performance comparison for linear model (128 realizations)
2.6 Performance comparison for linear model (1024 realizations)
2.7 Performance comparison for nonlinear model (128 realizations)
2.8 Performance comparison for nonlinear model (1024 realizations)
3.1 Average F-score for three proposed algorithms and PDC for linear network
3.2 Average F-score for three algorithms for nonlinear network
4.1 Average false discovery rate for group analysis methods
4.2 Comparison of modularity for CRN with random graphs
4.3 Comparison of modularity for ERN with random graphs

LIST OF FIGURES

2.1 Average information flow over 100 simulations for single order linear model computed using the original definition of DI, TLDI and MDI.
2.2 Modified DI for bivariate linear autoregressive model with multiple time lags.
2.3 Modified DI for nonlinear autoregressive model with multiple time lags.
2.4 Application of directed information and Granger causality to bivariate linear autoregressive model.
2.5 Application of directed information and Granger causality to bivariate nonlinear autoregressive model.
2.6 Application of directed information and Granger causality to linear mixing for both linear and nonlinear autoregressive models.
2.7 Application of directed information and Granger causality to single source model.
2.8 Application of directed information and Granger causality to two asymmetric coupled Lorenz oscillators.
2.9 Application of directed information and Granger causality to EEG data.
2.10 Averaged information flow for linear model using different estimators (128 realizations).
3.1 The DI value and time lag for both hub and chain models.
3.2 Algorithm 1: directed network inference based on DI and CDI.
3.3 Algorithm 2: directed network inference based on TLDI and TL.
3.4 Algorithm 3: directed network inference based on MDI and MCDI.
3.5 The performance of proposed algorithms for linear network inference.
3.6 The performance of proposed algorithms for nonlinear network inference.
4.1 Hierarchical optimization of modularity by Blondel et al. [1].
4.2 Algorithm 4: community detection of weighted networks.
4.3 Algorithm 5: community detection of multiple weighted networks.
4.4 Community detection of different representations of the network.
4.5 Average false discovery rate of the community detection algorithm for a simulated directed weighted network.
4.6 Applying the multi-subject community detection algorithm to 10 subjects for CRN.
4.7 Applying the multi-subject community detection algorithm to 10 subjects for ERN.
4.8 The information flow ratio of both response types and their corresponding random networks.

Chapter 1

INTRODUCTION

Complex networks abound in nature and engineering, ranging from neural networks and protein interaction networks in biology to social networks in sociology and the internet in communication [2, 3, 4]. A network is referred to as a complex network not only because of its size, but also because of the nature of the interactions (e.g. nonlinear) between its subsystems and the behavior of the individual network nodes (dynamic) [5]. The analysis of complex networks employs techniques from graph theory, machine learning, statistical physics, and signal processing [6, 7, 8]. Complex networks across a range of applications are found to have similar macroscopic behavior such as small-world topology and scale-free degree distributions. However, these properties are not sufficient for a comprehensive understanding of the network at an intermediate scale. There is an interest in understanding how these networks are structurally organized and how they change dynamically over time and frequency. In recent years, complex network analysis tools have been applied to neuroscience and neuroimaging studies and have resulted in a better understanding of the brain at a system level. In this dissertation, we focus on the analysis of the complex brain network using information-theoretic measures and attempt to gain some insights into the functionality of the brain. One common approach to analyzing the brain as a complex network is to describe it as a graph. Each node in the graph represents a particular region and the neuronal oscillations associated with it.
Different neuroimaging modalities, such as electroencephalogram (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), can be used to record brain activity and extract complex network characteristics of the human brain. However, compared to other neuroimaging modalities, EEG is able to record brain activity with higher temporal resolution and accuracy [9]. Once the neuroimaging recordings are collected, the first goal for brain network analysis is to determine the edges between the nodes, or connectivity, from the observation of multivariate time series. Three kinds of brain connectivity have been studied recently to determine and quantify the strength of the edges: anatomical connectivity, functional connectivity and effective connectivity [5]. Anatomical connectivity is the set of physical or structural connections linking neuronal units at a given time and can be obtained from measurements of the diffusion tensor [10]. Functional connectivity captures the statistical dependence between distributed and often spatially remote neuronal units by measuring their correlations in either the time or frequency domain. Effective connectivity describes the set of causal effects of one neural system over another [5, 11], which can reflect both the interaction and the direction of the information flow in the system. These three types of connectivity help in the understanding of the functional segregation and functional integration of the brain. Functional segregation refers to a cortical area of the brain specializing in some aspect of perceptual or motor processing, with this specialization anatomically segregated within the cortex [12]. When a function involves different specialized areas, the union of these areas is mediated by functional integration [12]. The assessment of functional integration is important to understand how different areas of the brain coordinate with each other for a particular task. Both functional connectivity and effective connectivity can be used to describe the functional integration of the brain. In fact, most of the research on the analysis of brain networks focuses on functional connectivity, where the edges correspond to pairwise correlations and result in undirected graphs, which is usually not sufficient to describe the actual neurological processes. On the contrary, as Friston pointed out, functional integration of the brain can be better understood through effective connectivity since it reflects the dynamic (activity dependent and time dependent) characteristics of a system. In this sense, the brain network can be well described by effective networks where the edges of the graph have direction and the corresponding association matrix is no longer symmetric. Therefore, we expect that using effective connectivity will reveal new topological characteristics of the brain. This dissertation focuses mainly on effective connectivity and the related network inference and community detection problems. Measures to quantify the effective connectivity can be categorized into three groups: dynamic causal modeling (DCM), Granger causality based measures, and information-theoretic measures. Dynamic causal modeling employs a generative model to explain how activity in one brain area is affected by activity in another by using differential equations in continuous time [11, 13]. The parameters of these equations encode the strength of connections and how they change with experimental factors.
DCM tries to find the best model that explains the data, but it requires a priori knowledge of the system, such as the input of the system and its hidden states. In addition, DCM is limited to networks of small size [12]. Granger causality is defined as a stochastic process X causing another process Y if the prediction of Y at the current time point, Y_n, is improved when taking into account the past samples of X. Different from DCM, measures based on Granger causality assume that the data reflect states that cause each other and capture the dependencies among the observations directly. Therefore, Granger causality based methods are more viable and can be applied directly to any time series without requiring knowledge about how the underlying data were generated. However, in practice, Granger causality is usually implemented within a linear framework, e.g. bivariate or multivariate autoregressive models, yielding methods such as the directed transfer function (DTF) and partial directed coherence (PDC) [14, 15, 16]. These methods are limited to capturing linear relations and suffer from the common problems of parametric models, such as determination of the order. However, EEG recordings are simultaneously recorded at different locations of the brain and are known to have nonlinear dependencies between recordings from different sites [17]. Measures that can address the issue of model dependency are needed. Recently, information theoretic tools [18, 19, 20], such as transfer entropy, directed transinformation, and directed information, address the issue of model dependency and have found numerous applications in neuroscience [21, 22, 23]. Transfer entropy (TE) proposed by Schreiber quantifies causality as the deviation of the observed data from the generalized Markov condition. Transfer entropy is based on a Markov assumption, and its performance depends on the estimation of transition probabilities, which requires the selection of the order or memory of the Markov processes X and Y [24]. Directed transinformation (DT) introduced by Saito [19] measures the information flow from the current sample of one signal to the future samples of another signal given the past samples of both signals. However, this measure does not discriminate between totally dependent and independent processes [25]. Recently, directed information proposed by Marko [26] and later re-formalized by Massey, Kramer, Tatikonda and others has attracted attention for quantifying directional dependencies [20, 26, 27, 28, 29]. Directed information theory has been mostly aimed towards the study of communication channels with feedback. In recent years, new theoretical developments motivated the use of this measure in quantifying causality between two time series. In particular, Amblard et al. [29] recently showed how directed information and Granger causality are equivalent for linear Gaussian processes and proved key relationships between existing causality measures and the directed information. Therefore, there has been a growing interest in applying this measure to applications in signal processing, neuroscience and bioinformatics. One major issue remaining with the application of directed information is the estimation and computation of directed information from a limited amount of data [20, 26, 27, 30]. Therefore, a simplified expression to reduce the dimensionality of DI estimation is needed.
In addition, the comparison of DI with existing measures, in particular the model dependent measures based on Granger causality, is needed to verify its effectiveness for the analysis of neuroscience data. Although directed information is effective at quantifying the relationship between pairs of neuronal populations, it is not sufficient to reveal the actual network structure. The DI value between two processes by itself cannot reflect the true structure of the network. A large DI value does not guarantee direct causality between two time series, i.e., one signal may affect the other through a third signal [31]. Therefore, we use the causal conditional directed information introduced by Kramer [27] to address this problem and propose multiple algorithms to infer the directed network. The inferred network can demonstrate the true effective connectivity between two processes and the system topology. In most applications, discovering the global topology of the network is not sufficient. Motifs that reflect the local organizational features of the network are also of interest and are studied using community detection and network classification methods [32, 33]. Previous work has shown that functional brain networks exhibit scale-free and small-world properties and have a hierarchical structure [34], and that the community structure of the human brain changes with age and the task at hand [35]. For example, Fair et al. showed that young children and young adults have different community structures in functional brain networks from the study of resting state fMRI data [35]. Similarly, Ferrarini et al. showed that the resting-state human brain has a hierarchical functional module structure [36], and Meunier et al. revealed age-related changes in the modular structure of human brain functional networks from fMRI [37]. Chavez et al. pointed out that the modular structure of the human brain provides important information on the functional organization of brain areas during normal and pathological neural activities [38]. Therefore, in order to discover the underlying organization of the network, the partition of the brain network into small functional modules is needed. Traditional clustering algorithms require a priori knowledge about the number of clusters [39]. Therefore, modularity based algorithms are widely used to choose the best partitions of a network by maximizing the modularity. In many studies, it is important to discover these functional modules across multiple subjects. In this dissertation, we extend a greedy algorithm, the Louvain method [1], to weighted and directed networks to find the functional communities of the brain across subjects.

1.1 Measures to quantify the causality

In this section, we will give a brief introduction to some popular Granger causality based and information-theoretic causality measures.

1.1.1 Notations

Before introducing the various measures to quantify the effective connectivity, we will first review some common notations and definitions that will be used throughout this dissertation. Let X = X^n = X_{1:n} = (X_1, \cdots, X_n) be a random process of length n and p(x_1, \cdots, x_n) = p(x^n) = p(x_{1:n}) be the joint probability of the random variables (X_1, \cdots, X_n). DX^n = X^{n-1} = (0, X_1, \cdots, X_{n-1}) will be used to denote the time-delayed version of the sequence X^n, which is also equivalent to X_{1:n-1}.
Given two continuous random variables X and Y, the mutual information (MI) is defined as follows:

I(X; Y) = \int \int p(x, y) \log \frac{p(x, y)}{p_x(x) p_y(y)} \, dx \, dy,   (1.1)

where p(x, y) is the joint probability density function (pdf) of X and Y, and p_x(x), p_y(y) are the marginal pdfs of X and Y, respectively. (All integrals are from -\infty to +\infty unless otherwise specified.) I(X; Y) \geq 0 with equality if and only if X and Y are independent [40]. In information theory, mutual information can be interpreted as the amount of uncertainty about X that can be reduced by the observation of Y, or the amount of information Y can provide about X, i.e., I(X; Y) = H(X) - H(X|Y). Since I(X; Y) \geq 0, H(X|Y) \leq H(X) with equality if and only if X and Y are independent, i.e., conditioning reduces entropy [40].

Mutual information has a natural generalization to multiple variables, defined as the multi-information (total correlation) [41]:

I_r[P_{1 \cdots r}(y_1, y_2, \cdots, y_r)] = \int \cdots \int P_{1 \cdots r}(y_1, y_2, \cdots, y_r) \log \frac{P_{1 \cdots r}(y_1, y_2, \cdots, y_r)}{p_1(y_1) \cdots p_r(y_r)} \, d^r y.   (1.2)

Multi-information captures more collective properties than just the pairwise relations quantified by mutual information. The relationship between multi-information and mutual information is as follows [42]:

I(X^N, Y^N) = I(X^N; Y^N) + I(X^N) + I(Y^N),   (1.3)

where I(X^N, Y^N) is the multi-information between the 2N random variables X_1, \cdots, X_N, Y_1, \cdots, Y_N; I(X^N) and I(Y^N) are the multi-information between the N random variables X_1, \cdots, X_N and Y_1, \cdots, Y_N, respectively; and I(X^N; Y^N) is the mutual information between the two random vectors of length N.

For any three random variables X, Y and Z, if the conditional distribution of Z depends only on Y and is conditionally independent of X, i.e. p(z|xy) = p(z|y), then X, Y and Z are said to form a Markov chain, denoted by X \to Y \to Z. In this case, the conditional mutual information between X and Z given Y, defined as I(X; Z|Y) = H(Z|Y) - H(Z|X, Y), is equal to 0 [40].

1.1.2 Granger causality based measures

Granger causality is widely used to describe the causality between two time series. It is defined as a stochastic process X causing another process Y if the prediction of Y at the current time point, Y_n, is improved when taking into account the past samples of X. This approach is appealing but gives rise to many questions on how to apply this definition to real data [43]. Granger causality has commonly been implemented within a linear prediction framework using a bivariate autoregressive model. In this framework, the improvement in predicting Y_n is assessed by the change in the variances of the prediction errors when the signals are fitted by univariate and bivariate autoregressive models. For the two univariate signal models,

X(n) = \sum_{i=1}^{p_1} \alpha_{x_i} X(n-i) + \eta_x(n),   (1.4)

Y(n) = \sum_{i=1}^{p_2} \beta_{y_i} Y(n-i) + \eta_y(n),   (1.5)

where p_1 and p_2 are the orders of the random processes X and Y, respectively, \alpha_{x_i} and \beta_{y_i} are the autoregressive coefficients, and \eta_x (\eta_y) are the noise processes, the prediction of the current sample Y_n depends only on the past samples of Y itself. For the bivariate AR model,

X(n) = \sum_{i=1}^{p_1} \alpha_{x_i} X(n-i) + \sum_{i=1}^{p_3} \gamma_{x_i} Y(n-i) + \eta_x(n),   (1.6)

Y(n) = \sum_{i=1}^{p_2} \beta_{y_i} Y(n-i) + \sum_{i=1}^{p_4} \gamma_{y_i} X(n-i) + \eta_y(n),   (1.7)

the prediction of each signal depends on the past samples of both signals.
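As a concrete illustration of the prediction-error comparison described above, the sketch below simulates a first-order instance of the bivariate model (1.6)-(1.7) in which X drives Y but not vice versa, and then fits the univariate and bivariate predictors of Y by least squares. The coefficients, model orders, and noise levels are illustrative assumptions and are not values used elsewhere in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# First-order instance of (1.6)-(1.7): X drives Y, Y does not drive X.
x = np.zeros(N)
y = np.zeros(N)
for n in range(1, N):
    x[n] = 0.5 * x[n - 1] + rng.standard_normal()
    y[n] = 0.4 * y[n - 1] + 0.6 * x[n - 1] + rng.standard_normal()

def residual_var(target, regressors):
    """Least-squares AR fit; returns the variance of the prediction error."""
    A = np.column_stack(regressors)
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.var(target - A @ coeffs)

# Univariate model (1.5): predict Y(n) from its own past only.
var_uni = residual_var(y[1:], [y[:-1]])
# Bivariate model (1.7): predict Y(n) from the past of both Y and X.
var_bi = residual_var(y[1:], [y[:-1], x[:-1]])

print(f"var(Y_n | Y^(n-1))         = {var_uni:.3f}")
print(f"var(Y_n | X^(n-1) Y^(n-1)) = {var_bi:.3f}")  # smaller: X's past improves the prediction
```

The reduction in the prediction error variance when the past of X is included is exactly what the Granger causality measure defined next quantifies.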
Granger employs the variance to evaluate the improvement of the prediction, and the Granger causality from X to Y can be quantified as:

G_{X \to Y} = \ln \left( \frac{\mathrm{var}(Y_n | Y^{n-1})}{\mathrm{var}(Y_n | X^{n-1} Y^{n-1})} \right),   (1.8)

where if \mathrm{var}(Y_n | Y^{n-1}) > \mathrm{var}(Y_n | X^{n-1} Y^{n-1}), X causes Y. If the past of X does not improve the prediction of Y_n, G_{X \to Y} is close to zero. The original definition of Granger causality is quantified in the time domain and is limited to bivariate models. Later, researchers introduced multivariate autoregressive (MVAR) models for the analysis of multiple simultaneously recorded time series and proposed methods such as the directed transfer function (DTF), partial directed coherence (PDC) and directed partial correlation [44, 14, 16] to quantify Granger causality both in the time and the frequency domains in a multivariate setting. We will only briefly introduce PDC here since it is the most widely used measure and will be used for comparison with DI for network inference in Chapter 3.

Partial directed coherence

Consider an m-dimensional MVAR process with order p as follows:

[X_1(n), X_2(n), \cdots, X_m(n)]^T = \sum_{r=1}^{p} A_r \, [X_1(n-r), X_2(n-r), \cdots, X_m(n-r)]^T + [u_1(n), u_2(n), \cdots, u_m(n)]^T,   (1.9)

where u_i(n) with i = 1, \cdots, m represents independent Gaussian white noise with covariance matrix \Sigma, and A_r with r = 1, \cdots, p is the m \times m coefficient matrix. The PDC measure from signal j to signal i is given by:

\pi_{i,j}(f) = \frac{\bar{a}_{i,j}(f)}{\sqrt{\bar{a}_j^H(f) \, \bar{a}_j(f)}},   (1.10)

where \bar{a}_{i,j}(f) is the (i, j)th entry of \bar{A}(f) = I - A(f) = [\bar{a}_1(f), \cdots, \bar{a}_m(f)], A(f) is the Fourier transform of the coefficients, and \bar{a}_j^H(f) is the Hermitian transpose of \bar{a}_j(f). The computation of \pi_{i,j}(f) relies on the parameters of the MVAR model, and thus the performance of PDC depends on how well the MVAR model fits the signal, which requires a proper choice of the order p of the model and enough time samples to estimate the parameters. Overall, all of these measures, e.g. Granger causality and PDC, are limited to capturing linear relations or require a priori knowledge about the underlying signal models [23].

1.1.3 Information theoretic causality measures

Mutual information can be extended to random vectors or sequences X^N and Y^N as I(X^N; Y^N), where I(X^N; Y^N) = H(X^N) - H(X^N | Y^N) = H(Y^N) - H(Y^N | X^N). However, mutual information is a symmetric measure and does not reveal any directionality or causality between two random sequences. Information theoretic tools [18, 19, 20], such as transfer entropy, directed transinformation and directed information, address the issue of model dependency and evaluate the prediction improvement through 'information (entropy)' directly.

Transfer entropy

Transfer entropy (TE) proposed by Schreiber computes causality as the deviation of the observed data from the generalized Markov condition and is defined as [18],

T_{X \to Y}^n = \sum_{y_n, y_{n-l:n-1}, x_{n-m:n-1}} p(y_n, y_{n-l:n-1}, x_{n-m:n-1}) \log \frac{p(y_n | y_{n-l:n-1}, x_{n-m:n-1})}{p(y_n | y_{n-l:n-1})},   (1.11)

where m and l are the orders (memory) of the Markov processes X and Y, respectively, and p(y_n, y_{n-l:n-1}, x_{n-m:n-1}) is the joint probability of the random variables (Y_n, Y_{n-l:n-1}, X_{n-m:n-1}).
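A minimal plug-in illustration of (1.11) for first-order memories (l = m = 1) is sketched below, assuming the data have been binarized so that the required probabilities can be estimated by simple counting. The coupled signal model, the sign-based binarization, and the sample size are illustrative assumptions, not the estimators used later in this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Illustrative coupled pair: X is i.i.d. noise, Y depends on its own past and on X(n-1).
x = rng.standard_normal(N)
y = np.zeros(N)
for n in range(1, N):
    y[n] = 0.3 * y[n - 1] + 0.8 * x[n - 1] + 0.5 * rng.standard_normal()

# Binarize so that the probabilities in (1.11) can be estimated by counting.
xb, yb = (x > 0).astype(int), (y > 0).astype(int)

def transfer_entropy(src, dst):
    """Plug-in estimate (in bits) of T_{src->dst} in (1.11) with l = m = 1."""
    trip = np.stack([dst[1:], dst[:-1], src[:-1]], axis=1)  # rows are (Y_n, Y_{n-1}, X_{n-1})
    te = 0.0
    for row, cnt in zip(*np.unique(trip, axis=0, return_counts=True)):
        yn, yp, xp = row
        p_joint = cnt / len(trip)                                           # p(y_n, y_{n-1}, x_{n-1})
        p_full = cnt / np.sum((trip[:, 1] == yp) & (trip[:, 2] == xp))      # p(y_n | y_{n-1}, x_{n-1})
        p_own = np.sum((trip[:, 0] == yn) & (trip[:, 1] == yp)) / np.sum(trip[:, 1] == yp)  # p(y_n | y_{n-1})
        te += p_joint * np.log2(p_full / p_own)
    return te

print("TE X->Y:", round(transfer_entropy(xb, yb), 4))  # clearly positive
print("TE Y->X:", round(transfer_entropy(yb, xb), 4))  # near zero, since X ignores Y
```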
When n > \max(l, m), transfer entropy can be expressed in terms of mutual information as follows,

T_{X \to Y}^n = \sum p(y_n, y_{n-k:n-1}, x_{n-l:n-1}) \log \frac{p(y_n | y_{n-k:n-1}, x_{n-l:n-1})}{p(y_n | y_{n-k:n-1})}
            = \sum p(y_n, y_{n-k:n-1}, x_{n-l:n-1}) \log \frac{p(x_{n-l:n-1}, y_n | y_{n-k:n-1})}{p(x_{n-l:n-1} | y_{n-k:n-1}) \, p(y_n | y_{n-k:n-1})}   (1.12)
            = I(X_{n-l:n-1}; Y_n | Y_{n-k:n-1}),

where the last equality follows from the definition of conditional mutual information, i.e. I(X; Y|Z) = \sum p(xyz) \log \frac{p(xy|z)}{p(x|z) p(y|z)}. The relationship between transfer entropy and conditional mutual information shown in the above equation has also been verified in [29]. It is important to note that transfer entropy is usually defined for a physical recording system; therefore, instantaneous information exchange is not considered. In addition, the definition of TE implies a stationary Markov assumption for the underlying system such that the state of Y_n only depends on the past l states of itself and the past m states of the process X, i.e., p(y_n | y_{1:n-1}, x_{1:n-1}) = p(y_n | y_{n-l:n-1}, x_{n-m:n-1}).

Directed transinformation

Directed transinformation (T) introduced by Saito [19] measures the information flow from the current sample of one signal to the future samples of another signal given the past samples of both signals. Directed transinformation is defined as,

DT(X \to Y) = \sum_{n=1}^{N} I(X_n; Y_{n+1:n+F} | X_{n-P:n-1} Y_{n-P:n-1} Y_n),   (1.13)

where Y_{n+1:n+F} = (Y_{n+1}, \cdots, Y_{n+F}) are the F future samples of Y, and the values of F and P change with the current time sample n, constrained by F + P + 1 = N where N is the length of the signal. Different from the previously introduced measures, directed transinformation measures the influence of the current sample of X on the future samples of Y. The definition of directed transinformation does not make any assumptions about the underlying model for the interactions. However, the computation of each term of the above equation requires the information of the whole time series, i.e. the joint probability estimation of 2N random variables, which is computationally very complex. In addition, directed transinformation cannot discriminate between independent and identical processes since its value is equal to 0 in both cases [25].

Directed information

Massey addressed the issue of symmetry for mutual information by defining the directed information from a length-N sequence X^N = (X_1, \cdots, X_N) to Y^N = (Y_1, \cdots, Y_N) [20] as follows:

DI(X^N \to Y^N) = H(Y^N) - H(Y^N \| X^N) = \sum_{n=1}^{N} I(X^n; Y_n | Y^{n-1}),   (1.14)

where H(Y^N \| X^N) is the entropy of the sequence Y^N causally conditioned on the sequence X^N, defined as:

H(Y^N \| X^N) = \sum_{n=1}^{N} H(Y_n | Y^{n-1} X^n),   (1.15)

which differs from H(Y^N | X^N) = \sum_{n=1}^{N} H(Y_n | Y^{n-1} X^N) in that X^n replaces X^N in each term on the right-hand side of equation (1.15), i.e. only the causal influence of the time series X up to the current time sample n on the process Y is considered.
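To make definition (1.14) concrete, the sketch below estimates DI for short binary sequences by plug-in estimation of the conditional mutual information terms across many independent realizations. The channel model (Y_n equal to X_{n-1} flipped with probability 0.1), the sequence length, and the number of realizations are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
R, N = 200_000, 3          # R realizations of length-N binary sequences

# Illustrative one-sample-delay channel: Y_n equals X_{n-1} flipped with probability 0.1.
X = rng.integers(0, 2, size=(R, N))
flip = rng.random((R, N)) < 0.1
Y = np.zeros_like(X)
Y[:, 0] = rng.integers(0, 2, size=R)                      # Y_1 is independent noise
Y[:, 1:] = np.where(flip[:, 1:], 1 - X[:, :-1], X[:, :-1])

def H(cols):
    """Plug-in joint entropy (bits) of the columns in `cols` (an R x k array)."""
    if cols.shape[1] == 0:
        return 0.0
    _, counts = np.unique(cols, axis=0, return_counts=True)
    p = counts / cols.shape[0]
    return -np.sum(p * np.log2(p))

def directed_information(A, B):
    """Plug-in DI(A^N -> B^N) = sum_n I(A^n; B_n | B^{n-1}), as in (1.14)."""
    di = 0.0
    for n in range(1, A.shape[1] + 1):
        An, Bn, Bp = A[:, :n], B[:, n - 1:n], B[:, :n - 1]
        di += (H(np.hstack([An, Bp])) + H(np.hstack([Bn, Bp]))
               - H(np.hstack([An, Bn, Bp])) - H(Bp))
    return di

print("DI(X -> Y):", round(directed_information(X, Y), 3))  # clearly positive
print("DI(Y -> X):", round(directed_information(Y, X), 3))  # near zero
```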
An alternative definition of the directed information was proposed by Tatikonda in terms of the Kullback-Leibler (KL) divergence [28]. It shows that the difference between mutual information and directed information is the introduction of feedback in the definition of directed information [20, 29, 28]. Mutual information and directed information expressed in terms of the KL divergence are written as:

I(X^N; Y^N) = D_{KL}\big(p(x^N, y^N) \,\|\, p(x^N) p(y^N)\big),   (1.16)

DI(X^N \to Y^N) = D_{KL}\big(p(x^N, y^N) \,\|\, \overleftarrow{p}(x^N | y^N) p(y^N)\big),   (1.17)

where \overleftarrow{p}(x^N | y^N) = \prod_{n=1}^{N} p(x_n | x^{n-1} y^{n-1}) is the feedback factor influenced by the feedback in the system, i.e., the probability that the input X at the current time is influenced by the past values of both itself and Y. If there is no feedback, then p(x_n | x^{n-1} y^{n-1}) = p(x_n | x^{n-1}) and \overleftarrow{p}(x^N | y^N) = p(x^N). In fact, p(x^N, y^N) = \overleftarrow{p}(x^N | y^N) \, \overrightarrow{p}(y^N | x^N), where \overrightarrow{p}(y^N | x^N) = \prod_{n=1}^{N} p(y_n | x^n y^{n-1}) is defined as the feedforward factor affected by the memory of the system. If the system is memoryless, then p(y_n | x^n y^{n-1}) = p(y_n | x_n).

Entropy and mutual information are extensive quantities, which grow with the length of the signal. Thus, Shannon introduced the entropy rate for stochastic processes. The entropy rate of a stochastic process \{X_i\} is defined as H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \cdots, X_n). In addition, when \{X_i\} is a stationary stochastic process, the entropy rate is also the limit of the conditional entropy [40], i.e.,

\lim_{n \to \infty} \frac{1}{n} H(X_1, \cdots, X_n) = \lim_{n \to \infty} H(X_n | X_1, \cdots, X_{n-1}).   (1.18)

When dealing with discrete valued processes, one can establish that, assuming stationarity, the directed information rate can be written as [27, 29],

DI_\infty(X^N \to Y^N) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} I(X^n; Y_n | Y^{n-1}) = \lim_{N \to \infty} I(X^N; Y_N | Y^{N-1}),   (1.19)

which can further be decomposed into two parts as:

DI_\infty(X^N \to Y^N) = \lim_{N \to \infty} I(X_{1:N-1}; Y_N | Y_{1:N-1}) + \lim_{N \to \infty} I(X_N; Y_N | X_{1:N-1}, Y_{1:N-1})
                       = DI_\infty(DX^N \to Y^N) + DI_\infty(X^N \to Y^N \| X^{N-1}),   (1.20)

where DI_\infty(DX^N \to Y^N) is the directed information rate from the delayed version of X^N to Y^N, and DI_\infty(X_{1:N} \to Y_{1:N} \| X_{1:N-1}) is the instantaneous information exchange rate.

1.1.4 Directed information versus other causality measures

Directed information has been mostly aimed towards the study of communication channels with feedback. In recent years, new theoretical developments motivated the use of this measure in quantifying causality between two time series. Amblard et al. [29] recently showed how directed information and Granger causality are equivalent for linear Gaussian processes and proved key relationships between existing causality measures and the directed information. Based on Granger's definition of causality, Geweke introduced Geweke's indices to quantify the causal linear dependencies under Gaussian assumptions [45]. Amblard et al. proved that the directed information rate and Geweke's indices are equal for Gaussian processes [29], as indicated by,

DI_\infty(DX^N \to Y^N) = \frac{1}{2} \log \frac{\varepsilon_\infty^2(Y_N | Y^{N-1})}{\varepsilon_\infty^2(Y_N | Y^{N-1} X^{N-1})} = F_{X^N \to Y^N},   (1.21)

where N is the length of the signal, DI_\infty(DX^N \to Y^N) is the directed information rate from the delayed version of X^N to Y^N, \varepsilon_\infty^2(Y_N | Y^{N-1}) = \lim_{N \to \infty} \varepsilon^2(Y_N | Y^{N-1}) is the asymptotic variance of the prediction residue when predicting Y_N from the observation of Y^{N-1}, and F_{X^N \to Y^N} is the linear feedback measure from the random process X^N to Y^N defined by Geweke [29]. Moreover, directed information and Granger's approach are equivalent for multivariate time series in the case of Gaussian distributions [29]. In addition, Amblard et al. proved that for a stationary process, without considering the instantaneous information exchange, the directed information rate is equal to DI_\infty(DX^N \to Y^N) and is equivalent to the transfer entropy when l = m = n - 1 in equation (1.11).
Al-khassaweneh et al. derived the relationship between directed information and directed transinformation, showing that DT(X \to Y) - DT(Y \to X) = DI(X \to Y) - DI(Y \to X), which indicates that both measures reveal the same information about the difference of the information flow in the two directions [25]. However, compared to other measures, DI has several advantages. First, different from Granger causality implemented in an AR setting, DI is a model-free measure and can quantify both linear and nonlinear directional information flow. Second, transfer entropy and directed transinformation are equal to 0 for both independent and identical processes, so they cannot discriminate between these two types of processes. For DI, when the two processes are independent, DI = 0; and when the two processes are identical, DI = H(Y_n | Y^{n-1}). Therefore, DI can discriminate between totally dependent and independent processes. On the other hand, when we consider two time series without instantaneous information exchange, DI is equal to DI(DX \to Y) and cannot discriminate between identical and independent processes either. However, in this case the two processes will never be identical, so this limitation is not a concern in practice. Overall, DI can be applied to any class of signals without assumptions about the signal model (e.g. stationarity) or the interactions between signals (e.g. linear). Therefore, in this dissertation, we focus on the estimation and computation of DI, and apply this measure to the network inference and community detection problems encountered in multichannel EEG recordings.

1.2 EEG data

With the advance of neuroimaging technology, EEG is able to record brain activity with higher temporal resolution and accuracy than ever before. In this dissertation, we analyze the brain network based on EEG data provided by Dr. Jason Moser from the Department of Psychology at Michigan State University. Here we give a brief overview of the methods used for EEG data collection, which include subject recruitment, the task, and data reduction.

Participants

EEG data from ten undergraduates were drawn from an ongoing study of relationships between the error-related negativity (ERN) and individual differences (participants for the present analysis were drawn from samples reported on in [46, 47]). ERN is a brain potential response that occurs following performance errors in a speeded reaction time task [48]. All participants retained for analysis made at least six errors for the computation of stable ERNs, as in [49]. No participants discontinued their involvement once the experiment had begun.

Task

Participants completed a letters version of the Eriksen Flanker task [48]. Stimuli were presented on a Pentium® Dual Core computer, using Presentation software (Neurobehavioral Systems, Inc.) to control the presentation and timing of stimuli, the determination of response accuracy, and the measurement of reaction times. During the task, participants were presented with a string of five letters. Each five-letter string was either congruent (e.g. FFFFF) or incongruent (e.g. EEFEE), and participants were required to respond to the center letter (target) via the left or right mouse button. Trial types were varied randomly such that 50% of the trials were congruent. Letters were displayed in a standard white font on a black background and subtended 1.3° of visual angle vertically and 9.2° horizontally.
A standard fixation mark (+) was presented during the inter-trial interval (ITI). Each trial began with the presentation of the flanking letters (i.e. EE EE). Flanking letters remained on the screen for 35 ms and were followed by the target (i.e. EEFEE), which remained for 100 ms (135 ms total presentation time). Each trial was followed by a variable ITI (1200-1700 ms). The entire experimental session consisted of 480 trials grouped into six blocks of 80 trials each. The letters constituting each string were varied between blocks (e.g., M and N in block 1 and E and F in block 2), and response mappings were reversed at the midpoint of each block (e.g., left mouse-button click for M through 40 trials of block 1, then right mouse-button click for M for the last 40 trials of block 1) in order to elicit a sufficient number of errors for ERN calculation.

Psychophysiological Data Recording, Reduction and Analysis

Continuous electroencephalographic (EEG) activity was recorded by 64 Ag-AgCl electrodes placed in accordance with the 10/20 system. Electrodes were fitted in a BioSemi (BioSemi, Amsterdam, The Netherlands) stretch-lycra cap. In addition, two electrodes were placed on the left and right mastoids. The electro-oculogram (EOG) generated by eye movements and blinks was recorded by FP1, as well as by electrodes placed below the right eye and on the left and right outer canthi, all approximately 1 cm from the pupil. During data acquisition, the Common Mode Sense active electrode and Driven Right Leg passive electrode formed the ground, as per BioSemi's design specifications. All bioelectric signals were digitized at 512 Hz using ActiView software (BioSemi). Offline analyses were performed using BrainVision Analyzer 2 (Brain Products, Gilching, Germany). Scalp electrode recordings were re-referenced to the numeric mean of the mastoids and band-pass filtered with cutoffs of 0.1 and 30 Hz (12 dB/oct rolloff). Ocular artifacts were then corrected using the regression method developed by Gratton et al. [50]. Response-locked data were segmented into individual epochs beginning 200 ms prior to the response and continuing for 1000 ms. Individual trials were rejected on the basis of excessive physiological activity: a voltage step exceeding 50 µV between contiguous sampling points, a voltage difference of more than 200 µV within a trial, or a maximum voltage difference of less than 0.5 µV within a trial. Finally, the response-locked EEG was averaged across trials to yield error- and correct-trial ERPs for each site.

1.3 Overview of the contributions

The contributions of this dissertation can be divided into three parts: computation and estimation of directed information, directed network inference using directed information and conditional directed information, and community detection for multiple weighted directed networks. In Chapter 2, the computation and estimation of directed information are realized through simplification of the definition of DI. The major contributions of this work can be summarized as follows:

1. Present the time-lagged directed information and modified directed information to reduce the computational complexity of computing DI while still quantifying the causal dependencies. Prove the relationship between the modified directed information and transfer entropy. Evaluate the performance of modified DI for quantifying causality for various realistic signal models with linear, nonlinear, and dynamic interactions.
2. Introduce a new directed information estimation method based on multi-information. Provide a quantitative comparison of various DI estimation methods.

In the second part of the proposed research, network inference algorithms based on directed information and conditional directed information are introduced with the following contributions:

1. Derive time-lagged conditional directed information and modified conditional directed information to reduce the computational complexity of estimating directed relationships from real data.

2. Propose three network inference algorithms for linear and nonlinear network inference and evaluate their performances on simulated network models.

In the third part of the proposed research, an improved community detection algorithm for weighted and directed networks is introduced with the following contributions:

1. Extend a hierarchical community detection algorithm from undirected networks to the directed case for identifying the modules in the effective brain network.

2. Propose a group analysis method to obtain a common community structure across subjects.

3. Evaluate the performance of the proposed community detection algorithm on both simulated and real EEG data sets for understanding the organization of the effective connectivity networks in the brain.

Chapter 2

QUANTIFICATION OF EFFECTIVE CONNECTIVITY BY DIRECTED INFORMATION

2.1 Introduction

In this chapter, we will focus on the quantification of effective connectivity to get a better understanding of the functional influences in the brain. The main approaches used to quantify the effective connectivity between two time series are model based measures and information-theoretic measures [51]. Granger causality based methods and dynamic causal modeling [13] are two widely used model based measures. Granger causality is a widely used measure to describe the causality between two time series. It defines a stochastic process X as causing another process Y if the prediction of Y at the current time point, Y_n, is improved when taking into account the past samples of X. This approach is appealing but gives rise to many questions on how to apply this definition to real data [43]. Granger causality has mostly been applied within a linear prediction framework using a multivariate autoregressive (MVAR) model, yielding methods such as the directed transfer function (DTF), partial directed coherence (PDC) and directed partial correlation [44, 14, 16, 52]. For example, Hesse et al. applied time-varying Granger causality to EEG data and found that a conflict situation generates directional interactions from posterior to anterior cortical sites [14]. Kaminski et al. applied DTF to EEG recordings of the human brain during stage 2 sleep and located the main source of causal influence [16]. Schelter et al. employed PDC on EEG recordings from a patient suffering from essential tremor [53]. Extensions of Granger-causality based methods, such as kernel Granger causality, generalized PDC (gPDC) and extended PDC (ePDC), have also found numerous applications in neuroscience [54, 55, 56]. However, Granger-causality based methods, especially those developed from MVAR models, are limited to capturing linear relations or require a priori knowledge about the underlying signal models [23]. These
However, DCM requires a priori knowledge about the input to the system [13, 57] and is limited to a network with small size [12]. Thus, a model-free measure detecting both linear and nonlinear relationships is desired. Information theoretic tools [18, 19, 20], such as transfer entropy [18], address the issue of model dependency and have found numerous applications in neuroscience [21, 22, 23]. Transfer entropy (TE) proposed by Schreiber computes causality as the deviation of the observed data from the generalized Markov condition. Sabesan et al. employed TE to identify the direction of information flow for the intracranial EEG data and suggested that transfer entropy plays an important role in epilepsy research [24]. Wibral et al. applied TE to magnetoencephalographic data to quantify the information flow in cortical and cerebellar networks [58]. Vicente et al. extended the definition of TE and measured the information flow from X to Y by introducing a general time delay u and showed that TE has a better performance in detecting the effective connectivity for nonlinear interactions and signals affected by volume conduction such as real EEG/MEG recordings compared to linear methods [57]. The performance of transfer entropy depends on the estimation of transition probabilities, which requires the selection of order or memory of the Markov processes X and Y [24]. Directed transinformation (DT) introduced by Saito [19] measures the information flow from the current sample of one signal to the future samples of another signal given the past samples of both signals. Hinrichs et al. used this measure to analyze causal interactions in event related EEG-MEG experiments [23]. However, this measure does not discriminate between totally dependent and independent processes [25]. Recently, directed information proposed by Marko [26] and later re-formalized by Massey, Kramer, Tatikonda and others have attracted attention for quantifying directional dependencies [20, 26, 27, 28, 29]. Directed information theory has been mostly aimed towards the study of communication 19 channels with feedback. In recent years, new theoretical developments motivated the use of this measure in quantifying causality between two time series. In particular, Amblard et al. [29] recently showed how directed information and Granger causality are equivalent for linear Gaussian processes and proved key relationships between existing causality measures and the directed information. Therefore, there has been a growing interest in applying this measure to applications in signal processing, neuroscience and bioinformatics. For example, it has been successfully used to infer genomic networks [3] and to quantify effective connectivity between neural spike data in neuroscience [4, 29, 59]. In order to detect both linear and nonlinear relationships, in this chapter, we propose directed information as a powerful measure to quantify the effective connectivity in the brain. The theoretical advantages of DI over existing measures have been noted in literature [4, 29, 59]. However, until now the implementation and benefits of using DI for capturing the effective connectivity in the brain through neurophysiological data have not been illustrated thoroughly and formally. We will mainly address three issues in this chapter. First, one major issue with the application of directed information is the practical computation from limited data. 
Current studies of directed information focus on stationary Gaussian processes due to the fact that the DI of a Gaussian process can be easily obtained even with longer time series and limited sample sizes [30]. However, most complex systems are nonlinear and not all nodes of the network follow Gaussian distributions. In this case, directed information can be expressed in terms of mutual information or joint entropy, and estimators such as k-nearest neighbor based methods or m-spacing based estimators can be used. However, when the length of the signal increases, the computational complexity and the bias of the DI estimator increase immensely. Therefore, a simplified expression for DI to reduce the dimensionality of estimation is needed. In previous work, the DI measure was applied either to limited time series such as every two time samples [2] or to a known parametric signal model to overcome this problem [4]. In this chapter, we show that applying directed information to short-time windows such as every two time samples may lose most of the causal dependencies between two random processes. In order to address this issue, we propose the modified directed information to simplify the expression of DI and reduce the computational complexity while still quantifying the causal dependencies. In addition, we prove some key relationships between transfer entropy and the modified directed information. Second, since DI can be expressed in terms of entropy or mutual information, current applications, such as genomic network inference and neural network inference, compute DI using either entropy or mutual information based estimators [3, 60, 31]. However, traditional joint entropy estimation methods for multiple random variables are inaccurate and inefficient when the data space is sparse. In order to overcome the inadequacy of entropy estimators for DI estimation, we introduce an alternative representation of DI in terms of multi-information, which can be estimated by extending the mutual information estimator proposed by Darbellay [61] to multiple random variables using an adaptive partitioning of the observed space [42]. Moreover, different applications put different constraints on the estimation methods, so it is important to evaluate the performance of the estimators in terms of bias and variance, and to select an appropriate DI estimator for different systems or applications. In this chapter, we offer an extensive analysis of different DI estimation methods in terms of bias, variance, computational efficiency and discrimination power through simulations. Finally, once the problems of computing and estimating DI have been addressed, we focus on the application of DI to EEG data. Because of the relationship between Granger causality and directed information, in this chapter, we will also compare the performance of these two measures and investigate the advantage of DI over Granger-causality based model measures. Theoretical developments have only proved the equivalence between these two measures for the case where the time series are Gaussian distributed in a linear model. However, to date there has not been much work that compares the actual performance of DI and Granger causality based measures for realistic signal models, including both linear and nonlinear interactions. This chapter addresses this issue by evaluating the performance of DI and Granger causality based methods under a common framework without making any assumptions about the data distribution.
In this chapter, we first illustrate the problems related to the computation of DI. We then propose a modified directed information measure that simplifies the DI computation by reducing the order of the joint entropy terms while still quantifying the causal dependencies. In addition, we provide a DI estimator based on multi-information. We then evaluate the performance of DI for quantifying the effective connectivity for linear and nonlinear autoregressive models, linear mixing models, single source models and dynamic chaotic oscillators in comparison to existing causality measures, in particular with Granger causality. Finally, we apply our method to multichannel EEG data to detect the effective connectivity in the brain.

2.2 Modified directed information

2.2.1 Problems with the implementation of directed information

To apply directed information to real data, the first issue we need to solve is the computation of DI. According to the definition of DI in equation (1.14), in practice we need to estimate the conditional mutual information I(X^n; Y_n | Y^{n-1}), where n = 1, \cdots, N. In fact, I(X^n; Y_n | Y^{n-1}) quantifies the causal information flow from X to Y at time point n, since I(X^n; Y_n | Y^{n-1}) = DI(X^n \to Y^n) - DI(X^{n-1} \to Y^{n-1}). In addition, as n \to \infty, I(X^n; Y_n | Y^{n-1}) is the directed information rate for a stationary process. I(X^n; Y_n | Y^{n-1}) can be expanded using entropy, mutual information, and multi-information. Therefore, DI can be expressed:

• In terms of entropy as:

DI(X^N \to Y^N) = \sum_{n=1}^{N} [(H(Y^n) - H(Y^{n-1})) - (H(X^n Y^n) - H(X^n Y^{n-1}))]
               = \sum_{n=1}^{N} [H(X^n Y^{n-1}) - H(X^n Y^n)] + H(Y^N).   (2.1)

• In terms of mutual information as:

DI(X^N \to Y^N) = \sum_{n=1}^{N} [I(X^n; Y^n) - I(X^n; Y^{n-1})].   (2.2)

• In terms of multi-information as:

DI(X^N \to Y^N) = \sum_{n=1}^{N} [(I(X^n, Y^n) - I(X^n, Y^{n-1})) - (I(Y^n) - I(Y^{n-1}))]
               = \sum_{n=1}^{N} [I(X^n, Y^n) - I(X^n, Y^{n-1})] - I(Y^N).   (2.3)

From the above equations, we can observe that the computation of DI requires the estimation of joint probabilities of high dimensional random variables over time. If X_n and Y_n are normally distributed, the joint entropy can be estimated based on the covariance matrices as follows:

DI(X^N \to Y^N) = \sum_{n=1}^{N} \frac{1}{2} \log \frac{|cov(X_1 \cdots X_n Y_1 \cdots Y_{n-1})| \, |cov(Y_1 \cdots Y_n)|}{|cov(X_1 \cdots X_n Y_1 \cdots Y_n)| \, |cov(Y_1 \cdots Y_{n-1})|},   (2.4)

where |cov(X_1, X_2, \cdots, X_n)| is the determinant of the covariance matrix of the n random variables X_1, X_2, \cdots, X_n and N is the length of the signal. The complexity of computing the original definition of DI through equation (2.4) is O(N^4) (using LU decomposition [62]). However, for EEG data, the distribution is usually not Gaussian. Non-parametric entropy and mutual information estimators, such as the plug-in estimator, the m-spacing estimator, and the Kozachenko-Leonenko (KL) estimator, can be used to estimate DI [63, 61]. When the length of the signal increases, the computational complexity, the bias, and the variance of these estimators increase immensely with limited sample sizes. Methods that can reduce the dimension and simplify the computation of DI are needed. In order to simplify the estimation of DI, we first clarify the connection between the definition of DI used in information theory and the definition as it applies to physical time series.
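For jointly Gaussian data, equation (2.4) can be evaluated directly from sample covariance matrices; the following is a minimal Python sketch under that assumption, with realizations stored as (trials x N) arrays (the function names gaussian_di and _logdet_cov are illustrative, not part of the dissertation's implementation):

    import numpy as np

    def _logdet_cov(a):
        # log-determinant of the sample covariance of the columns of a;
        # returns 0 for an empty set of variables (|cov| = 1 by convention)
        if a.shape[1] == 0:
            return 0.0
        c = np.atleast_2d(np.cov(a, rowvar=False))
        return np.linalg.slogdet(c)[1]

    def gaussian_di(x, y):
        """Sketch of equation (2.4): DI(X^N -> Y^N) for jointly Gaussian processes,
        estimated from sample covariances.  x, y: (trials, N) arrays of realizations."""
        trials, N = x.shape
        di = 0.0
        for n in range(1, N + 1):
            di += 0.5 * (_logdet_cov(np.hstack([x[:, :n], y[:, :n - 1]]))
                         + _logdet_cov(y[:, :n])
                         - _logdet_cov(np.hstack([x[:, :n], y[:, :n]]))
                         - _logdet_cov(y[:, :n - 1]))
        return di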
In a physical recording system, if X starts to influence Y after p_1 time points or with a delay of p_1 samples, we need to record at least N + p_1 time points to obtain N points of the time sequence Y that have been affected by X. The directed information rate from time series X^{N+p_1} to Y^{N+p_1} can be defined as [27],

DI_\infty(X^{N+p_1} \to Y^{N+p_1})
 = \lim_{N+p_1 \to \infty} \frac{1}{N+p_1} \sum_{n=1}^{N+p_1} I(X^n; Y_n | Y^{n-1})   (2.5)
 = \lim_{N+p_1 \to \infty} I(X^{N+p_1}; Y_{N+p_1} | Y^{N+p_1-1})   (2.6)
 = \lim_{N+p_1 \to \infty} [H(Y_{N+p_1} | Y^{N+p_1-1}) - H(Y_{N+p_1} | X^{N+p_1} Y^{N+p_1-1})]   (2.7)
 = \lim_{N+p_1 \to \infty} [H(Y_{N+p_1} | Y_{p_1+1:N+p_1-1}) - H(Y_{N+p_1} | X^{N+p_1} Y_{p_1+1:N+p_1-1})]   (2.8)
 = \lim_{N+p_1 \to \infty} [H(Y_{N+p_1} | Y_{p_1+1:N+p_1-1}) - H(Y_{N+p_1} | X_{1:N} Y_{p_1+1:N+p_1-1})]   (2.9)
 = \lim_{N+p_1 \to \infty} I(X_{1:N}; Y_{N+p_1} | Y_{p_1+1:N+p_1-1})   (2.10)
 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} I(X^n; Y_{n+p_1} | Y_{p_1+1:n+p_1-1})   (2.11)
 = DI_\infty(X_{1:N} \to Y_{p_1+1:p_1+N}),   (2.12)

where equation (2.8) comes from the fact that Y_{1:p_1} is independent of Y_{N+p_1}, and equation (2.9) is derived using the fact that X_{N+1:N+p_1} has no effect on Y_{N+p_1} because of the time delay p_1 between these two time series. For two physical recordings X and Y with length N + p_1 and a lag of p_1, the last equation shows that the DI rate for these two time series is equivalent to the DI rate for two random processes with length N that are not synchronized in time. In fact, Y_{p_1+1:p_1+N} may be indexed as Y_{1:N} when using the information theoretic indexing, which indexes the signal not according to the physical time point but based on when the receiver receives its first piece of information. Therefore, the directed information rate computed by using physical time indices is equivalent to the directed information rate using information theoretic indices for two systems that interact through a time delay. Moreover, when the length of the signal is long enough, the directed information value using both indices will be equivalent. Once the definition of directed information is extended from random vectors to two physical time series, we propose time-lagged DI and modified DI to simplify the computation of DI.

2.2.2 Time-lagged directed information

As we mentioned before, when the length N of the signal increases, the computational complexity, the bias, and the variance of estimating DI increase immensely with limited sample sizes. In addition, the directed information defined for the physical system is actually a DI with a lag of p_1 samples over a time window with length N. Therefore, an intuitive way to simplify the computation is to apply DI with lag p_1 over a small window. For example, in [2], the authors applied DI to genes X and Y at every two time samples, i.e., DI_n(X_n X_{n+1} \to Y_n Y_{n+1}), with the assumption that the value of gene X is only influenced by the values of the other genes at one previous time step, i.e. a first order Markov model assumption. However, when X influences Y with a delay of p_1, we apply DI to every two samples of these two time series X^N and Y^N at the nth time sample with a time delay of p_1 (n > p_1):

DI_n(X_{n-p_1} X_{n-p_1+1} \to Y_n Y_{n+1})
 = I(X_{n-p_1}; Y_n) + I(X_{n-p_1} X_{n-p_1+1}; Y_{n+1} | Y_n)
 = H(X_{n-p_1}) + H(X_{n-p_1} X_{n-p_1+1} Y_n) + H(Y_n Y_{n+1}) - H(X_{n-p_1} Y_n) - H(X_{n-p_1} X_{n-p_1+1} Y_n Y_{n+1}),   (2.13)

where n = p_1 + 1, \cdots, N - 1. However, in practice, the actual time lag of the two time series is unknown, and the estimated time lag d is used to compute DI for every two samples, i.e. replacing p_1 with d in equation (2.13).
Thus the main question we need to answer is how much of the actual information flow is captured if the estimated time lag d is used in equation (2.13). To answer this question, we first consider a single order bivariate linear autoregressive model with delay p_1 (in this case, the maximum order of the model is also equal to p_1) as,

x_i = u_i,
y_i = b x_{i-p_1} + v_i,   (2.14)

where u_i and v_i are white Gaussian noise samples following N(0, \sigma^2) and the order (delay) of the model is p_1. The actual directed information value can be computed by estimating the covariance matrix. For this model each term in the DI expression in equation (2.4) is simplified as:

\frac{1}{2} \log \frac{|cov(X_1 \cdots X_n Y_1 \cdots Y_{n-1})|}{|cov(X_1 \cdots X_n Y_1 \cdots Y_n)|} =
  \begin{cases} -\frac{1}{2} \log (b^2+1)\sigma^2, & n = 1, \cdots, p_1 \\ 0, & n > p_1 \end{cases}   (2.15)

\frac{1}{2} \log \frac{|cov(Y_1 \cdots Y_n)|}{|cov(Y_1 \cdots Y_{n-1})|} = \frac{1}{2} \log (b^2+1)\sigma^2.

Therefore,

I(X^n; Y_n | Y_{1:n-1}) = \frac{1}{2} \log \frac{|cov(X_1 \cdots X_n Y_1 \cdots Y_{n-1})| \, |cov(Y_1 \cdots Y_n)|}{|cov(X_1 \cdots X_n Y_1 \cdots Y_n)| \, |cov(Y_1 \cdots Y_{n-1})|} =
  \begin{cases} 0, & n = 1, \cdots, p_1 \\ \frac{1}{2} \log (b^2+1)\sigma^2, & n > p_1 \end{cases}   (2.16)

Based on equations (1.19) and (2.16), when p_1 \ll N, the directed information rate is:

DI_\infty(X^N \to Y^N) = \lim_{N \to \infty} I(X^N; Y_N | Y^{N-1}) = \frac{1}{2} \log (b^2+1)\sigma^2.   (2.17)

On the other hand, when the DI measure is computed over every two samples of X^N and Y^N, the time-lagged DI given by equation (2.13) when replacing p_1 with d can be simplified as follows:

DI_n(X_{n-d} X_{n-d+1} \to Y_n Y_{n+1}) =
  \begin{cases} 0, & d < p_1 - 1 \text{ or } d \ge p_1 + 1 \\ \frac{1}{2} \log (b^2+1)\sigma^2, & d = p_1 - 1 \\ \log (b^2+1)\sigma^2, & d = p_1 \end{cases}   (2.18)

The time-lagged DI rate per sample is defined as:

DI_d = \lim_{N \to \infty} \frac{1}{2(N-d-1)} \sum_{n=d+1}^{N-1} DI_n(X_{n-d} X_{n-d+1} \to Y_n Y_{n+1}),   (2.19)

where d is the delay variable, d = 0, \cdots, p, and p is the largest possible time delay (p \ge d). The factor of 2 is introduced since the time-lagged DI rate is averaged over samples. In addition, when d = p_1, DI_{p_1} = \frac{1}{2} \log (b^2+1)\sigma^2, which is equal to the directed information rate, i.e.,

DI_\infty(X^N \to Y^N) = DI_{p_1}.   (2.20)

Therefore, motivated by the model in equation (2.14), for any general autoregressive model with a single order, we define the time-lagged directed information over the whole time series as follows:

TLDI_d(X^N \to Y^N) = \sum_{n=1}^{d} I(X^n; Y_n | Y^{n-1}) + DI_d \times (N - d)
 = \sum_{n=1}^{d} I(X^n; Y_n | Y^{n-1}) + \lim_{N \to \infty} \frac{N-d}{2(N-d-1)} \sum_{n=d+1}^{N-1} DI_n(X_{n-d} X_{n-d+1} \to Y_n Y_{n+1})
 = \sum_{n=1}^{d} I(X^n; Y_n | Y^{n-1}) + \lim_{N \to \infty} \frac{1}{2} \sum_{n=d+1}^{N-1} DI_n(X_{n-d} X_{n-d+1} \to Y_n Y_{n+1}),   (2.21)

where d = 0, \cdots, p. The first part of the right-hand side of the above equation is the initial information flow from X^N to Y^N when n \le d. The second part is motivated by the fact that when the information flow becomes stable, DI along the time series can be approximated by the time-lagged DI rate times the number of time samples. When d = p_1, where p_1 is the actual time delay between X and Y, the first term on the right-hand side of the above equation is 0. The TLDI will capture the same amount of information as computing DI over the whole time series. Moreover, it performs much better than the two-sample DI (d = 0) used in previous work, especially when p_1 is large. For example, for the model in equation (2.14), when p_1 \ge 2, DI_0 = 0, which does not capture any of the causal dependencies. Therefore, compared to equation (2.1), the computational complexity of computing DI using equation (2.21) is highly reduced.
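For Gaussian data, the two-sample lagged DI in equation (2.13) and the rate in equation (2.19) can be estimated from sample covariances; a minimal sketch under that assumption follows (the helper names are illustrative only):

    import numpy as np

    def _h_gauss(*cols):
        """Joint Gaussian entropy of the given (trials,) columns, up to the common
        additive constants, which cancel in the DI combination of equation (2.13)."""
        data = np.column_stack(cols)
        c = np.atleast_2d(np.cov(data, rowvar=False))
        return 0.5 * np.linalg.slogdet(c)[1]

    def two_sample_lagged_di(x, y, d):
        """Sketch of equations (2.13) and (2.19): average two-sample DI with lag d.
        x, y: (trials, N) arrays; returns the lagged DI rate estimate DI_d."""
        trials, N = x.shape
        vals = []
        for n in range(d + 1, N - 1):          # 0-indexed version of n = d+1, ..., N-1
            xd, xd1 = x[:, n - d], x[:, n - d + 1]
            yn, yn1 = y[:, n], y[:, n + 1]
            vals.append(_h_gauss(xd) + _h_gauss(xd, xd1, yn) + _h_gauss(yn, yn1)
                        - _h_gauss(xd, yn) - _h_gauss(xd, xd1, yn, yn1))
        return 0.5 * np.mean(vals)             # factor of 1/2 as in equation (2.19)

Scanning d = 0, ..., p and selecting the lag with the largest rate mirrors the behavior shown later in Figure 2.1.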
If X and Y are normally distributed, the complexity of using the original definition of DI in equation (2.4) is O(N^4) (using LU decomposition [62]), while the complexity of computing TLDI over two time samples is O(N). Therefore, time-lagged DI is more computationally efficient. However, the time-lagged DI is equivalent to the original definition of DI only when the estimated delay d is equal to the actual time delay of the system p_1, i.e. the signals X and Y follow a single order model, and Y_n depends on only one past sample of itself, Y_{n-1}. These assumptions are not always true. Therefore, we propose the modified DI to address these issues.

2.2.3 Modified directed information

Consider a general Markov model, where X^N and Y^N are time series with a lag of p_1 and p(Y_n | X_{1:n-p_1}, Y_{p_1+1:n-1}) = p(Y_n | X_{n-p_2:n-p_1}, Y_{n-p_3:n-1}), where p_2 \ge p_1, p_3 \ge 1, p_2 is the order of the process X, and p_3 is the order of the process Y. In this model, it is assumed that X starts to influence Y with a delay of p_1 samples and the order of the model is max(p_2, p_3). When the length of the signal N is large enough, equation (2.12) can be further simplified as,

DI(X^N \to Y^N) = DI(X_{1:N-p_1} \to Y_{p_1+1:N}) = \sum_{n=p_1+1}^{N} I(X_{1:n-p_1}; Y_n | Y_{p_1+1:n-1})
 = \sum_{n=p_1+1}^{N} [H(Y_n | Y_{p_1+1:n-1}) - H(Y_n | X_{1:n-p_1} Y_{p_1+1:n-1})].   (2.22)

Since

p(Y_n | X_{1:n-p_1}, Y_{p_1+1:n-1}) = p(Y_n | X_{n-p_2:n-p_1}, Y_{n-p_3:n-1}),   (2.23)

X_{1:n-p_2-1} Y_{p_1+1:n-p_3-1} \to X_{n-p_2:n-p_1} Y_{n-p_3:n-1} \to Y_n forms a Markov chain. According to the Markov chain property,

I(X_{1:n-p_2-1} Y_{1:n-p_3-1}; Y_n | X_{n-p_2:n-p_1} Y_{n-p_3:n-1}) = H(Y_n | X_{n-p_2:n-p_1} Y_{n-p_3:n-1}) - H(Y_n | X_{1:n-p_1} Y_{p_1+1:n-1}) = 0,   (2.24)

which means H(Y_n | X_{n-p_2:n-p_1} Y_{n-p_3:n-1}) = H(Y_n | X_{1:n-p_1} Y_{p_1+1:n-1}). Therefore,

DI(X^N \to Y^N) = \sum_{n=p_1+1}^{N} [H(Y_n | Y_{p_1+1:n-1}) - H(Y_n | X_{1:n-p_1} Y_{p_1+1:n-1})]
 = \sum_{n=p_1+1}^{N} [H(Y_n | Y_{p_1+1:n-1}) - H(Y_n | X_{n-p_2:n-p_1} Y_{n-p_3:n-1})]
 \le \sum_{n=p_1+1}^{N} [H(Y_n | Y_{n-p_3:n-1}) - H(Y_n | X_{n-p_2:n-p_1} Y_{n-p_3:n-1})]
 = \sum_{n=p_1+1}^{N} I(X_{n-p_2:n-p_1}; Y_n | Y_{n-p_3:n-1}),   (2.25)

where the second equality uses the Markov property and the inequality comes from the fact that conditioning reduces entropy. For a general Markov model, where X^N and Y^N are stationary statistical processes without instantaneous interaction, e.g. p(Y_n | X_{1:n-p_1}, Y_{p_1+1:n-1}) = p(Y_n | X_{n-p_2:n-p_1}, Y_{n-p_3:n-1}), the modified directed information (MDI) is defined as the upper bound of DI:

MDI(X^N \to Y^N) = \sum_{n=p+1}^{N} I(X_{n-p_2} \cdots X_{n-p_1}; Y_n | Y_{n-p_3} \cdots Y_{n-1}),   (2.26)

where in practice we let p_1 = 1 and p = max(p_2, p_3) to reduce the number of parameters. Note that letting p_1 = 1 does not lose any of the information flow compared to using the actual time delay, p_1 > 1. The only drawback of letting p_1 = 1 is that the computational complexity of estimating the joint entropies increases, since the length of the window used to compute MDI, and hence the dimensionality, increases. The main reason we let p_1 = 1 is that estimating the actual value of the delay accurately is not practical when the amount of data is limited. In related work such as [57], different values of p_1 are tested to choose the best one, which is not computationally efficient either. According to equation (2.25), modified directed information is the upper bound of directed information, i.e. MDI \ge DI. Moreover, MDI is a more general extension of time-lagged DI and has two major advantages. First, MDI considers the influence of multiple past samples of Y on the DI value.
Second, it takes into account models with multiple orders, i.e. Y is influenced by different time lags of X. The modified directed information extends the length of the window from 2 to p, which is closer to the actual information flow. When X and Y are normally distributed, the computational complexity of MDI is O(p^3 N), which is more efficient than that of the original definition of DI.

2.2.4 Modified directed information versus transfer entropy

Both the modified DI and transfer entropy are defined based on a Markov signal model. Therefore, in this subsection, we explore the relationship between them in detail. Based on the definition of transfer entropy given in equation (1.11), we note that transfer entropy is defined for a physical recording system; therefore, instantaneous information exchange is not considered. In addition, the definition of TE implies a Markov assumption on the system, namely that the state of Y_n only depends on the past l states of itself and the past m states of process X, i.e. p(y_n | y_{1:n-1} x_{1:n-1}) = p(y_n | y_{n-l:n-1} x_{n-m:n-1}). Therefore, to explore the relationship between transfer entropy and directed information, DI should be derived under the same assumptions of no instantaneous information exchange and a Markov model. For two random processes X and Y without instantaneous information exchange, the directed information and directed information rate are expressed as,

DI(X^N \to Y^N) = DI(DX^N \to Y^N) = \sum_{n=1}^{N} I(X^{n-1}; Y_n | Y^{n-1}),
DI_\infty(X^N \to Y^N) = DI_\infty(DX^N \to Y^N) = \lim_{N \to \infty} I(X^{N-1}; Y_N | Y^{N-1}).   (2.27)

Based on the assumption that the system can be approximated by a Markov process, we derive a formula relating the rate of directed information to transfer entropy, as shown in the following theorem.

Theorem 1. If X^N and Y^N are two stationary Markov processes with p(y_n | y_{1:n-1} x_{1:n-1}) = p(y_n | y_{n-l:n-1} x_{n-m:n-1}), then the upper bound of the directed information rate, i.e., the modified DI rate, is equal to the transfer entropy.

Proof. Based on the Markov assumption on these two processes, i.e. p(y_n | y_{1:n-1} x_{1:n-1}) = p(y_n | y_{n-l:n-1} x_{n-m:n-1}), X_{1:n-m-1} Y_{1:n-l-1} \to X_{n-m:n-1} Y_{n-l:n-1} \to Y_n forms a Markov chain. According to Lemma 1,

I(Y_n; X_{1:n-m-1} Y_{1:n-l-1} | X_{n-m:n-1} Y_{n-l:n-1}) = H(Y_n | X_{n-m:n-1} Y_{n-l:n-1}) - H(Y_n | X_{1:n-1} Y_{1:n-1}) = 0,   (2.28)

implying H(Y_n | X_{n-m:n-1} Y_{n-l:n-1}) = H(Y_n | X_{1:n-1} Y_{1:n-1}). Therefore,

I(X^{n-1}; Y_n | Y^{n-1}) = H(Y_n | Y_{1:n-1}) - H(Y_n | X_{1:n-1} Y_{1:n-1})
 = H(Y_n | Y_{1:n-1}) - H(Y_n | X_{n-m:n-1} Y_{n-l:n-1})
 \le H(Y_n | Y_{n-l:n-1}) - H(Y_n | X_{n-m:n-1} Y_{n-l:n-1})
 = I(X_{n-m:n-1}; Y_n | Y_{n-l:n-1})
 = T^n_{X \to Y},   (2.29)

where the last equality comes from equation (1.12). The inequality follows from the fact that conditioning reduces entropy, and the equality holds when Y_{1:n-l-1} is conditionally independent of Y_n given Y_{n-l:n-1}, i.e. when l is large enough that the influence of Y_{1:n-l-1} on Y_n can be ignored, or when n \le l, i.e. Y_{n-l:n-1} = Y_{1:n-1}. The directed information rate in a physical recording system can then be expressed as,

DI_\infty(DX^N \to Y^N) = \lim_{N \to \infty} I(X^{N-1}; Y_N | Y^{N-1}) \le \lim_{N \to \infty} I(X_{N-m:N-1}; Y_N | Y_{N-l:N-1}) = \lim_{N \to \infty} T^N_{X \to Y}.   (2.30)

Therefore, when l = m = N - 1, DI_\infty(DX^N \to Y^N) = \lim_{N \to \infty} T^N_{X \to Y}, which is aligned with previous work in [29]. For stationary Markov processes, when m and l are fixed, \lim_{N \to \infty} I(X_{N-m:N-1}; Y_N | Y_{N-l:N-1}) is equal to the rate of modified DI in equation (2.26) with p_1 = 1, p_2 = m, and p_3 = l. Moreover, in order to reduce the computational complexity, we usually let l = m.
In this way, in practice, the limit (rate) of transfer entropy is the upper bound of directed information rate and is equal to the modified DI rate. In summary, transfer entropy and directed information are very closely related to each other. Transfer entropy quantifies the information gained at each time step by measuring the deviation of the observed data from the generalized Markov condition. Therefore, the definition of transfer entropy implicitly assumes a stationary Markov process [29]. Compared to transfer entropy, directed information quantifies the sum of information obtained over the whole time series [64] and does not make any assumptions about the underlying signal model. Thus, theoretically, the original definition of directed information can be applied to any signal model. In real applications, in order to simplify the computation of directed information, we usually make certain assumptions about the underlying signal model such as the modified DI proposed in this dissertation, which basically assumes a stationary Markov process similar to transfer entropy. In addition, Amblard et al. proved that for a stationary process, directed information rate can be decomposed into two parts, one of which is equivalent to the transfer entropy when l = m = n − 1 in equation (1.11) and the other to the instantaneous information exchange rate [29]. In another words, for a physical system without instantaneous interactions between its subsystems, the rate of these two measures, directed information and transfer entropy, are equivalent asymptotically as the length of the signal goes to infinity. When l or m is fixed and not equal to n − 1, transfer entropy rate is the upper bound of directed information rate and is equal to the modified DI rate. 32 2.2.5 Order selection For the implementation of MDI, we need to determine the maximum order of the model p. Criterions such as Akaike’s Final Prediction Error (FPE) can be used to determine the order of the signal model p. However, this criterion is based on the assumption that the original signal follows a linear AR model and may lead to false estimation of the order when the underlying signal model is nonlinear. Therefore, model-free order selection methods, such as the embedding theorem [65], are needed. For the simplification of computation or parameter estimation, we are only interested in a limited number of variables that can be used to describe the whole system. Suppose we have a time series (X1 , · · · , Xn ), the time-delay vectors can be reconstructed as (Xn , Xn−τ , Xn−2τ , · · · , Xn−(d−1)τ ). Projecting the original system to this lower dimensional state space depends on the choice of d and τ , and the optimal embedding dimension d is related to the order of the model p = d [57]. A variety of measures such as mutual information can be used to determine τ . For discrete time signals, usually the best choice of τ is 1 [66]. To determine d, Cao criterion based on the false nearest neighbor procedure [57] is used to determine the local dimension. The underlying concept of nearest neighbor is that: if d is the embedding dimension of a system, then any two points that stay close in the d-dimensional reconstructed space are still close in the (d + 1)-dimensional reconstructed space; otherwise, these two points are false nearest neighbors [66, 57]. The choice of d, i.e., the model order p, is important for DI estimation. If d is too small, we will lose some of the information flow from X to Y. 
If it is too large, the computational complexity of MDI will be very high, causing the bias and the variance of the estimators to increase.

2.2.6 Normalization and significance testing

Since DI(X^N \to Y^N) + DI(Y^N \to X^N) = I(X^N; Y^N) + DI(X^N \to Y^N || DX^N) and DI(X^N \to Y^N) = DI(DX^N \to Y^N) + DI(X^N \to Y^N || DX^N) [27], then

DI(X^N \to Y^N) + DI(Y^N \to X^N) = DI(DX^N \to Y^N) + DI(X^N \to Y^N || DX^N) + DI(DY^N \to X^N) + DI(Y^N \to X^N || DY^N).   (2.31)

Therefore,

DI(DX^N \to Y^N) + DI(DY^N \to X^N) + DI(Y^N \to X^N || DY^N) = I(X^N; Y^N),   (2.32)

where DI(Y^N \to X^N || DY^N) = DI(X^N \to Y^N || DX^N) indicates the instantaneous information exchange between processes X and Y. For a physical system without instantaneous causality, i.e. DI(X^N \to Y^N || DX^N) = 0, we have DI(X^N \to Y^N) + DI(Y^N \to X^N) = I(X^N; Y^N) and 0 \le DI(X^N \to Y^N) \le I(X^N; Y^N) < \infty. A normalized version of DI, which maps DI to the [0, 1] range, is used for comparing different interactions,

\rho_{DI}(X^N \to Y^N) = \frac{DI(X^N \to Y^N)}{I(X^N; Y^N)} = \frac{DI(X^N \to Y^N)}{DI(X^N \to Y^N) + DI(Y^N \to X^N)},   (2.33)

where for a unidirectional system X \to Y with no instantaneous interaction between X and Y, \rho_{DI}(X^N \to Y^N) = 1 and \rho_{DI}(Y^N \to X^N) = 0; otherwise, if there is no causal relationship between the two signals, the values of \rho_{DI}(X^N \to Y^N) and \rho_{DI}(Y^N \to X^N) are very close to each other.

In order to test the null hypothesis of noncausality, the causal structure between X and Y is destroyed. For each process with multiple trials, we shuffle the order of the trials of the time series X 100 times to generate new observations X^*_m, m = 1, \cdots, 100. In this way, the causality between X and Y for each trial is destroyed, and the estimated joint probability changes [67]. We compute the DI for each pair of data (X^*_m and Y). A threshold is obtained at the \alpha = 0.05 significance level such that 95% of the directed information values for randomized pairs of data (DI(X^*_m \to Y)) are less than this threshold. If the DI value of the original pair of data is larger than this threshold, then there is significant information flow from X to Y.

2.2.7 Performance of modified directed information

In this section, we compare the performance of three different approaches: computing DI using the original definition, the proposed time-lagged DI, and the modified DI. For the proposed TLDI and MDI, the order of the model p is determined by the Cao criterion. TLDI is computed over different time lags. The comparison is based on three different simulation models. Without loss of generality, we repeat each simulation 100 times to quantify the mean and variance of the different computation approaches. In addition, for the linear models, we obtain the bias by comparing the means of the different approaches with the theoretical DI value. First, we test the performance of the proposed time-lagged DI for a bivariate linear autoregressive model given as follows:

X_i = u_i,
Y_i = 0.5 \times X_{i-2} + v_i,   (2.34)

where u_i and v_i are white Gaussian noise samples following N(0, 1) and the order of the model is 2. We generate 2048 realizations for each time series, and compute the DI value using the three different approaches over a block of N = 8 time samples. The order of the model p is equal to 2. TLDI is computed over different time lags d = 0, \cdots, 3. From Figure 2.1, we can observe that when d = 2, TLDI_2 reaches its maximum value, which is aligned with the order of the model. However, when d \ne 2, TLDI_d loses most of the causal dependencies.
Therefore, computing DI over two samples with d = 0, as is done in previous work [2], is not sufficient for high order models. In addition, from Table 2.1, we can observe that TLDI_2 and MDI have lower bias and are more computationally efficient (less computation time) compared to the original definition of DI. MDI is larger than the theoretical DI, as anticipated by the theoretical bound in Section 2.2. DI using the original definition has the largest bias, which is due to the fact that the bias of each term in equation (2.4) increases with the length of the signal for a limited number of realizations.

Figure 2.1: Average information flow over 100 simulations for the single order linear model computed using the original definition of DI, TLDI and MDI. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Table 2.1: Performance comparison for single order linear model
            Bias      Variance (x10^-3)   Computation time (s)
DI          0.0095    0.6032              0.0511
TLDI_2      0.0032    0.6356              0.0233
MDI         0.0036    0.5937              0.0235

Next we test the performance of DI computation over different time windows on a multiple order bivariate linear autoregressive model as follows:

X_i = 0.5 \times X_{i-1} + 0.4 \times X_{i-3} + u_i,
Y_i = 0.6 \times X_{i-2} + 0.3 \times X_{i-4} + 0.5 \times Y_{i-3} + v_i,   (2.35)

where u_i and v_i are white Gaussian noise samples following N(0, 1) and the maximum time delay between X and Y is 4. We generate 2048 realizations for each time series and compute DI using the different measures over a length of N = 12 time samples. The causal information flow from X to Y at time sample i, I(X^i; Y_i | Y^{i-1}), is plotted in Figure 2.2(a). We can observe that, beginning at time sample 4 when X starts to influence Y, the MDI at each time point is larger than DI, which is aligned with the inequality in equation (2.25). The DI, TLDI with different time delays d = 0, \cdots, 4, and MDI (p = 4), averaged over simulations, are shown in Figure 2.2(b). We can observe that MDI is slightly larger than the theoretical value of DI, because MDI is the upper bound of directed information. Moreover, we apply the sign test to test whether there is a statistically significant difference between MDI and the actual DI, and the hypothesis that there is no difference between the two measures cannot be rejected at the 5% significance level. Therefore, MDI can be used to replace DI to reduce the computational complexity. From Table 2.2, we can observe that MDI has lower bias and variance than TLDI, and is more computationally efficient than using the original definition of DI, with the computation time cut in half. Moreover, for TLDI, it is hard to choose a proper d when the model is multi-order.

Figure 2.2: Modified DI for bivariate linear autoregressive model with multiple time lags. (a) The increase in the amount of information flow (I(X^i; Y_i | Y^{i-1})) at time point i. (b) Average total information flow over 100 simulations using the original definition of DI, TLDI and MDI.
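For reference, when the data can be treated as jointly Gaussian, the MDI in equation (2.26) with p_1 = 1 and p_2 = p_3 = p reduces to a sum of conditional mutual informations computable from sample covariances; the following is a minimal sketch under that assumption (the function names are illustrative, not the implementation used for the results above):

    import numpy as np

    def _h(data):
        # Gaussian joint entropy up to additive constants that cancel in the CMI below
        c = np.atleast_2d(np.cov(data, rowvar=False))
        return 0.5 * np.linalg.slogdet(c)[1]

    def modified_di(x, y, p):
        """Sketch of equation (2.26) with p1 = 1, p2 = p3 = p:
        MDI = sum_n I(X_{n-p:n-1}; Y_n | Y_{n-p:n-1}), Gaussian plug-in estimate.
        x, y: (trials, N) arrays of realizations."""
        trials, N = x.shape
        mdi = 0.0
        for n in range(p, N):                    # 0-indexed version of n = p+1, ..., N
            xp = x[:, n - p:n]                   # X_{n-p}, ..., X_{n-1}
            yp = y[:, n - p:n]                   # Y_{n-p}, ..., Y_{n-1}
            yn = y[:, n:n + 1]                   # Y_n
            # I(Xp; Yn | Yp) = H(Xp,Yp) + H(Yp,Yn) - H(Yp) - H(Xp,Yp,Yn)
            mdi += (_h(np.hstack([xp, yp])) + _h(np.hstack([yp, yn]))
                    - _h(yp) - _h(np.hstack([xp, yp, yn])))
        return mdi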
Finally, we evaluate the performance of MDI for a nonlinear autoregressive model,

X_i = 0.3 \times X_{i-1} + u_i,
Y_i = 0.8 \times X_{i-1} + \frac{\sqrt{0.2}}{0.1 - 4 \times e^{X_{i-2}}} + v_i,   (2.36)

where u_i and v_i are white Gaussian noise samples following N(0, 1) and the maximum time delay between X and Y is 2.

Table 2.2: Performance comparison for multi-order linear model
            Bias       Variance   Computation time (s)
DI          0.0216     0.0014     0.1037
TLDI_2      0.3751     0.0029     0.0333
TLDI_4      -0.2148    0.0022     0.0360
MDI         0.0699     0.0015     0.0458

We generate 8192 realizations for each time series and compute DI and MDI (p = 2) over a length of N = 6 time samples. The results averaged over 100 simulations are shown in Figure 2.3. We can observe that MDI is larger than DI computed using the original definition. Although the DI value computed over the whole time series does not necessarily reflect the actual information flow, for a stationary model, when i > 2, the information flow in the system becomes stable, i.e., I(X^i; Y_i | Y^{i-1}) should not change much. However, using the original definition of DI, this value is not stable, as shown in Figure 2.3(a). This is due to the MI estimator used for computing DI, whose bias and variance increase with the number of joint variables for a limited sample size [61]. On the other hand, the dimension and the number of joint pdfs are fixed for MDI as seen in equation (2.26), which also leads to the lower variance of MDI in Table 2.3. In addition, the computation time of MDI is only one fourth of that of the original DI, which is important for detecting the nonlinear causality of a complex network with a large number of nodes. Therefore, MDI is preferred over DI computed over the whole time series because of its reduced computational complexity and stable performance.

Table 2.3: Performance comparison for nonlinear model
       Mean      Variance   Computation time (s)
DI     0.6109    0.0069     95.3133
MDI    1.2342    0.0010     24.4649

Figure 2.3: Modified DI for nonlinear autoregressive model with multiple time lags. (a) The increase in the amount of information flow (I(X^i; Y_i | Y^{i-1})) at time point i. (b) Average total information flow over 100 simulations using the original definition of DI and MDI.

2.3 Application of DI to bivariate signal models

Having addressed the computation of DI, in this section we test the validity and evaluate the performance of DI for quantifying effective connectivity. We generate five different simulations. We use these simulation models to compare DI with classical Granger causality (GC) for quantifying causality of both linear and nonlinear autoregressive models, linear mixing models, single source models, and dynamic Lorenz systems. The Matlab toolbox developed by Seth is used to compute the GC value in the time domain. GC is also normalized to the [0, 1] range for comparison purposes [68]. The performance of GC depends on the length of the signal, whereas the performance of DI relies on the number of realizations of the time series. Therefore, for each simulation, the length of the generated signal for implementing GC is equal to the number of realizations for DI. The significance of DI values is evaluated by shuffling along the trials, while the significance of GC values is evaluated by shuffling along the time series.
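The trial-shuffling significance test of Section 2.2.6 can be summarized in a short sketch; the DI estimator is passed in as a callable so that any of the estimators discussed in this chapter can be used (the names shuffle_threshold and di_estimator are illustrative):

    import numpy as np

    def shuffle_threshold(x, y, di_estimator, n_shuffles=100, alpha=0.05, rng=None):
        """Shuffle the trial order of X to destroy the X->Y causal structure,
        recompute DI for each surrogate, and return the (1 - alpha) percentile.
        x, y: (trials, N) arrays; di_estimator: callable taking (x, y) -> float."""
        rng = np.random.default_rng(rng)
        surrogate_di = []
        for _ in range(n_shuffles):
            perm = rng.permutation(x.shape[0])   # shuffle trials of X only
            surrogate_di.append(di_estimator(x[perm], y))
        return np.quantile(surrogate_di, 1 - alpha)

The information flow from X to Y is then declared significant when di_estimator(x, y) exceeds shuffle_threshold(x, y, di_estimator).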
39 2.3.1 Simulated signal models Example 1: Multiple order bivariate linear autoregressive model In this example, we evaluate the performance of DI on a general bivariate linear model, X(n) = Y (n) = p4 ∑ i=1 p3 ∑ i=1 αi X(n − i) + σx ηx (n − 1), βi Y (n − i) + γ p2 ∑ X(n − i) + σy ηy (n − 1). (2.37) (2.38) i=p1 In this bivariate AR model with a delay p1 and order p2 − p1 + 1, γ controls the coupling strength between the signals X and Y. The initial values of X and Y, and the noise ηx and ηy are all generated from a Gaussian distribution with mean 0 and standard deviation 1. All coefficients (αi , βi , σx and σy ) are generated from Gaussian distributions with zero mean and unit variance with unstable systems being discarded. To evaluate the performance of directed information, we generate the bivariate model 4096 times with the same parameters but different initial values. γ is varied from 0.1 to 1 with a step size of 0.1, p1 = 1 and p2 = p3 = p4 = 5, i.e. Y is influenced by X through multiple time lags. Without loss of generality, we repeat the simulation 10 times, and average DI(X N → Y N ) and DI(Y N → X N ) over 10 simulations for different γ values. For each simulation, the threshold is evaluated by trial shuffling and the average threshold is obtained. For GC, the length of the generated signal is chosen as 4096, which is the same as the number of realizations for DI. The GC values in two directions and the corresponding thresholds at the 5% significance level are obtained. The DI value in two directions averaged across 10 simulations with different γ are shown in Figure 2.4(a). The performance of GC is shown in Fig 2.4(b). The estimated order of the model is p = 5, which is in accordance with the simulation model. We observe that DI(X N → Y N ) is significant for all values of γ. On the contrary, DI(Y N → X N ) is less than the threshold, which indicates the acceptance of the null hypothesis that there is no significant causal information flow from Y to X. Since GC uses a linear autoregressive framework for quantifying causality, in this example, GC detects the causality relationship 40 between X and Y successfully, i.e. the information flow from X to Y is significant for all γ while it is insignificant for the opposite direction. It is also interesting to note that GC and DI exhibit similar behavior across different values of γ, indicating the equivalency of the two 1 0.9 0.8 0.7 0.6 0.5 0.4 X->Y 0.3 Y->X Threshold: X->Y 0.2 Threshold: Y->X 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 J (a) Granger causality Directed information measures for linear Gaussian signal models. 1 0.9 0.8 0.7 0.6 0.5 0.4 X->Y 0.3 Y->X 0.2 Threshold: X->Y Threshold: Y->X 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 J (b) Figure 2.4: Application of directed information and Granger causality to bivariate linear autoregressive model. (a) Directed information with different γ. (b) Granger causality with different γ. Example 2: Multiple order bivariate nonlinear autoregressive model In this example, we evaluate the performance of DI on a general bivariate nonlinear model, X(n) = Y (n) = p4 ∑ i=1 p3 ∑ i=1 αi X(n − i) + σx ηx (n − 1), βi Y (n − i) + γ p2 ∑ i=p1 (2.39) 1 + σy ηy (n − 1). (2.40) 1 + exp(b1 + b2 X(n − i)) For this bivariate nonlinear AR model, the setting for the coupling strength γ and the generation of X, Y, ηx , ηy , αi , βi , σx , σy , p1 , p2 , p3 and p4 are the same as in Example 1. Y and X interact nonlinearly through the sigmoid function. 
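As a concrete illustration, a simplified instance of the coupled model in equations (2.39) and (2.40) can be generated as sketched below; for brevity the AR parts are first order and only a single sigmoid coupling lag is used, so this is an illustrative special case rather than the full model used in the simulations (the function name is hypothetical):

    import numpy as np

    def simulate_sigmoid_coupled(trials, N, gamma, b1=0.0, b2=50.0, seed=None):
        """Simplified instance of equations (2.39)-(2.40): first-order AR parts and a
        single sigmoid coupling lag (p1 = p2 = 1), chosen here only for illustration."""
        rng = np.random.default_rng(seed)
        x = np.zeros((trials, N))
        y = np.zeros((trials, N))
        for n in range(1, N):
            x[:, n] = 0.5 * x[:, n - 1] + rng.normal(size=trials)
            y[:, n] = (0.3 * y[:, n - 1]
                       + gamma / (1.0 + np.exp(b1 + b2 * x[:, n - 1]))  # sigmoid coupling
                       + rng.normal(size=trials))
        return x, y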
Parameters of this function b1 and b2 control the threshold level and slope of the sigmoidal curve, respectively. We set b1 = 0 and b2 = 50. DI value and its threshold are averaged over 10 simulations for different 41 γ. The GC values in two directions and the corresponding thresholds at 5% significance level are obtained. The performance of DI and GC for the nonlinear autoregressive model in equations (2.39) and (2.40) averaged across 10 simulations with different γ are evaluated as shown in Figure 2.5. The estimated order of the model is 5. We observe that when γ is less than 0.3, the coupling strength between X and Y is weak and the DI value in both directions is not significant. As γ increases, DI(X N → Y N ) increases and becomes significant. DI(Y N → X N ) decreases with increasing γ and is still less than the threshold as expected. The results indicate increased unidirectional information flow from X to Y with increasing γ and show that detecting the information flow in nonlinear processes is more difficult especially when the coupling strength is low. GC fails to detect the information flow from X to Y for all γ. Since GC is implemented in a linear framework, the estimated order and the model itself do not match with the nonlinearity of the signal. Therefore, it cannot 1 0.9 0.8 0.7 0.6 X->Y Y->X 0.5 Threshold: X->Y 0.4 Threshold: Y->X 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 J (a) Granger causality Directed information detect nonlinear causality. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 X->Y 0.2 Y->X Threshold: X->Y 0.1 Threshold: Y->X 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 J (b) Figure 2.5: Application of directed information and Granger causality to bivariate nonlinear autoregressive model. (a) Directed information with different γ. (b) Granger causality with different γ. Example 3: Linear Mixing model In this example, we test the effectiveness of DI in inferring effective connectivity when there is linear mixing between two signals. Linear instantaneous mixing is known to exist in 42 human noninvasive electrophysiological measurements such as EEG or MEG. Instantaneous mixing from coupled signals onto sensor signals by the measurement process degrades signal asymmetry [57]. Therefore, it is hard to detect the causality between the two signals. For unidirectional coupled signal pairs X → Y described in equations (2.37) to (2.40), we create two linear mixtures Xϵ and Yϵ as follows, Xϵ (n) = (1 − ϵ)X(n) + ϵY (n), (2.41) Yϵ (n) = ϵX(n) + (1 − ϵ)Y (n), (2.42) where ϵ controls the amount of linear mixing and is varied from 0.05 to 0.45 with a step size of 0.05, and γ is fixed to 0.8 for both models. When ϵ = 0.5, the two signals are identical. Both DI and GC are used to quantify the information flow between Xϵ and Yϵ in the two directions. The DI value and GC value averaged across 10 simulations with changing linear mixing coefficient ϵ for both linear and nonlinear AR models are shown in Figure 2.6. The estimated order of the model is 5 as before. When ϵ = 0.5, the two observed mixing signals are identical and we expect to see no significant information flow in the two directions. We observe that for the linear AR model, directed information detects the causality between Xϵ and Yϵ when ϵ is smaller than 0.4. When ϵ is larger than 0.4, the causality between Xϵ and Yϵ is hard to detect because of the strong mixing, i.e., Xϵ and Yϵ are almost identical, and the information flow in both directions becomes insignificant. 
Compared to DI, GC only detects the causality from Xϵ to Yϵ when the mixing is weak (ϵ < 0.2), indicating that GC is more vulnerable to linear mixing. It is probably due to the fact that GC is sensitive to the mixture of signals and the assumed signal model does not match with the original signal [69]. For the nonlinear AR model, DI fails to detect causality when ϵ is larger than 0.1, which indicates that linear mixing of nonlinear source models makes it harder to detect effective connectivity compared to mixing of linear source models. On the other hand, GC fails to detect any causality even when ϵ = 0, since it cannot detect nonlinear interactions. Example 4: Single source model 43 Granger causality 1 X->Y 0.9 Y->X Threshold: X->Y 0.8 Threshold: Y->X 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.350.4 0.45 H (c) Granger causality Directed information Directed information 1 0.9 0.8 0.7 0.6 X->Y 0.5 Y->X Threshold: X->Y 0.4 Threshold: Y->X 0.3 0.2 0.1 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.350.4 0.45 H (a) 1 0.9 0.8 0.7 0.6 X->Y Y->X 0.5 Threshold: X->Y Threshold: Y->X 0.4 0.3 0.2 0.1 0 0 0.05 0.1 0.15 0.2 0.25 0.30.350.4 0.45 H (b) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 X->Y 0.2 Y->X Threshold: X->Y 0.1 Threshold: Y->X 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.350.4 0.45 H (d) Figure 2.6: Application of directed information and Granger causality to linear mixing for both linear and nonlinear autoregressive models. (a) Directed information with different ϵ for the linear mixing of linear AR model. (b) Granger causality with different ϵ for the linear mixing of linear AR model. (c) Directed information with different ϵ for the linear mixing of nonlinear AR model. (d) Granger causality with different ϵ for the linear mixing of nonlinear AR model. A single source is usually observed on different signals (channels) with individual channel noises [57], which is common in EEG signals due to the effects of volume conduction. In this case, false positive detection of effective connectivity occurs for methods such as Granger causality [69], which means GC has low specificity. We generate two signals Xϵ and Yϵ as follows to test the specificity of DI when there is no significant information flow from one 44 signal to the other signal. S(n) = p4 ∑ αi S(n − i) + ηS (n); (2.43) i=1 Xϵ (n) = S(n); (2.44) Yϵ (n) = (1 − ϵ)S(n) + ϵηY (n); (2.45) where S(n) is the common source generated by an autoregressive model, order p4 = 5, αi and ηS (n) are generated from a Gaussian distribution with mean 0 and standard deviation 1. S(n) is measured on both sensors Xϵ and Yϵ . Yϵ is further corrupted by independent Gaussian noise ηY (n) with 0 mean and unit variance. ϵ controls the signal to noise ratio (SNR) in Yϵ and is varied from 0.1 to 0.9 with a step size of 0.1, corresponding to SNR in the range of −19 ∼ 19 dB. The DI value and GC value averaged across 100 simulations for changing ϵ for a single source model are shown in Figure 2.7. The estimated order of the model is 5. In addition, the false positive rate using both DI and Granger causality with increasing ϵ is also calculated. We observe that the information flow in two directions using DI are less than the threshold for all values of ϵ, which indicates the acceptance of the null hypothesis that there is no significant causal information flow from Xϵ to Yϵ or Yϵ to Xϵ . Note that DI is normalized by the mutual information. 
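A minimal sketch of how such single-source observations could be generated is given below; the AR coefficients are drawn small so that the common source stays stable, which is a simplification of the description above, and the function name is illustrative:

    import numpy as np

    def simulate_single_source(trials, N, eps, p=5, seed=None):
        """Sketch of equations (2.43)-(2.45): one common AR source S observed on two
        sensors; Y_eps adds independent Gaussian noise scaled by eps."""
        rng = np.random.default_rng(seed)
        alpha = 0.1 * rng.normal(size=p)            # small coefficients for stability
        s = np.zeros((trials, N))
        for n in range(N):
            k = min(n, p)
            s[:, n] = s[:, n - k:n][:, ::-1] @ alpha[:k] + rng.normal(size=trials)
        x_eps = s                                    # sensor X records the source directly
        y_eps = (1 - eps) * s + eps * rng.normal(size=(trials, N))
        return x_eps, y_eps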
For a common source model, the instantaneous information exchange between Xϵ and Yϵ contributes mostly to the mutual N information between Xϵ and Yϵ . Thus according to equation (2.32), DI(DXϵ → YϵN ) N and DI(DYϵN → Xϵ ) normalized by mutual information are close to 0 and less than the threshold from the randomized data pairs. The false positive rate of DI is 0 for all ϵ. Therefore, DI is able to discriminate between instantaneous mixing from actual causality and is very robust to noise. For GC, when ϵ is small (< 0.2) or large (> 0.9), the value of GC is less than or very close to the threshold in both directions thus indicating that there is no causal information flow between the two processes. However, GC fails to accept the null hypothesis when ϵ is between 0.3 to 0.9 and detects a non-existing effective connectivity. GC reaches its maximum value when ϵ = 0.5. This is due to the fact that GC is close to 45 0 when two processes Xϵ and Yϵ are independent or identical, i.e. when ϵ = 1 and ϵ = 0. Based on the definition of GC, the prediction of Yϵ at the current time point will not be improved by taking into account the past samples of Xϵ for these processes [58]. Therefore, as ϵ increases from 0 to 0.5, Xϵ becomes the most different from Yϵ , therefore it can provide more new information about Yϵ and the GC increases. As ϵ increases from 0.5 to 1, Xϵ becomes independent of Yϵ , and the GC decreases. The false positive rate of GC is not equal to 0 for all values of ϵ, which indicates that it has lower specificity compared to DI. Therefore, GC is not robust to the effect of a common source and may infer false positive effective connectivity. This simulation indicates that DI is more sensitive and discriminative about the information flow patterns in the presence of volume conduction, which means it is a more promising method to capture the effective connectivity for real EEG data. Example 5: Nonlinear dynamic system In this example, we illustrate the applicability of DI to coupled Lorenz oscillators with a certain delay. The Lorenz oscillator is a three-dimensional dynamic system that exhibits chaotic behavior. Synchronization of two Lorenz systems has been widely investigated for the analysis of EEG data, because the dynamic interactions related to the behavior of the cortex can be exemplified by these coupled systems [70]. In the following, we examined two asymmetric coupled Lorenz oscillators (X1 , Y1 , Z1 ) and (X2 , Y2 , Z2 ) as follows [71], ˙ X1 (t) = −A(X1 (t) − Y1 (t)), (2.46) Y˙1 (t) = RX1 (t) − Y1 (t) − X1 (t)Z1 (t), (2.47) ˙ Z1 (t) = X1 (t)Y1 (t) − BZ1 (t), (2.48) ˙ X2 (t) = −A(X2 (t) − Y2 (t)) + βX1 (t − tp ), (2.49) Y˙2 (t) = RX2 (t) − Y2 (t) − X2 (t)Z2 (t), (2.50) ˙ Z2 (t) = X2 (t)Y2 (t) − BZ2 (t), (2.51) where each equation is a first-order differential equation. A = 10, R = 28, B = 8 , and 3 tp = 0.02 represents the time delay between two coupled components of these two oscillators, 46 0.6 Granger causality 0.8 X->Y Y->X Threshold: X->Y Threshold: Y->X 0.4 0.2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 H (a) 1 0.8 1 0.8 0.6 0.4 X->Y Y->X Threshold: X->Y Threshold: Y->X 0.2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 H (b) False positive rate (GC) Directed information False positive rate (DI) 1 X->Y Y->X 0.6 0.4 0.2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 H (c) 1 0.8 0.6 0.4 0.2 X->Y Y->X 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 H (d) Figure 2.7: Application of directed information and Granger causality to single source model. 
(a) Directed information with different ϵ for the single source model. (b) Granger causality with different ϵ for the single source model. (c) False positive rate for directed information with different ϵ for the single source model. (d) False positive rate for Granger causality with different ϵ for the single source model. i.e. X1 and X2 . β corresponds to the coupling strength and is varied from 0.1 to 1 with a step size of 0.2. The differential equations are numerically integrated with a time step of 0.01 using Euler’s method [72], corresponding to a delay of 2 time samples between X1 and X2 . The initial conditions of these six components are randomly generated from a Gaussian distribution with zero mean and unit variance. We generate 100 samples and the first 90 samples are discarded to eliminate the initial transients. We compute the information flow in two directions over 10 time points and the significance of the obtained DI value is verified by trial shuffling. The DI values and GC values between X1 and X2 of two asymmetric coupled Lorenz systems are computed with coupling strength β being set from 0.1 to 1. The 47 estimated order of the model is 3. Though this is larger than the actual model order, our method will not lose any information except for the increased computational complexity. The results are shown in Figure 2.8. The results show that DI values from X1 to X2 increase with the coupling strength β and are significant for all values of β. In addition, there is no significant causal information flow from X2 to X1 . Therefore, DI can effectively detect the causality in a nonlinear dynamic system. On the contrary, GC can not detect any significant information flow for all β values. It is due to the fact that the model selected for 1 X->Y 0.9 Y->X Threshold: X->Y 0.8 Threshold: Y->X 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 E (a) Granger causality Directed information implementing GC is not consistent with the dynamic characteristics of the system. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 X->Y 0.2 Y->X Threshold: X->Y 0.1 Threshold: Y->X 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 E (b) Figure 2.8: Application of directed information and Granger causality to two asymmetric coupled Lorenz oscillators. (a) Directed information with different β. (b) Granger causality with different β. 2.3.2 Biological data In this subsection, we examine EEG data from ten undergraduates at Michigan State University drawn from an ongoing study of relationships between the error-related negativity (ERN) and individual differences1 such as worry and anxiety. ERN is a brain potential response that occurs following performance errors in a speeded reaction time task [48]. All EEG data are collected as described in Chapter 1. Once the data are obtained, for each 1 Participants for the present analysis were drawn from samples reported on in [46, 47]. 48 subject, the EEG data are preprocessed by the spherical spline current source density (CSD) waveforms to sharpen event-related potential (ERP) scalp topographies and eliminate volume conduction [73]. In addition, a bandpass filter is used to obtain signals in the theta band. In this study we focus on 33 electrodes corresponding to the frontal, central and parietal regions of the brain. For each pair of 33 electrodes X and Y for each subject, the effective connectivity is quantified by computing the modified DI over 70 trials and a model order of p in the theta band. 
The model order or the length of the time window p is determined by the Cao Criterion. We also apply Granger causality to the same data and compare its performance with directed information. Previous work indicates that there is increased information flow associated with ERN for the theta frequency band (4 − 8 Hz) and ERN time window 25 − 75 ms for Error responses compared to correct responses in particular between mPFC and lPFC regions [74]. In addition, Cavanagh et al. have shown that there is increased synchronization for error trials between electrode pairs, such as FCz-F5 and FCz-F6, compared to the synchrony between FCz-CP3 and FCz-CP4 [75]. The DI and GC values for each pair of electrodes averaged over 10 subjects are computed over a time window of 53 time points (100ms). The estimated order of the model for each electrode pairs is 3. In order to control the error rates for multiple hypothesis testing for all pairs of electrodes, the method proposed by Genovese et al. is used in this dissertation [76]. To implement this procedure, for two electrodes with time series X and Y, we first shuffle the order of the trials of X 100 times to generate new observations X∗ , m = 1, · · · , 100. The P -value of DI(X → Y) is obtained by comparing it with DI m values from randomized pairs of data DI(X∗ → Y), m = 1, · · · , 100. We then obtain the m threshold Pr for all P-values (33 × 33 × 10) by controlling the FDR bound q as 0.05. For DI(X → Y), if the P -value is less than Pr , then the directed information flow from X to Y is significant; otherwise, it is not significant. Electrode pairs between which the information flow is significant in at least one of the ten subjects are shown in Figure 2.9(b). We also test the significance of Granger causality in the same way. When the FDR is controlled at 0.05, 49 DI F5 F3 F1 Fz F2 F4 F6 0.6 0.55 FC5 FC3 FC1 FCz FC2 FC4 FC6 C5 C3 C1 Cz C2 C4 C6 0.5 CP5 CP3 CP1 CPz CP2 CP4 CP6 0.45 P3 P1 Pz P2 P4 0.4 0.35 (a) (b) GC F5 F3 F1 Fz F2 F4 F6 0.6 0.55 FC5 FC3 FC1 FCz FC2 FC4 FC6 C5 C3 C1 Cz C2 C4 C6 0.5 CP5 CP3 CP1 CPz CP2 CP4 CP6 0.45 P3 P1 Pz P2 P4 0.4 0.35 (c) (d) Figure 2.9: Application of directed information and Granger causality to EEG data. (a) Pairwise directed information. (b) Electrode pairs with significant DI values. (c) Pairwise Granger causality. (d) Electrode pairs with significant GC values. For (b) and (d), green dots indicate the location of the particular node and white regions correspond to significant information flow from that particular electrode to other electrodes. The name of each particular node in (b) and (d) is identical to the name in (a) and (c). The details of the significant electrode interactions are shown in Table 2.4. the information flow between electrode pairs is significant if the P -value of DI or GC is less than 0.01. Electrode pairs that have significant causality relationship using both measures are shown in Figure 2.9. 
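A sketch of this multiple-comparison procedure is given below, assuming the trial-shuffled surrogate DI values have already been collected for every electrode pair and subject; the smoothed permutation P-value and the Benjamini-Hochberg form of the FDR threshold are illustrative choices consistent with, but not dictated by, the procedure of [76]:

    import numpy as np

    def permutation_pvalue(di_obs, di_surrogates):
        """P-value of an observed DI against its trial-shuffled surrogates."""
        di_surrogates = np.asarray(di_surrogates)
        return (1 + np.sum(di_surrogates >= di_obs)) / (1 + di_surrogates.size)

    def fdr_threshold(pvalues, q=0.05):
        """Largest p(k) with p(k) <= q*k/M over all M tests; 0 if nothing survives."""
        p = np.sort(np.ravel(pvalues))
        M = p.size
        below = p <= q * np.arange(1, M + 1) / M
        return p[below][-1] if below.any() else 0.0

A link X -> Y is then reported as significant when its permutation P-value falls below the threshold returned by fdr_threshold over all 33 x 33 x 10 tests.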
In Figure 2.9(a) and Figure 2.9(c), each small circle shows the 50 Table 2.4: Electrode pairs in the region of interest with significant DI values From To From To F5 F1 FC2 CPz CP4 P3 C5 F6 FC5 Cz CP4 F3 FC3 CP4 C3 FC2 C6 P1 F1 C1 Cz Pz C1 FC1 C6 FZ F5 CZ F5 C2 CP4 F2 FC3 FC6 C5 CP1 C2 FC6 F4 F6 C4 C4 P2 F6 F2 FC3 FCz Cz C6 Pz FC5 Fz C3 C2 CP6 CP5 Cz C4 CP3 FC3 CP1 CP3 C5 CPz P4 FC1 F4 FC3 C2 CP1 CP4 CP1 F6 FCz P3 FCZ C3 CP1 CPz FC6 C6 CP5 CP4 P1 FC2 F3 C1 C6 CP2 CP4 P3 CP2 F6 FCz FC4 CP1 FC4 C5 CP4 FC5 FCz C4 FC6 C5 C4 CP1 CP6 F5 P3 P4 P1 F2 C6 CP2 Pz F5 F4 FCz P2 FC4 C5 P4 F3 F4 FC3 FC2 FC4 Pz P2 directed information and Granger causality from a particular electrode to other electrodes. In Figure 2.9(b) and Figure 2.9(d), each small circle shows electrode pairs that have significant causality relationship. The details of the significant electrode interactions are also shown in Table 2.4. The results indicate that DI detects strong information flow from the frontal region (e.g. F5, F6) to the frontal-central region (e.g. FC2, FCz) corresponding to the lateral prefrontal cortex (lPFC) and medial prefrontal cortex (mPFC). In addition, the central-parietal region (e.g. CPz, CP1, CP2) around the midline, corresponding to the mo51 tor cortex, has strong influence on the central and frontal regions (e.g. FCz, F6) since this is a speeded response task involving the motor cortex. These results are aligned with the previous work in [75], which shows that error processing is controlled by the communication between the lateral prefrontal cortex and medial prefrontal cortex. When GC is applied to the same data, the information flow pattern around the midline is similar to the DI. However, the information flow from the lateral prefrontal cortex to the rest of the brain is significant. On one hand, the similar patterns of connectivity using both measures verify the validity of proposed DI computation algorithm. On the other hand, GC shows significance over a wide region of the brain especially in the lateral areas compared to DI, which may be due to GC’s low specificity to volume conduction in the form of a common source. Previous work and our simulation in Example 4 have indicated that Granger causality based measures may infer erroneous effective connectivity in the case of the common source as seen in EEG data [69, 57]. However, without ground truth, we cannot confirm that some links reported as significant by GC are spurious and due to volume conduction in a conclusive manner, but the results from DI agree more with the suggestions in [75], that most of the increase in connectivity during cognitive control, i.e. ERN, should be between medial prefrontal cortex and lateral prefrontal cortex, compared to the results of GC. Therefore, DI is more sensitive and discriminative about the information flow patterns compared to GC for real neurophysiological data. 2.4 Estimation of directed information 2.4.1 Estimation based on entropy estimation According to equation (2.1), for DI estimation, the entropy estimator should be applied to both marginal and joint entropy estimation. In this subsection, we will review three commonly used entropy estimators and comment on their applicability to DI estimation. Plug-in estimator 52 The entropy H(f ) for a continuous probability density function f (x) is given by: ∫ H(f ) = − f (x) log f (x)dx (2.52) The plug-in estimates of entropy are based on a consistent estimation fN of f and the estimation of fN depends on N realizations (x1 , · · · , xN ) of X. 
The ordinary histogram is the most commonly used density estimation method. First, the range from the minimum to the maximum of the N realizations of X is divided into bins, where B_k = [t_k, t_{k+1}) denotes the k-th bin and h = t_{k+1} - t_k; the approximation of the pdf is then given by [77]:

\hat{f}(x) = \frac{\nu_k}{N h} \quad \text{for } t_k \le x < t_{k+1},   (2.53)

where \nu_k is the number of data points that fall in the k-th bin. The accuracy of this method depends on both the proper bin size and the sample size. When the samples are barely sufficient, it has a large bias, which becomes worse as the dimensionality of the observed space increases. An alternative to the histogram is the kernel density estimator (KDE), written as [77]:

\hat{f}(x) = \frac{1}{N h} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right),   (2.54)

where K is the kernel function and h is the width of the kernel (the smoothing parameter). Some common kernel types include the uniform, Epanechnikov, Gaussian, and Laplacian kernels. The quality of KDE depends on the choice of the smoothing parameter and the choice of kernel. Although KDE becomes time consuming for joint entropy estimation, it can still give an accurate estimate of DI if proper parameters are chosen.

M-spacing estimator

The m-spacing estimator is a nonparametric estimator which estimates the entropy directly from i.i.d. data samples without approximating the pdf. Consider samples of a scalar random variable X = (x_1, \cdots, x_N) rearranged in non-decreasing order x_{(1)} \le \cdots \le x_{(N)}. The m-spacing is then defined to be x_{(i+m)} - x_{(i)}, for 1 \le i < i + m \le N. If m is a function of N, one may define the m_N-spacing as x_{(i+m_N)} - x_{(i)}. The m_N-spacing entropy estimator can be defined as [63]:

\hat{H}_N(x_1, \cdots, x_N) = \frac{1}{N} \sum_{i=1}^{N - m_N} \log\left(\frac{N}{m_N} (x_{(i+m_N)} - x_{(i)})\right).   (2.55)

The m-spacing estimators of entropy are based on the intuition that sums of small random intervals have consistent behavior. In order to estimate the joint entropies of multiple random variables, multi-dimensional regions with constant expected probability mass are generated by constructing a Voronoi or Delaunay region [63]. Then the following estimator is used:

\hat{H}_{Hyper} = \sum_{i=1}^{m} \frac{C(U^i)}{N} \log \frac{N A(U^i)}{C(U^i)},   (2.56)

where C(U^i) is the number of (finite volume) Voronoi regions in a hyper-region U^i, N = \sum_i C(U^i), and A(U^i) is the d-dimensional volume of the Voronoi region U^i. The calculation of the volumes needed for equation (2.56) is exponential in the dimension. According to equation (2.1), estimation of H(X_{k-d} X_{k-d+1} Y_k Y_{k+1}), which is a 4-dimensional problem, makes the calculation of volumes complicated and slow in DI estimation.

Kozachenko and Leonenko (KL) estimator

Nearest neighbor (NN) distance based entropy estimators were introduced by Kozachenko and Leonenko, and are also known as the KL estimator [78]. For a random variable X with N observations, X = (x_1, \cdots, x_N), the distance of each point (observation) x_i to any other point x_j, j = 1, \cdots, N, j \ne i, defined as d_{i,j} = \|x_i - x_j\|, is found. Then the distances to the N - 1 neighbors of each point are ranked: d_{i,j_1} \le \cdots \le d_{i,j_{N-1}}. H(X) can be estimated from the average distance to the k-nearest neighbors, averaged over all x_i. The KL entropy estimator is defined as:

\hat{H}(X) = -\psi(k) + \psi(N) + \log(c_d) + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon(i),   (2.57)

where \psi(x) is the digamma function, d is the dimension of X, c_d is the volume of the d-dimensional unit ball, and \epsilon(i) is twice the distance from x_i to its kth neighbor.
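A minimal sketch of equation (2.57) is given below; the function name kl_entropy is illustrative, and scipy is assumed only for the digamma function and the k-d tree query:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma, gammaln

    def kl_entropy(samples, k=4):
        """Sketch of the Kozachenko-Leonenko estimator in equation (2.57).
        samples: (N, d) array (a 1-D array is treated as d = 1); k: neighbor index."""
        x = np.asarray(samples, dtype=float)
        if x.ndim == 1:
            x = x[:, None]
        N, d = x.shape
        # eps(i): twice the distance from x_i to its k-th nearest neighbor
        dist, _ = cKDTree(x).query(x, k=k + 1)    # the first hit is the point itself
        eps = 2.0 * dist[:, -1]
        log_cd = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1)  # log volume of unit d-ball
        return -digamma(k) + digamma(N) + log_cd + (d / N) * np.sum(np.log(eps))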
The algorithm spends most of its time in spatial queries, which is unacceptable for DI estimation because the dimension of the point xi can be larger than four. 54 As it can be seen from the above discussions, the nonparametric entropy estimators are either complex or time consuming when applied to high dimensional data. Therefore, methods with less complexity and high efficiency are needed for estimating DI. Universal estimator Recently, Zhao et al. proposed an universal algorithm to estimate directed information for stationary ergodic processes by using sequential probability assignment and context tree weighting [79]. For this estimator, DI is obtained by estimating H(Y) and H(Y||X) separately and DI(X → Y) = H(Y) − H(Y||X). Given the realizations of X and Y and the X,Y on (X, Y), the sequential probability assignments X,Y are used to calculate the estimate of H(Y) and Y universal source code Cn on Y and Cn Y QY and QX,Y induced by Cn and Cn H(Y||X). Zhao et al. employed the context tree weighting as the universal source coding scheme and the universal probability assignments can be constructed from a universal coding scheme [79]. This algorithm requires both the realizations of the signal and an universal source code scheme with low complexity and fast convergence rates. In addition, the original context tree weighting is for binary sequences, and has to be extended for discrete signals with continuous values and multiple realizations. 2.4.2 Estimation based on mutual information and multi-information DI can also be expressed in terms of mutual information and multi-information, which require the estimation of the common information between two length N random vectors (I(X N ; Y N )) or among multiple one-dimensional random variables (I(X 1 , · · · , X N , Y 1 , · · · , Y N )). Estimators based on adaptive partitioning, which do not require parameter selection, region construction and N N -search, are efficient for high dimensional data and are used in both methods for estimating DI. Mutual information estimation According to equation (2.2), directed information can be written in terms of mutual information. The most straightforward approach for estimating MI is partitioning the supports 55 of X and Y into bins of finite size, and approximating equation (1.1) by the finite sum: I(X; Y ) ≈ Ibinned (X; Y ) ≡ ∑ i,j p(i, j) log p(i, j) px (i)py (j) (2.58) Ibinned (X; Y ) is obtained by counting the number of samples falling into the various bins, which is similar to the histogram. If we let N → ∞ then all bin sizes tend to zero, and the right hand side of Equation (2.58) will converge to I(X; Y ). If the distributions are fractal, this convergence might no longer be true. Darbellay presented a data-dependent nonparametric estimator of the mutual information based on an adaptive partitioning of the observed space [61]. The basic concept of the method is to build a succession of finer partitions of the high-dimensional observed space and stop the refinement process on any hyperrectangle when the local independence has been achieved. Multi-information estimator In equation (2.3), we introduce an alternative representation of DI in terms of multiinformation, which can be estimated by extending the adaptive data-dependent partitioning method to multiple random variables [42]. The procedure for multi-information estimation is as follows: • X1 , · · · , Xd are d one-dimensional random variables with N observations (xi (1), · · · , xi (j), · · · , xi (N )), i = 1, · · · , d, j = 1, · · · , N . 
First, N observations of each random variable are rank ordered separately. zi (j) is the rank of xi (j) with respect to the other N − 1 samples from the same random variable, zi (j) ∈ {1, · · · , N }. Then the estimation of multi-information among X1 , · · · , Xd has turned into the estimation of multi-information among Z1 , · · · , Zd , which is the ranked sample space of X1 , · · · , Xd . • In the (Z1 , · · · , Zd ) space, a dyadic partitioning of the space is iteratively done until the sample distribution of each hypercube is conditionally independent. Given N samples of Z = (Z1 , · · · , Zd ) ∈ Zd , let Zd be the initial one-cell partition. Then every cell is to be partitioned by marginal equiquantization and not partitioned further unless it achieves conditional independence. For example, for a cell C, each edge (margins) of 56 the d-dimensional cell is divided into α (α = 2 in general) intervals with approximately the same number of the points in each marginal subintervals to obtain αd subcells Ck , k = 1, · · · , αd . If α = 2, the partition point is the midpoint of each edge in the Z space. The lower (L) and upper (U ) bounds of each dimension of the subcell Ck are k k zi (L) and zi (U ), i is the ith dimension of Ck , i = 1, · · · , d. The total number of points in cell Ck and the marginal number of points in each dimension are: NZ (Ck ) = number of points z such that k k zi (L) < zi < zi (U ) for all i = 1, · · · , d k k NZ1 (Ck ) = number of points z such that z1 (L) < z1 < z1 (U ) . . . k k NZ (Ck ) = number of points z such that zd (L) < zd < zd (U ) d The χ2 goodness-of-fit test at the 3% significance level is applied to all subcells to test the local independence of cell C, that is, d α ∑ (NC − NC /αd )2 k NC /αd k=1 ≤ χ2 (αd − 1), 0.97 (2.59) where NC is the number of points in cell C, and NC is the number of points in k subcell Ck , k = 1, · · · , αd , respectively. If the condition in equation (2.59) is fulfilled, the hypothesis of conditional independence is accepted and the cell C is not subjected to further partitioning. • Once the conditional independence has been achieved, it can be shown that: d ˆ I(Z1 , . . . , Zd ) = α ∑ PZ1 ,··· ,Z (Ck ) log d k=1 = 1 N αd ∑ k=1 NZ (Ck ) log PZ1 ,··· ,Z (Ck ) d PZ1 (Ck ), · · · PZ (Ck ) d (2.60) NZ (Ck ) + (d − 1) log N, NZ1 (Ck ) · · · NZ (Ck ) d where PZ1 ,··· ,Z (Ck ) is the probability that the N d-dimensional points (z1 , · · · , zd ) d falls into each hypercube Ck , PZ1 ,··· ,Z (Ck ) = NZ (Ck )/N , and PZ1 (Ck ), · · · , PZ (Ck ) d d are the corresponding marginal probabilities, PZi (Ck ) = NZi (Ck )/N . 57 Though the mutual information and multi-information offer two different ways to express DI, the procedure for the two estimation methods are very similar. However, according to equation (2.3), DI estimation based on multi-information has one more term I(Y N ) to estimate, which makes it computationally slightly more expensive. 2.4.3 Performance comparison of estimators In this section, we compare different directed information estimation methods using simulated data to verify their effectiveness. In order to reduce the computational complexity for each simulation, DI for every successive two time points are computed. In the first example, we consider a first order linear autoregressive (AR) model defined as: Xi = 0.5 × Xi−1 + ui ; (2.61) Yi = 0.2 × Yi−1 + 0.8 × Xi−1 + vi , where ui and vi are white Gaussian random processes with standard deviation of 0.3 and X and Y have zero initial conditions. 
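For illustration, trials of the model in equation (2.61) can be generated with a short Python sketch such as the one below; the function name and the default trial and sample counts are illustrative rather than prescriptive:

import numpy as np

def simulate_linear_ar(n_samples=20, n_trials=128, noise_std=0.3, seed=0):
    """Coupled AR(1) model of eq. (2.61): X drives Y with a one-sample delay.
    Returns arrays of shape (n_trials, n_samples) with zero initial conditions."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_trials, n_samples))
    Y = np.zeros((n_trials, n_samples))
    for i in range(1, n_samples):
        X[:, i] = 0.5 * X[:, i - 1] + rng.normal(0.0, noise_std, n_trials)
        Y[:, i] = 0.2 * Y[:, i - 1] + 0.8 * X[:, i - 1] + rng.normal(0.0, noise_std, n_trials)
    return X, Y

X, Y = simulate_linear_ar()  # since X drives Y, DI(X -> Y) should exceed DI(Y -> X)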
We generate 128 realizations of X and Y and compute the DI over 20 time samples. In order to evaluate the bias and variance of each estimator, the same model is replicated 100 times. From equation (2.61), it is clear that X is the driver of Y, so we expect the directed information from X to Y to be greater than that from Y to X. First, the averaged DI for each estimator at two successive time points is shown in Figure 2.10.

[Figure 2.10: Averaged information flow for the linear model using different estimators (128 realizations).]

Second, in order to compare the performance of the different methods and verify the advantages of the information based estimators, we take the average DI value along the whole time sequence and evaluate the bias, variance, computational efficiency and discrimination power of each estimator. The bias of each method is obtained by comparison with the covariance matrix based estimate, which is closest to the actual DI value since X and Y are normally distributed in this simulation. The computational efficiency is quantified by the average run time of each method, and the discrimination power (DP) is quantified by the difference of the mean DIs in the two directions normalized by the standard errors, defined as:

DP = \frac{DI(X \to Y) - DI(Y \to X)}{\sqrt{\left(S^2_{DI(X \to Y)} + S^2_{DI(Y \to X)}\right)/N}},   (2.62)

where DI(X \to Y) = \frac{1}{N-1}\sum_{k=1}^{N-1} DI_k(X \to Y), S_{DI(X \to Y)} is the standard deviation of the DI values along the time sequence, and DI_k(X \to Y) is the directed information from X to Y over the small time window from k to k+1. In this simulation we use a Gaussian kernel for KDE, whose width is chosen by likelihood cross-validation (LCV). As can be seen from Figure 2.10 and Table 2.5, the histogram and m-spacing estimators have higher bias compared to the other methods. The histogram is very dependent on the number of bins, i.e., if we use a smaller bin size such as 5, it will have a lower bias. The kernel method has a slightly upward bias while the MI and multi-information based estimation methods have a downward bias, but the absolute differences are nearly the same; however, the latter are faster than the kernel method. We also increase the number of trials to show the dependency of each estimator on the number of data samples. The result for 1024 realizations is shown in Table 2.6.

Table 2.5: Performance comparison for the linear model (128 realizations)
                    Mean     Variance(×10−3)   Time(s)    DP
Covariance          0.7058   0.1264            0.0155     15.0449
Histogram           0.9856   0.0007            6.4049     2.7736
Kernel              0.7724   0.1607            8.6807     7.9381
M-spacing           0.9790   0.0006            55.6460    11.4727
MI                  0.6185   0.4705            2.7546     5.3127
Multi-information   0.6382   0.7489            2.7355     4.8684

When the number of trials increases, the bias of all of the estimators decreases. Moreover, the MI and multi-information based methods outperform the kernel method in bias and get very close to the performance of the covariance matrix method without any prior knowledge of the distribution of the data.
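For clarity, the discrimination power of equation (2.62) can be computed from the per-window DI values as in the following Python sketch (the interface and names are illustrative):

import numpy as np

def discrimination_power(di_xy, di_yx, n_samples):
    """Eq. (2.62): difference of the mean DI values in the two directions,
    normalized by the pooled standard error; di_xy and di_yx hold the DI values
    for the N-1 successive time windows and n_samples is the signal length N."""
    mean_diff = np.mean(di_xy) - np.mean(di_yx)
    pooled_se = np.sqrt((np.var(di_xy, ddof=1) + np.var(di_yx, ddof=1)) / n_samples)
    return mean_diff / pooled_se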
Table 2.6: Performance comparison for the linear model (1024 realizations)
                    Mean     Variance(×10−3)   Time(s)     DP
Covariance          0.7009   0.0000            0.0113      19.9898
Histogram           0.9033   0.0000            11.6406     12.4591
Kernel              0.7622   0.0000            195.1709    22.6683
M-spacing           0.9809   0.0000            613.2450    27.2449
MI                  0.7035   0.0000            8.0961      12.5346
Multi-information   0.7096   0.1000            7.6563      11.3131

In the second example, we consider a first order nonlinear autoregressive model given below and compare the proposed method with the regular histogram estimation for every two pairs of X and Y:

X_i = 0.5 X_{i-1} + u_i;
Y_i = 0.2 Y_{i-1} + 0.8 X_{i-1}^2 + v_i,   (2.63)

where u_i and v_i are distributed as in the first example. The performance of each estimator for 128 and 1024 realizations of X and Y is shown in Tables 2.7 and 2.8. We can observe that the MI and multi-information based methods run much faster than the others. We should also note that the multi-information based method has much stronger discrimination power than the MI estimator, though the discrimination power is very low when the number of trials is small, which makes it hard to infer the causality between X and Y accurately. However, the discrimination power becomes stronger when the number of trials increases to 1024.

Table 2.7: Performance comparison for the nonlinear model (128 realizations)
                    Mean     Variance(×10−3)   Time(s)    DP
Histogram           0.9842   0.0018            2.5665     1.3349
Kernel              0.5503   0.6849            3.5845     2.9192
M-spacing           0.9631   0.0012            30.4351    4.1032
MI                  0.3126   1.2990            0.7148     0.0802
Multi-information   0.3184   1.7980            0.6970     0.6136

Table 2.8: Performance comparison for the nonlinear model (1024 realizations)
                    Mean     Variance(×10−3)   Time(s)     DP
Histogram           0.8633   0.0640            11.1156     3.0179
Kernel              0.5368   0.0809            191.9125    11.2326
M-spacing           0.9676   0.0001            602.7246    13.7146
MI                  0.3507   0.2887            5.5128      2.1319
Multi-information   0.3494   0.5902            5.2984      4.0577

2.5 Conclusions

In this chapter, we presented the time-lagged directed information and modified directed information to reduce the computational complexity of computing DI while still quantifying the causal dependencies. These simplified measures are derived for stationary statistical processes with limited order, and it is proven that the rate of modified DI is equal to the transfer entropy rate. The simulation results presented above indicate that the MDI measure is more suitable than TLDI for approximating DI when the order of the model is unknown or when there are multiple time lags. Even though MDI is shown to be an upper bound for the actual DI, it achieves better performance than TLDI in terms of bias and is comparable in terms of computational complexity. Moreover, we also introduced a new directed information estimation method based on multi-information and provided a quantitative comparison of various DI estimation methods. Considering various factors including bias, variance, computational speed and discrimination power, the MI and multi-information based DI estimation methods have similar performance and are better than the others. Moreover, the multi-information based estimator outperforms the MI estimator in discriminating nonlinear causal relationships. Finally, in order to illustrate the advantages of DI, we applied the directed information measure to identify the causality relationships for both linear and nonlinear AR models, linear mixing models, single source models and Lorenz systems, and compared its performance with Granger causality.
Directed information is shown to be more effective in detecting the causality of different systems compared to Granger causality. We also applied the directed information measure on EEG data from a study containing the error-related negativity to infer the information flow patterns between different regions. The results showed that the directed information measure can capture the effective connectivity in the brain between the mPFC and lPFC areas as predicted by previous work. There are still remaining issues with the implementation of directed information. First, the performance of directed information relies on accurate estimation from limited sample sizes that introduces bias to the estimated values. This problem can be addressed by either using parametric density models or improving existing mutual information and entropy estimators. Recently, Zhao et al. proposed an universal algorithm to estimate directed in- 62 formation for stationary ergodic processes by using sequential probability assignment, which may be used to improve the effective connectivity results discussed in this dissertation [79]. Current applications of this algorithm are for binary sequences, therefore, it has to be extended for discrete signals with continuous values and multiple realizations. Second, the performance of directed information relies on the selection of the model order. If the order of the model is too small, it will lose the information from X to Y. If it is too large, the computational complexity is very high. In addition to classical embedding dimension determination methods such as the Cao criterion used in this dissertation, Faes et al. proposed a sequential procedure to determine the embedding dimension of multivariate series [80]. This method is based on an information-theoretic technique and shows promising performances for various signal models, which may be extended to DI computation in the future. Third, directed information does not discriminate between direct and indirect interactions among multivariate time series. However, this is not a shortcoming of DI since DI does not assume any particular signal interaction model: bivariate or multivariate. Similar to other information theoretic measures, such as mutual information, whether the particular measure can identify interactions between multiple processes depends on how the measure is applied. For example, in the case of mutual information, though the original definition is for two random processes X and Y, it is possible to extend it to multiple processes [60]. Similarly, we can apply DI over multiple processes using conditional directed information such as the definition given by Kramer. We address this issue in the next chapter by using conditional directed information and develop algorithms to infer the actual network. Similarly, GC originally is defined for two time series that a stochastic process X causing another process Y if the prediction of Y at the current time point, Yn , is improved when taking into account the past samples of X. However, in application it has been extended to multiple processes through the use of multivariate AR models, such as PDC. We also compare the performance of our algorithm based on conditional directed information with PDC in the next chapter. 63 Chapter 3 DIRECTED NETWORK INFERENCE BASED ON DIRECTED INFORMATION 3.1 Introduction In many complex systems, such as the brain network, another interesting problem is to reveal the actual causal structure of the network, i.e. 
effective network inference, rather than quantifying pairwise causal relationships which is not sufficient to discriminate direct interactions from indirect interactions. Effective network inference algorithms can be categorized into three groups: pair-wise algorithms, equation-based algorithms and network-based algorithms [81], which are based on different measures to quantify the causality. The pair-wise algorithms try to find pairs of variables that are correlated and influence the behavior of each other. Cross-correlation and its extension in the frequency plane, i.e. coherence, are the most traditional measures to capture the causal relationships in neural networks, brain networks and so on [82, 83]. However, these approaches have two drawbacks: (1) they assume the linearity of the relationship between the variables, which is not always true, e.g. EEG signals are known to have nonlinear dependencies [17]; (2) they quantify the relationship between two variables without considering the effect of the third variable, which may generate false positive connections in a network. Equation-based algorithms use a model, such as multivariate autoregressive model (MVAR) and dynamic causal model, to relate the values of variables [44, 84]. For example, partial directed coherence (PDC) and direct transfer function (DTF), derived from MVAR, are widely used to determine the neural networks from multivariate recordings [44, 85]. However, these equation-based methods require a priori knowledge of the dynamics of the data generating systems (models) and sufficient time samples to build the model. The network-based algorithms are dedicated to finding the best network, such as Boolean networks or Bayesian network, to describe the observation64 al data [81]. One of the representative algorithms, i.e., Dynamic Bayesian network (DBN) inference algorithm [86], was first developed to infer nonlinear transcriptional regulatory networks and was later used successfully at reconstructing nonlinear neural information flow networks [81, 87]. One major issue with this kind of algorithms is that the computational complexity to learn the structure of a network proves to be NP-hard [88]. Therefore, it can only be applied to relatively small networks [89, 90]. Recently, information-theoretic approaches have been widely used for the inference of large networks in the bioinformatics community [91]. Most of these methods rely on estimating the mutual information between variables from data to quantify the dependency. Unlike correlation-based algorithms, information-theoretic approaches can quantify the nonlinear dependencies [92, 60]. The Chow-Liu tree algorithm is the first to adopt the mutual information in probabilistic model design with minimum spanning tree, which has a low number of edges even for non-sparse target networks [91]. The Relevance network (RELNET) extends these ideas by determining the relevant connections such that a pair of nodes X and Y is connected when the mutual information is above a threshold. However, this method may infer false connections when two nodes are indirectly connected through an intermediate node [93]. The algorithm for the reconstruction of accurate cellular networks (ARACNE) addresses this problem by using the data processing inequality for mutual information [92]. Zhao et al. also address the same problem using conditional mutual information [60]. However, since mutual information is a symmetric quantity, all of these methods are limited to inferring undirected networks. 
Quinn et al. extend the Chow-Liu tree for random variables to the causal dependence tree for random processes, and show that the best causal dependence tree approximation is the one which maximizes the sum of directed information on its edges [94]. Similar to the Chow-Liu tree, this method also has a low number of edges since not all real networks follow the tree structure, and requires a priori knowledge of the root in some implementations. In this thesis, we propose a directed acyclic network inference algorithm based on estimating directed information between signals over time in order to 65 quantify both the connectivity and causality in networks. Since we addressed the problem of computation and estimation of DI in the previous chapter, one major problem remains with the application of the directed information to infer the directed network, i.e., discriminating between direct and indirect connections in the network [95]. We propose time-lagged conditional directed information and modified conditional directed information based inference algorithms to address this problem. The organization of this chapter is as follows. First, we introduce the concept of conditional directed information. Then we propose time-lagged conditional directed information and modified conditional directed information to reduce the computational complexity of this measure. Finally, three different network inference algorithms are given and the comparison with existing methods are shown through simulated network models. 3.2 3.2.1 Conditional directed information Background Based on the definition of directed information, if X is the direct cause of Y, then the value of DI(X → Y) is significant indicating causality between the two processes. However, a significantly large DI value does not guarantee that X directly causes Y; X and Y may interact indirectly through other nodes [31]. Kramer extended the directed information to three variables using causal conditional directed information (CDI) DI(X N → Y N ∥ Z N ), which measures the information flow from X N to Y N when causally conditioned on sequence Z N [27]: DI(X N → Y N ∥ Z N ) = H(Y N ||Z N ) − H(Y N ||X N Z N ) = = N ∑ [H(Yn |Y n−1 Z n ) − H(Yn |Y n−1 X n Z n )] n=1 N ∑ I(X n ; Yn |Y n−1 Z n ), n=1 66 (3.1) where I(X; Y |Z) is the conditional mutual information between two random variables X and Y given Z. This definition differs from the conditional mutual information I(X N ; Y N |Z N ) only in that X n and Z n replace X N and Z N in each term on the right of equation (3.1). When H(Yn |Y n−1 Z n ) = H(Yn |Y n−1 X n Z n ), DI(X N → Y N ∥ Z N ) = 0, i.e., given the observation of the third time series Z up to the current time point n, X does not provide any information about Y. 3.2.2 Motivational example Directed information can reveal the causal dependencies between two variables, but it can not distinguish between direct and indirect causality. Indirect interactions will cause false positive connections for network reconstruction. To address the problem, we first specify how conditional directed information can help in network inference problems using two simple motivational models. Then, time-lagged conditional directed information and modified conditional directed information are introduced to solve the computational complexity problem while still being able to remove the indirect connections. First, consider the following two trivariate models: a hub and chain model. 
A trivariate autoregressive model following a hub pattern is as follows: Xi = ui , Yi = bXi−m + vi , (3.2) Zi = cXi−n + wi , where ui , vi and wi are i.i.d processes following a Gaussian distribution N (0, σ 2 ). In this model, nodes Y and Z are indirectly connected through a hub (parent) node X, and X interacts with them with different time delays (Figure. 3.1). The trivariate model following 67 a chain pattern is given as: Xi = ui , Yi = bXi−m + vi , (3.3) Zi = cYi−(n−m) + wi , where n > m, ui , vi and wi are i.i.d processes following a Gaussian distribution N (0, σ 2 ). In this model, X interacts with Z through intermediate (proxy) node Y . For both models, the time lags between X → Y, X → Z and Y → Z are m, n and n − m, respectively. The directed information rate between any two variables is larger than 0. The DI values and time lags between all of the pairs are shown in Figure 3.1. For network inference, if we determine there is a connection between two nodes when the DI value is larger than 0, then the inferred network for both hub and chain models will be the same. Therefore, only employing DI, it is hard to discriminate between the direct and indirect connections which may cause incorrect network inference. We introduce conditional directed information to remove the indirect connections. For the hub case, there is no information flow from Y to Z directly and DI(Y N → Z N ||X N ) = 0. While for the chain case, X interacts with Z through intermediate (proxy) node Y and DI(X N → Z N ||Y N ) = 0 (See Appendix A for details). Thus, CDI helps to discriminate between the two connection patterns. 3.2.3 Computation of conditional directed information Similar to directed information, the computational complexity of computing conditional directed information also increases with the signal length. Thus, we proposed a simplified approach to compute CDI, while still reflecting the actual information flow from X to Y influenced by Z. In order to reduce the computational complexity, in practice, estimation of CDI is limited to every two time samples of three processes X, Y and Z. We can define the 68 X X m m n n Y Y n-m DI(X->Y)= log(b2+1) DI(X->Z)=log(C2+1) Z n-m Z 2+1) DI(X->Y)=log(b 2 DI(X->Z)=log[(b2 c2 +c2 +1)/(c +1)] 2 2 DI(Y->Z)=log[(b2+1)(c2+1)/(b +c2+1)] DI(Y->Z)=log(b c2 +c2+1) 22 2 2 2 2 log[(b2+1)(c2+1)/(b +c2+1)] 1 and p3 > 1. The only drawback of letting p1 = p3 = 1 is that the computational complexity of estimating the joint entropies increases since the length of the window to compute MCDI increases and the dimensionality increases. The main reason why we let p1 = p3 = 1 is because estimating the actual value for the delay accurately is not practical when the amount of data is limited. For the hub case, when p(Zn |Zn−p:n−1 Xn−p:n−1 Yn−p:n−1 ) = p(Zn |Xn−p:n−1 , Zn−p:n−1 ), then M CDIY,Z||X = N ∑ I(Yn−p:n−1 ; Zn |Zn−p:n−1 Xn−p:n−1 ) n=p+1 = N ∑ [H(Zn |Zn−p:n−1 Xn−p:n−1 ) − H(Zn |Zn−p:n−1 Xn−p:n−1 Yn−p:n−1 )] n=p+1 = N ∑ [H(Zn |Zn−p:n−1 Xn−p:n−1 ) − H(Zn |Zn−p:n−1 Xn−p:n−1 )] n=p+1 = 0. (3.10) For the chain case, when p(Zn |Zn−p:n−1 Xn−p:n−1 Yn−p:n−1 ) = p(Zn |Yn−p:n−1 , Zn−p:n−1 ), 72 then M CDIX,Z||Y = N ∑ I(Xn−p:n−1 ; Zn |Zn−p:n−1 Yn−p:n−1 ) n=p+1 = N ∑ [H(Zn |Zn−p:n−1 Yn−p:n−1 ) − H(Zn |Zn−p:n−1 Xn−p:n−1 Yn−p:n−1 )] n=p+1 = N ∑ [H(Zn |Zn−p:n−1 Yn−p:n−1 ) − H(Zn |Zn−p:n−1 Yn−p:n−1 )] n=p+1 = 0. (3.11) Therefore, the modified CDI can discriminate the direct connection from the indirect connection. 
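To make the two motivational models concrete, the following Python sketch generates single realizations of the hub model in equation (3.2) and the chain model in equation (3.3); the coupling strengths and lags are illustrative defaults, and the sketch is not part of the inference procedure itself:

import numpy as np

def simulate_hub(n_samples, b=0.5, c=0.5, m=1, n_lag=2, sigma=1.0, seed=0):
    """Hub model of eq. (3.2): X drives Y with lag m and Z with lag n_lag,
    so Y and Z are only indirectly related through the common parent X."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, n_samples)
    y = np.zeros(n_samples)
    z = np.zeros(n_samples)
    y[m:] = b * x[:-m] + rng.normal(0.0, sigma, n_samples - m)
    z[n_lag:] = c * x[:-n_lag] + rng.normal(0.0, sigma, n_samples - n_lag)
    return x, y, z

def simulate_chain(n_samples, b=0.5, c=0.5, m=1, n_lag=2, sigma=1.0, seed=1):
    """Chain model of eq. (3.3): X -> Y with lag m and Y -> Z with lag n_lag - m
    (n_lag > m), so the apparent X -> Z coupling is mediated entirely by Y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, n_samples)
    y = np.zeros(n_samples)
    z = np.zeros(n_samples)
    y[m:] = b * x[:-m] + rng.normal(0.0, sigma, n_samples - m)
    k = n_lag - m
    z[k:] = c * y[:-k] + rng.normal(0.0, sigma, n_samples - k)
    return x, y, z

In the hub case, conditioning on X should drive the estimated CDI from Y to Z toward zero, while in the chain case conditioning on Y should do the same for X to Z, which is exactly the behavior exploited by the inference algorithms in the next section.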
MCDI is a general extension of TLCDI, which takes the influence of multiple time samples into account. 3.3 Network inference algorithms In the previous section, we proposed simplified version of directed information and conditional directed information with minimal computational complexity to quantify the causal dependencies. In the following section, we propose three algorithms for directed network inference based on DI and CDI. As discussed before, using directed information alone will cause false connections. Any three nodes in a network can interact with each other through two possible scenarios as equations (3.2) and (3.3) show: three nodes interacting through a hub node Y ← X → Z or interacting as a chain X → Y → Z. As discussed in the previous section, in both cases, if three pairs (X, Y), (Y, Z), (X, Z) all have large DI values, they will infer the same directed network. Therefore, directed information fails to discriminate between direct and indirect dependencies. This leads to two questions to address the problem: (1) If we can determine the time lag between any two nodes, can we infer the network based on the time lag information? (2) Since the conditional directed information can evaluate the amount of information flow from X to Y given Z, can we use this information for network 73 inference? We introduce three approaches to remove indirect causal connections in a network based on the time lag and conditional directed information. 3.3.1 Algorithm based on directed information and conditional directed information The first inference algorithm uses directed information and conditional directed information with no time lag to infer the direct interactions between nodes [95]. In the first step (lines 3 − 14) of Algorithm 1 in Figure 3.2, we calculate the TLDI between two nodes i and j with d = 0 in equation (2.21), which corresponds to applying DI to every two time samples of i and j without time delay. To test the significance level of the obtained value of DI, the p-value is determined by using the distribution of DI values computed from trial shuffling data sets. If the p-value of DI(i → j) is less than the threshold of p-value, i.e. α, which is determined by controlling the FDR bound q as 0.05, the information flow from node i to j is assumed to be significant. In order to identify the direct vs. indirect connections (lines 15 − 25), for each node pair i, j, we evaluate the conditional directed information given any other node k for all k ̸= i, j. If node k does not interact with nodes i and j, then DI(Xi → Xj |Xk ) should be close to DI(Xi → Xj ). Otherwise, if k is an intermediate or hub node between i and j, DI(Xi → Xj |Xk ) should be close to 0. Therefore, we compute DI(Xi → Xj |Xk )/DI(Xi → Xj ), which is close to 1 when there are no interactions between k and i, j. We rank order this quantity from high to low for all k. The connections with the highest ranks are kept until a desired number of connections, ec, is achieved. 3.3.2 Algorithm based on time lag From equation (2.18), we can see that in some cases when d = 0, applying DI to every two time samples of X and Y can not capture the same amount of information as the actual causal information. 
Therefore, the proposed Algorithm 1 (Figure 3.2) based on DI and CDI (d = 0) without considering time lag will encounter the problem of not being able to capture 74 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: Input time series for M nodes, ec is the expected number of connections; Initialize D ∈ Rn×n , C ∈ [0, 1](n×n) , cD ∈ Rn×n as zero matrices; for i = 1 to M do for j = 1 to M do Di,j ⇐ T LDI0 (i → j); Shuffle the trials of the time series of node i; Calculate the p-value of each node pairs Pi,j ; if Pi,j < α then Ci,j = 1; else Ci,j = 0; end if end for end for for i = 1 to M do for j = 1 to M do if Ci,j == 1 then ( ) cDi,j ⇐ min DI(xi →xj |xk ) Di,j , k ̸= i, j; 19: else 20: cDi,j ⇐ 0; 21: end if 22: end for 23: end for 24: ca = cD(:); 25: cb = sort(ca ) in descending order and keep the top ec connections Figure 3.2: Algorithm 1: directed network inference based on DI and CDI the causal dependencies between two random processes when there is a nonzero time lag. Therefore, we propose an inference algorithm which utilizes time-lagged directed information and time lag to infer the direct interactions between nodes. We first show how time lag can be used to infer connections and then introduce the algorithm. In a complex system, if we assume that any node pair exchanges information through the fastest available path, then the indirect causal relationships can be detected by determining the time lag between two variables in the system. According to equation (2.18), for the single order linear model, the 75 time lag between two time series can be found as follows: T L(X → Y) = arg max d N −1 ∑ DI(Xk−d Xk−d+1 → Yk Yk+1 ). (3.12) k=d+1 where d = 1, · · · , L and N is the length of the signal. In order to illustrate the inference of causality using time lag, we will consider two trivariate autoregressive models considered in equations (3.2) and (3.3). For both cases, the time lag and the corresponding DI values of direct and indirect connections are shown in Figure 3.1. We can observe that although the DI value of any two of the random variables is significant, the value for the indirect connections (Y → Z in Figure 3.1(a), X → Z in Figure 3.1(b)) are smaller than the direct connections. Moreover, it is obvious that in these two models, the three time lags show a linear relationship. Therefore, in a real system with the assumption that any node pairs exchange their information through the fastest available route, for the chain case (X → Y → Z), if T L(X → Z) ≥ T L(X → Y) + T L(Y → Z), then Y is the intermediate node and there is no direct causal relationship between X and Z. While for the hub case (Z ← X → Y), if T L(Y → Z) ≥ T L(X → Z) − T L(X → Y) > 0, there is no direct causal relationship between Y and Z. Based on this observation, we introduce an algorithm based on time-lagged DI and time lag to infer the directed network which is shown in Figure 3.3. In practice, for the model with maximum order p > 1, the time-lagged DI is obtained by averaging across different 1 ∑p lags, i.e. p+1 d=0 T LDId (X N → Y N ). In the first step (lines 3 − 15), we calculate the time-lagged directed information DIi,j between two nodes i and j and the corresponding time delay T Li,j based on equations (2.21) and (3.12), respectively. p-value of DIi,j is determined for each pair of nodes. Connections with p-values less than the threshold α are kept with α determined by controlling the FDR bound q. In order to identify the direct vs. 
indirect connections (lines 16 − 27), for each node i, we consider connected triplets. For node i, if T Lj,k ≥ T Lj,i + T Li,k and DIj,k is smaller than both DIj,i and DIi,k , then i is the intermediate node (j → i, i → k) and there is no connection between j and k. Otherwise, if 76 1: Input time series for M nodes; 2: Initialize D ∈ Rn×n , T L ∈ (0, L − 1)n×n , P ∈ (0, 1)n×n , C ∈ [0, 1](n×n) , as zero matrices; 3: for i = 1 to M do 4: for j = 1 to M do 5: DIi,j ⇐ T LDI(i → j); 6: T Li,j ⇐ T L(i → j); 7: Shuffle the trials of the time series of node i; 8: Calculate the p-value of each node pairs Pi,j ; 9: if Pi,j < α then 10: Ci,j = 1; 11: else 12: Ci,j = 0; 13: end if 14: end for 15: end for 16: for i = 1 to M do 17: for j = 1 to M do 18: for k = 1 to M do 19: if Cj,i = 1 and Ci,k = 1 and Cj,k == 1 and T Lj,k ≥ T Lj,i + T Li,k and DIj,k < min (DIj,i , DIi,k ) then 20: Cj,k = 0; 21: end if 22: if Ci,j = 1 and Ci,k = 1 and Cj,k == 1 and T Lj,k ≥ T Li,k − T Li,j and DIj,k < min (DIi,j , DIi,k ) then 23: Cj,k = 0; 24: end if 25: end for 26: end for 27: end for Figure 3.3: Algorithm 2: directed network inference based on TLDI and TL T Lj,k ≥ T Li,k − T Li,j > 0 and DIj,k is smaller than both DIi,j and DIi,k , then node i is the hub node (i → j, i → k) and the connection between j and k should be removed. 3.3.3 Algorithm based on modified time-lagged directed information and conditional directed information Algorithm 2 (Figure 3.3)based on time-lagged directed information and time lag information can capture more causal dependencies than Algorithm 1. However, when the relationship 77 between two variables is complex, e.g. nonlinear or multi-order, it is hard to detect the exact time lag because of the complexity of the underlying model and limited sample sizes. Therefore, we propose an inference algorithm of using MDI and MCDI. To explain the algorithm clearly, we consider the hub and chain cases again. For triple connected nodes i, j, k (i → j, i → k, j → k), if j is the intermediate node, DI(i → k|j) = 0 for the ideal case. However, because of noise and the bias of the estimator, it will not be exactly equal to 0. Therefore, in order to confirm j is the intermediate node and remove the indirect connection i → k, two conditions should be satisfied: (1) DI(i → k|j) < DI(j → k|i); (2) DI(i → k) < DI(j → k). Similarly, if i is the hub node, then DI(i → k|j) > DI(j → k|i) and DI(i → k) > DI(j → k). If the two conditions contradict each other, we remove the connection with the smallest value of DI. Based on this analysis, our Algorithm 3 (Figure 3.4) is described as follows: First (lines 3−14), we calculate the modified time-lagged directed information Di,j from i to j according to equation (2.25). If the p-value of Di,j is larger than α, then there is no directed path from i to j. To remove the indirect causality (lines 15 − 39), for each connected triplet nodes i, j and k without a loop (i → j, i → k, j → k), which include both chain and hub connection patterns, we calculate the modified time-lagged conditional directed information (hub: cDjk|i , chain: cDik|j ). If cDjk|i and Dj,k are less (greater) than cDik|j and Di,k , respectively, we keep connection i → k (j → k). Otherwise, if Di,j is the largest value, we remove both connections i → k and j → k; if not, we remove the connection with the smallest DI value. 3.3.4 Validation In order to evaluate the performance of the different network inference algorithms, F-score is adopted [91]. 
The F-score can be interpreted as a weighted average of the precision and recall and is defined as: F = 2pr TP TP ,p = ,r = . p+r TP + FP TP + FN 78 (3.13) 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: Input time series for M nodes; Initialize D ∈ Rn×n , P ∈ (0, 1)n×n , C ∈ [0, 1](n×n) as zero matrices; for i = 1 to M do for j = 1 to M do Di,j ⇐ M DI(i → j); Shuffle the trials of the time series of node i; Calculate the p-value of each node pairs Pi,j ; if Pi,j < α then Ci,j = 1; else Ci,j = 0; end if end for end for for i = 1 to M do for j = 1 to M do for k = 1 to M do if Ci,j = 1 and Ci,k = 1 and Cj,k == 1 then Hub: cDj,k|i = M CDIY Z|X ; Chain: cDi,k|j = M CDIXZ|Y ; if cDjk|i < cDik|j AND Dj,k < Di,k then Cj,k = 0 else if cDjk|i > cDik|j AND Dj,k > Di,k then Ci,k = 0 else if Di,j > max (Di,k , Dj,k ) then Ci,k = 0; Cj,k = 0; else if Di,j < min (Di,k , Dj,k ) then Ci,j = 0 else if Di,k < Dj,k then Ci,k = 0 else Cj,k end if end if end if end for end for end for Figure 3.4: Algorithm 3: directed network inference based on MDI and MCDI 79 A positive label predicted by the algorithm is considered as true positive (TP) or false positive (FP) depending on whether there exists a corresponding edge in the true network or not. Similarly, a negative label can be a true negative (TN) or false negative (FN) depending on the true network. In this dissertation, we will compute the F-score before and after introducing CDI or time lag for each algorithm. 3.4 3.4.1 Results Synthetic data: Linear network In order to test the effectiveness of the proposed algorithms, we first consider a linear multivariate autoregressive model to reduce the impact of DI estimation on the accuracy of network inference. The following linear autoregressive model is considered [96],        Y1 (n)   Y1 (n − i)              Y2 (n)   Y2 (n − i) p ∑     = Ai      .  i=1  . .  .   .  .            Yr (n) Yr (n − i)   e1 (n)           e2 (n)   +     .   .   .       er (n)                (3.14) where Y(n) = [Y1 (n), Y2 (n), · · · , Yr (n)] is the nth sample of a r-dimensional time series generated by r variables, with each Ai being a r-by-r matrix of coefficients (weights), coefficients are in the range of 0 to 1 and also keep the system stable. en = [e1 (n), e2 (n), · · · , er (n)] is additive white Gaussian noise with zero mean and unit variance. In this dissertation, we generate a synthetic network of 18 nodes (r = 18) with the maximum time lag p = 4. The network contains only linear dependencies as shown in Figure 3.5(a). We generate 1024 realizations of the 18 different time series for each node and the results are averaged across 10 simulations. The order between any two time series is determined by the order selection criterion. We then compute the time lag and modified time-lagged DI over 12 time samples 80 according to equations (3.12) and (2.25) and compute the p-value for each DI value under the distribution of the null hypothesis. If the p-value of DI is less than the threshold α = 0.01 for each pair of nodes with α found by controlling the FDR bound q as 0.05, we keep the connection. Both time lag and time-lagged conditional directed information are used to eliminate the indirect connections. The F-scores using three different algorithms before and after applying CDI, time lag and MCDI are shown in Figure 3.5(b). 
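As an illustration of this validation step, the F-score of equation (3.13) can be computed directly from the inferred and ground-truth adjacency matrices, as in the following Python sketch (a hypothetical helper, not the evaluation code used for the reported results):

import numpy as np

def f_score(inferred, truth):
    """Eq. (3.13): harmonic mean of precision and recall for a directed network,
    comparing binary adjacency matrices (entries on the diagonal are ignored)."""
    inferred = np.asarray(inferred, dtype=bool).copy()
    truth = np.asarray(truth, dtype=bool).copy()
    np.fill_diagonal(inferred, False)
    np.fill_diagonal(truth, False)
    tp = np.sum(inferred & truth)
    fp = np.sum(inferred & ~truth)
    fn = np.sum(~inferred & truth)
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0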
We can observe that Algorithm 3 (Figure 3.4) reaches the highest F-score, 0.9351, and that introducing MCDI to remove the indirect causality relationships improves the network reconstruction. Introducing time lag information to remove the indirect connections leads to only a slight change in F-score, which is due to the difficulty of estimating the time lag accurately for a multi-order model, in particular when the number of realizations is limited. Algorithm 1 has a lower F-score compared to the other two algorithms, because it cannot capture the whole causal information between two variables or among multiple variables. However, introducing CDI, even without considering the time lag information, can still remove part of the indirect causality relationships. Without loss of generality, we permuted the matrices A1 and A2 50 times to compare the performance of the different algorithms further, and we set A3 and A4 to all-zero matrices to reduce the order and the computational complexity. Only the locations of the connections were changed, with the additional constraint that there are no connected triplets in the permuted networks. More importantly, in practice the number of realizations of each random process is limited; therefore, we only generate 256 realizations of the 18 different time series. The average F-score for each algorithm is shown in Table 3.1. We also compare the proposed algorithms based on DI with model based methods, such as PDC. The Matlab toolbox for PDC developed by Baccala is used [44]. In practice, the implementation of PDC depends on the length of the signal, while the performance of the DI based algorithms relies on the number of realizations of the time series. Therefore, to compare the two measures, the length of the generated signal for PDC is chosen as 256, which is the same as the number of realizations for DI. The results are shown in Table 3.1.

[Figure 3.5: The performance of the proposed algorithms for linear network inference. (a) The synthetic linear network. (b) Average F-score of Algorithm 1 before and after applying causally conditioned directed information without considering time lag information; average F-score of Algorithm 2 before and after applying time lag; average F-score of Algorithm 3 before and after applying MCDI.]

Table 3.1: Average F-score for the three proposed algorithms and PDC for the linear network
Algorithm          Mean
1  DI+CDI          0.8270
2  TLDI+TL         0.7903
3  MDI+MCDI        0.8301
4  PDC             0.7351

We observe that Algorithm 3 outperforms the other algorithms. The performance of Algorithm 1 is slightly worse than that of Algorithm 3, because DI without time lag information does not lose all of the causal information when the order of the model is low. The failure of Algorithm 2 is due to the inaccurate estimation of the time lag. PDC is based on an MVAR model and is expected to have better performance, but since the length of the signal is limited and there are not enough time samples to reconstruct the model, PDC fails to reveal the network structure compared to the other algorithms.

3.4.2 Synthetic data: Nonlinear network

In this subsection, we test our algorithms on a synthetic nonlinear network of 14 nodes from [95].
The network contains both linear and nonlinear causality as shown in Figure 3.6(a): X1 (n) = 0.7X1 (n − 1) + w1 (n); X2 (n) = 0.29X2 (n − 1) + 0.65X1 (n − 1) + w2 (n); X3 (n) = 0.15X3 (n − 1) + 0.79X2 (n − 1) + 0.9X14 (n − 1) 1 + e−6X14 (n−1) + w3 (n); X4 (n) = 0.17X4 (n − 1) + 0.7X3 (n − 1) + 0.7X6 (n − 1) + w4 (n); X5 (n) = 0.6X5 (n − 1) + 0.8X4 (n − 1) + w5 (n); √ 0.6 + 0.8X9 (n − 1) X6 (n) = 0.12X6 (n − 1) + 0.8X7 (n − 1) + 0.7X8 (n − 1) + + w6 (n); 1 + e−2X7 (n−1) X7 (n) = 0.8X7 (n − 1) + w7 (n); X8 (n) = 0.7X8 (n − 1) + w8 (n); X9 (n) = 0.77X9 (n − 1) + w9 (n); X10 (n) = 0.4X10 (n − 1) + 0.4[X11 (n − 1)]2 + w10 (n); X11 (n) = 0.7X11 (n − 1) + 0.7X14 (n − 1) 1 + e−6X8 (n−1) + w11 (n); X12 (n) = 0.4X12 (n − 1) + 0.8X11 (n − 1) + w12 (n); X13 (n) = 0.4X13 (n − 1) + 0.9X11 (n − 1) + w13 (n); X14 (n) = 0.65X14 (n − 1) + w14 (n); (3.15) where w1 , · · · , w14 are white Gaussian random processes. We generate 512 realizations of the 14 different time series for each node and compute the time-lagged DI over 12 time samples. The p-values of each DI value were computed and the connections with p-values less than α are kept. To remove the indirect connections, algorithms based on time lag and MCDI were used, respectively. We generate this simulation 10 times and the averaged F-score before and after applying CDI without lag information, time lag and MCDI for each algorithm are 83 shown. From Figure 3.6(b), we see that the proposed algorithm using both MDI and MCDI has better performance compared to other two algorithms, and obtains a F-score of 0.9455, which effectively detects all linear and nonlinear dependencies. Algorithm 1 using DI and CDI without considering any time lag information performs worst since it may lose important causal information to reveal the structure of the network. Algorithm 2 performs better than Algorithm 1 but worse than Algorithm 3, which is due to the inaccurate estimation of the time lag in a complex network with nonlinear interactions. Therefore, we observe that timelagged directed information and modified directed information can effectively detect most of the causal interactions, and time-lagged conditional directed information and modified conditional directed information are able to effectively removes most of the false positive connections. Linear Nonlinear 1 1 2 F-score 0.8 3 4 5 7 8 0.4 14 6 0.2 11 9 10 0.6 12 13 0 Algorithm 1 Algorithm 2 Algorithm 3 (b) (a) Figure 3.6: The performance of proposed algorithms for nonlinear network inference. (a) The synthetic nonlinear network. (b) Average F-score of Algorithm 1 before and after applying causally conditioned directed information without considering time lag information; average F-score of Algorithm 2 before and after applying time lag; average F-score of Algorithm 3 before and after applying MCDI. Similarly, to verify the effectiveness of our algorithms further, we permuted the matrices A ten times, which only changes the location of the connections with the additional constraint that there are no connected triplets. The average F -score for each algorithm is shown in 84 Table 3.2: Average F -score for three algorithms of nonlinear network Algorithm Mean 1 DI+CDI 0.6580 2 TLDI+TL 0.7798 3 MDI+MCDI 0.8175 4 PDC 0.7087 Table 3.2. Similarly, we also apply PDC to the randomized networks and the length of the generated signal for PDC is set to 512. The performance of PDC is also shown in Table. 3.2. 
Our proposed algorithm has the highest F score among all the algorithms, indicating more efficient and stable performance for nonlinear networks. The performance of PDC is worse than Algorithm 2 and 3, because PDC is implemented under a linear AR frame work, which is not aligned with the actual model of the signal and fails to reconstruct the network correctly. 3.5 3.5.1 Discussions Problems with current algorithms The proposed algorithm based on modified directed information and modified conditional directed information can effectively infer the directed network. However, there are still some issues remaining with the current algorithms. First, the bias and variance of DI estimator depends on the sample size, which may affect the significance testing of the DI value and in turn may increase false negative connections. Second, the algorithm is based on the assumption that the network is acyclic and has no triple connected nodes. The algorithm considers connections in a connected triplet and break at least one of the connections. Therefore, if a particular connection does not connect any triplets, the algorithm will keep it without applying conditional directed information on it. In fact, algorithms in this thesis can be extended 85 to cyclic networks by stopping before the step of removing the connection with the smallest DI value. Third, the detection of time lags is limited to linear models with single order. When the model is complex, the effectiveness of determining the time lag according to the maximum time-lagged directed information needs to be further proven. Finally, the proposed algorithms just consider the scenario that there is only one intermediate node between two nodes. If there are two or more intermediate nodes between X and Z, i.e., X → Y → W → Z, the algorithm may fail to remove the indirect connection by computing the CDI from X to Z only conditioned on either Y or W. In this case, DI(X N → Z N ||Y N W N ) needs to be considered to remove the false positive connection X → Z. However, the computational complexity of CDI increases with the number of nodes involved, which is hard to implement. 3.5.2 Comparison with existing algorithms The major contributions of this chapter are using the directed information and conditional directed information to infer a directed network. We propose modified conditional directed information for both chain and hub cases to remove the indirect causality and at the same time to reduce the computational complexity. Model-dependent methods such as PDC have been widely used for effective network inference. Though they have good performance when the system is linear, there are two main issues with model dependent measures. First, model-dependent measures fail to detect the strong time-lagged nonlinear couplings, which might be the case for most complex systems. Second, the number of parameters to be estimated increases with the order of the model, which requires a large sample size. When the sample size is limited, model-dependent measures may fail to model the system accurately and detect the directed connectivity. On the other hand, because DI is a model free measure to capture the causal information in a system, our algorithms based on DI are easier to implement for complex systems with strong nonlinear relationships and high orders. Though the performance of our measure is also influenced by the sample size, the modified version of DI and CDI address this problem. 
86 3.6 Conclusions In this chapter, we pointed out the drawback of only using directed information for network inference and introduced the conditional directed information to address this problem. Moreover, time-lagged conditional directed information and the modified conditional directed information are proposed to reduce the computational complexity. Three algorithms are proposed for network inference and the results for both linear and nonlinear networks are shown. The simulation results show that our methods outperform the existing measures. 87 Chapter 4 COMMUNITY DETECTION FOR DIRECTIONAL NEURAL NETWORKS 4.1 Introduction A remaining issue with the analysis of the brain network is to reveal the underlying community structure. The complex network theory has been used to show that both the functional and structural brain networks follow a small world topology characterized by a short minimum path length between all pairs of nodes in the network together with a high clustering coefficient [97, 34]. Although small-worldness summarizes key aspects of complex networks at both global (the whole network) and local (each node together in relationship with its most immediate neighbors) levels, it does not provide any information about the intermediate scale of network organization which is more completely described by the community structure or modularity of the network [35, 98]. The modules of a complex network are subsets of nodes that are densely connected with each other but sparsely connected to nodes in other modules. Module detection also allows one to obtain simplified reduced representations of complex networks in terms of subgraphs or communities. In this chapter, we address the issue of community detection for effective brain networks. In recent years, a lot of work has been done on applying community detection algorithms from graph theory to the study of functional brain networks [99, 35, 36, 37]. Functional brain networks are usually described by undirected graphs with corresponding symmetric association matrices, where each entry indicates the pairwise functional connectivity between two regions. Therefore, most of the work on community detection for the study of brain networks has focused on undirected networks. For example, Fair et al. showed that young children and young adults have different community structures in functional brain networks from the study of resting state fMRI data [35]. Similarly, Ferrarini et al. showed that the resting-state human brain has a hierarchical functional module structure [36] and Meunier et 88 al. revealed age-related changes in the modular structure of human brain functional networks from fMRI [37]. Chavez et al. pointed out that the modular structure of the human brain provides important information on the functional organization of the brain during normal and pathological neural activities [38]. However, in the study of brain networks, it is important to quantify both the dependency between different nodes or neuronal populations in the brain as well as the causality between these nodes, i.e. effective connectivity as we mentioned in the previous chapters. Friston pointed out that functional integration of the brain can be better understood through effective connectivity since it reflects the dynamic (activity dependent and time dependent) characteristics of a system [12]. 
In this sense, the brain network can be better described by an effective network where the edges of the graph have direction and the corresponding association matrix is no longer symmetric. As Kim et al. claimed approaches that ignore the direction of links may fail to understand the dynamics of the system, and similarly any community detection approach on these networks may fail to reveal the actual community structure [100]. In addition, Leicht et al. also states that abundant useful information about a network’s structure will be lost if we ignore the directions of the edges [101]. Therefore, we expect that using effective connectivity would reveal new topological characteristics of the brain network [102]. In this chapter, we propose a multi-subject hierarchical community detection algorithm for weighted directed networks in order to reduce effective brain networks involved in cognitive control into a small number of functional modules. The approach outlined in this chapter advances the current study of brain networks in several key ways. First, the pairwise relationship between two processes is quantified by effective connectivity, which can reflect both the interaction and the direction of information flow of the network. Second, we employ recent work in the area of community detection in directed networks for infering functional modules. Although most of the literature on community detection focuses on undirected networks, a significant amount of information about a network’s structure will be lost if we ignore the directions of the edges [101]. In 89 order to discover the underlying organization of the network, traditional clustering algorithms such as Kernighan-Lin algorithm, agglomerative (or divisive) algorithm and k-means clustering, have been widely used. However, these algorithms need to pre-determine the number of clusters [103, 104, 105, 106]. Therefore, modularity based algorithms are widely used to choose the best partition of a network by maximizing the modularity, which include greedy techniques, spectral optimization, etc. [39]. Recently, Blondel introduced a greedy approach for the modularity optimization for weighted graphs, which is proven to be efficient, multi-level and close to the optimal value obtained from slower methods [1]. In this chapter, we extend this algorithm to weighted directed networks to find the functional modules. Third, we extend community detection algorithms developed for single networks to a group of networks in order to find a modular structure that best describes all of the subjects in a group. Traditionally, two broadly used approaches are employed for group analysis, i.e., ‘virtual-typical-subject’ (VTS) approach and ‘individual structure’ (IS) approach. The former approach pools or averages the group data to obtain one community structure for the whole group. The latter one applies the proposed algorithm to each individual subject and finds the common structure for all of them by employing different strategies, such as averaging the community structures across subjects, voting for the community structure that the majority of the subjects agree on, or finding the most representative subject whose structure is the most similar to the other subjects in the group. However, both VTS and IS approaches ignore the between-subject variability and the results may be influenced by outliers [34]. 
In this chapter, instead of dealing with individual results from each subject, we focus on modifying the community detection algorithm itself by combining all of the information from the data. To be specific, instead of maximizing the modularity and identifying the communities of each subject separately, we detect the community structure of a group by maximizing the total modularity of the group. Our proposed method has the advantage of being more computationally efficient than IS, because IS spends a large amount of time in detecting the community structure for each subject. Finally, the algorithm proposed in this chapter is applied to multivariate EEG recordings, which offer a better view of the dynamic change of community structure with high temporal resolution compared to fMRI.

4.2 Background

4.2.1 Modularity

The concept of modularity is motivated by the idea that nodes in the same module have very dense connections and nodes in different modules have sparse inter-module connections [101]. Modularity was proposed as a quality function to choose the best partition of a network such as in the Girvan-Newman algorithm [98] and is widely used as an optimization criterion in many graph clustering methods [107, 34, 1, 101, 108]. A good partition of a network has high modularity Q, with Q = (fraction of edges within communities) − (expected fraction of such edges) [98], where the expected fraction of edges is evaluated for a random graph. The original expression of modularity for undirected binary networks is given as
$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta_{c_i,c_j}, \qquad (4.1)$$
where A is the adjacency matrix with $A_{ij} = 1$ if i and j are connected and $A_{ij} = 0$ otherwise, $k_i$ is the degree of vertex i, and $\delta_{c_i,c_j}$ is equal to 1 when i and j are in the same community and is equal to 0 otherwise. For a directed network, the probability of a directed edge relies on the in-degree and out-degree of the vertex. Leicht et al. extended the definition of modularity to directed binary networks as [101]
$$Q_d = \frac{1}{m}\sum_{i,j}\left[A_{ij} - \frac{k_i^{out} k_j^{in}}{m}\right]\delta_{c_i,c_j}, \qquad (4.2)$$
where $k_i^{in}$ ($k_i^{out}$) is the in-degree (out-degree) of vertex i. Arenas et al. gave a more general expression of modularity for directed weighted networks [108],
$$Q_{gen} = \frac{1}{W}\sum_{i,j}\left[A_{ij} - \frac{s_i^{out} s_j^{in}}{W}\right]\delta_{c_i,c_j}, \qquad (4.3)$$
where $A_{i,j}$ is the weight of edge $e_{i,j}$, $s_i^{in}$ ($s_i^{out}$) is the in-degree (out-degree) of vertex i, and $W = \sum_{i,j} A_{i,j}$. In this chapter, $A_{i,j}$ is quantified by the directed information measure introduced in Chapter 1.

4.2.2 Overview of community detection methods

Uncovering community structure is a key step to the understanding of a complex network. The idea of community detection is closely related to graph partitioning in computer science and hierarchical clustering in sociology. Therefore, traditional approaches in these two areas can be employed for community detection [109, 39]. The Kernighan-Lin algorithm is one of the representative heuristic algorithms in graph partitioning [103]. It starts with an initial partitioning of the graph into two clusters with pre-defined sizes and tries to maximize the difference between the number of edges inside the modules and the number of edges lying between modules. However, the pre-defined size of a cluster is usually unknown and the performance of this algorithm is highly dependent on the initial partition of the graph [39].
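To make the generalized modularity of Eq. (4.3) concrete, the following minimal sketch evaluates $Q_{gen}$ for a weighted directed association matrix and a given community assignment. It is illustrative only (Python/NumPy); the function name and the toy matrix are ours and are not part of the thesis code.

```python
import numpy as np

def directed_weighted_modularity(A, labels):
    """Evaluate Q_gen of Eq. (4.3).

    A[i, j] is the weight of the directed edge i -> j (e.g., a normalized DI
    value); labels[i] is the community label of node i.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    W = A.sum()                                # total weight of the network
    s_out = A.sum(axis=1)                      # weighted out-degrees
    s_in = A.sum(axis=0)                       # weighted in-degrees
    expected = np.outer(s_out, s_in) / W       # s_i^out * s_j^in / W
    same_community = labels[:, None] == labels[None, :]   # delta_{c_i, c_j}
    return (A - expected)[same_community].sum() / W

# Toy example: two groups of three nodes with heavier within-group weights.
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 0.2, size=(6, 6))
A[:3, :3] += 0.6
A[3:, 3:] += 0.6
np.fill_diagonal(A, 0.0)
print(directed_weighted_modularity(A, [0, 0, 0, 1, 1, 1]))
```

Replacing the outer product with $k_i k_j / 2m$ (and W with 2m) recovers the undirected binary expression of Eq. (4.1). The algorithms discussed in the rest of this section differ in how they search for the partition that makes this quantity large.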
On the other hand, hierarchical clustering algorithms do not require any a priori knowledge of clusters and try to group nodes into the same community (agglomerative) or split nodes into different clusters (divisive) by quantifying the similarity between pairs of nodes. The results are presented in a dendrogram or tree, but the algorithms fail to express the best level of partitioning during the processes. To find the optimal partitioning, Girvan-Newman algorithm proposed to not only use the ‘edge betweenness’ measure to remove edges from the network, but also to use the modularity function Q as a stopping criterion to choose the best level of partitioning [98]. In addition, the proposition of modularity function Q leads to the development of another category of community detection algorithms which is based on the optimization of modularity. To optimize modularity, greedy techniques, simulated annealing, extremal optimization and spectral optimization are widely employed. Unlike simulated annealing and extremal optimization, which may encounter the problems of high computational complexity and not being applicable to large networks [110, 39, 111], the 92 greedy strategies and spectral optimization are more commonly used for finding the community structure of a network with the maximum modularity. Newman was the first to propose a greedy strategy to maximize the modularity [112] and since then several improvements of the algorithm were proposed to improve the detection accuracy [39, 113, 114]. Later, Newman proposed a more efficient approach based on the bipartition of the modularity matrix, which uses the eigenvalues and eigenvectors of the modularity matrix to find a solution for the community detection problem (spectral optimization) [107]. Although most of the modularity-based community detection algorithms have focused on binary and undirected networks, in recent years there have been some extensions to weighted and directed networks. Because a weighted and directed link can reflect both the strength and direction of the interaction between two nodes, which can reveal more characteristics of a system than the binary and undirected network. For example, Leicht et al. extended the definition of modularity to binary directed networks and employed spectral optimization to find the community structure for directed networks. This technique works well for bipartites, but is less applicable to networks with a large number of communities [39] and weighted links. An alternative approach is to represent the directed network as an undirected bipartite network. To do this, each node in the network is split into two nodes, with one node only receiving information and the other node only sending information. In this way, community detection algorithms for bipartite networks can be employed for directed networks [115]. However, this approach is limited to networks with a small number of clusters. Recently, Blondel et al. introduced an alternative greedy algorithm to find the hierarchical structure of undirected weighted graphs [1], the computation time of which is comparable to spectral optimization [39]. In this chapter, we extend this algorithm to directed weighted graphs for community detection. 93 4.2.3 Group analysis approaches In many neuroimaging studies, extracting a common set of features or a representation for a group of subjects such as the common community structure is more important than extracting features for individual subjects. 
This common structure usually gives us an overall understanding of the group, while individual subject level representations show the subjectspecific features. There are three major group analysis strategies that can be employed for community detection, i.e., the ‘virtual-typical-subject’ (VTS) approach, the ‘individual structure’ (IS) approach, and the algorithm-based approach [116, 117]. The VTS approach assumes that data from each subject performs the same function or follows the same distribution. It reconstructs a virtual subject by pooling or averaging the group data or connectivity matrices and obtains one community structure for the whole group. However, this approach does not consider the inter-subject variability and may fail when the behavior from subject to subject is not consistent [116, 117]. Therefore, the results may not reflect any of the features seen in individual analysis [118]. The IS approach applies a community detection algorithm to every individual subject and finds a unanimous community structure from these individual structures. Since this subject-specific strategy considers diversity across subjects, various strategies are employed to integrate inconsistent results. A common community structure can be obtained through either averaging individual structures or voting/consensus algorithms [119]. For example, replicator dynamics proposed by Neumann et al. can be used to capture nodes that are jointly presented in the same cluster across subjects by analyzing the structures obtained from individual subjects [120]. An alternative strategy to combine the results in IS is to find the most representative subject, whose structure is the most similar to all other subjects in the group [34] and can be used to represent the structure of the whole group. However, this approach may lose important information provided by other subjects, and may be even worse than just averaging the community structure over subjects. The IS approach is usually computationally expensive since it requires the extraction of the community structure for each subject before obtaining a common structure. 94 Both the VTS and IS approaches focus on either preprocessing the data or post-processing the community structures obtained from each subject. However, none of these approaches reveals the community structure of multiple subjects by directly extending the community detection algorithm to multiple subjects. Recently, some work has been done to address the data-fusion or group inference problems at the algorithm level. Mechelli et al. constructed a network or covariance matrix that comprises of m nodes from n subjects and assumed the same model with different model parameters for all subjects [117]. However, this strategy is more suitable for model-based approaches rather than clustering problems. For data-driven problems, Correa et al. extended the canonical correlation analysis (CCA) to multi-set CCA by optimizing an objective function of the correlation matrix of multiple canonical variates instead of two such that the correlation among multiple variates is maximized [121]. Similar to the multi-set CCA, which redefines the optimization problem for multiple subjects, in this chapter, we propose a group analysis method by optimizing a common modularity function of directed networks from multiple subjects. 4.3 Algorithm for community detection In this section, we first extend the method proposed by Blondel et al. to weighted directed networks [1] and then propose an extension for group analysis. 
The algorithm proposed by Blondel et al. is originally a modularity based community detection algorithm for undirected weighted networks. To reveal the community structure of an undirected weighted network, it maximizes the modularity through a greedy search. Initially, all nodes of the network are in different communities. The algorithm is divided into two phases. In the first step, for each node i, the gain in the modularity when node i is assigned to the community of its neighbor j ($A_{i,j} \neq 0$) is computed, and node i is assigned to the community which has the largest positive increase in modularity. This procedure is repeated for all the nodes until the modularity does not increase any more. In the second step, a new network is built by aggregating those nodes in the same community at the first step and forming meta-nodes.

[Figure 4.1: Hierarchical optimization of modularity by Blondel et al. [1]. The algorithm is divided into two phases. First, each node is assigned to a community and the algorithm tries to combine small communities by optimizing modularity locally. Second, it builds a new network by aggregating those nodes in the same community at the first step. These two steps are repeated iteratively until a maximum of modularity is reached.]

The weight of the edge between two meta-nodes is the sum of the weights of edges between nodes in the two corresponding communities from the previous step. These two steps are repeated iteratively until a maximum modularity is reached. Compared to the existing modularity based methods, this approach is fast and can reveal the hierarchical structure of a network [39, 1]. The change of modularity is always computed with respect to the initial graph to guarantee the convergence of the algorithm [39]. An illustration of this algorithm is shown in Figure 4.1.

4.3.1 Algorithm for community detection in weighted directed networks

In this subsection, in order to reveal the community structure of the functional human brain network, which is known to have a hierarchical structure [37], we propose to extend Blondel's approach to directed weighted networks. Initially, all vertices of the graph are put in different communities. The algorithm for uncovering the community structure of a directed weighted network consists of two steps. First, for each node i, the gain in the modularity $\Delta Q_{gen_j}$ is computed when the node is assigned to the communities of all other nodes j, where $j = 1, \cdots, N$, $j \neq i$. The original algorithm only evaluates the change of modularity when node i is assigned to the communities of its neighbors j, where j is defined as a neighbor of i when $A_{i,j} \neq 0$, which may be inaccurate and yield spurious partitions in practical cases [39]. For this reason, we consider the change of modularity with respect to all other nodes. Once $\Delta Q_{gen_j}$ is obtained for $j = 1, \cdots, N$, $j \neq i$, the community for which $\Delta Q_{gen_j}$ is positive and highest is chosen as the new community for node i.
$\Delta Q_{gen_j}$, which partly determines the efficiency of the algorithm, can be computed as follows,
$$\Delta Q_{gen_j} = \frac{1}{W}\sum_{p=1}^{N_j}\left[A_{i,j_p} - \frac{s_i^{out} s_{j_p}^{in}}{W} + A_{j_p,i} - \frac{s_i^{in} s_{j_p}^{out}}{W}\right] - \frac{1}{W}\sum_{p=1}^{N_i-1}\left[A_{i,i_p} - \frac{s_i^{out} s_{i_p}^{in}}{W} + A_{i_p,i} - \frac{s_i^{in} s_{i_p}^{out}}{W}\right], \qquad (4.4)$$
where $j_p \in C_j$, $i_p \in C_i$ and $i_p \neq i$, $N_j$ ($N_i$) is the number of nodes in community $C_j$ ($C_i$) to which node j (i) belongs, $s_i^{in}$ ($s_i^{out}$) is the in-degree (out-degree) of vertex i, and $W = \sum_{i,j} A_{i,j}$. The first term on the right hand side of the above equation is the gain of modularity when node i moves to the community of node j, $C_j$, while the second term is the modularity gained when node i stays in its original community $C_i$. This process is sequentially and repeatedly applied to all nodes until there is no gain in modularity. At this stage, the partition of the network at the first level is obtained. Next, nodes in the same community after the first step are used to form several meta-nodes. The number of meta-nodes is equal to the number of current communities and the weights between any two meta-nodes are given by the sum of the weights of edges between nodes in the corresponding communities [1],
$$A_{new}(i_{new}, j_{new}) = \sum_{i=1}^{N_i}\sum_{j=1}^{N_j} A_{i,j}, \qquad (4.5)$$
where $i_{new}, j_{new} = 1, \cdots, t_N$ with $t_N$ being the current number of meta-nodes, and $N_i$, $N_j$ are the numbers of nodes in the two clusters. Note that since the network is directed, $A_{new}(i_{new}, j_{new}) \neq A_{new}(j_{new}, i_{new})$. The two steps are iterated until the modularity cannot increase anymore, and several levels of partitions are obtained at different resolutions.

Require: Weighted adjacency matrix $A \in (0,1)^{N \times N}$, nodes $1, \cdots, N$, initial community structure $C = \{\{1\}, \cdots, \{N\}\}$, $t_N = N$.
Ensure: M communities.
1: Compute the modularity Q of the network;
2: repeat
3:   $\Delta Q_{total} = 0$, $n_{change} = 0$;
4:   repeat
5:     for h = 1 to $t_N$ do
6:       for j = 1 to $t_N$ do
7:         Compute the change of modularity $\Delta Q_{gen_j}$ when node h is assigned to $C_j$;
8:       end for
9:       $j^* = \arg\max_j \Delta Q_{gen_j}$;
10:      if $\Delta Q_{gen_{j^*}} > 0$ then
11:        $C_{j^*} = C_{j^*} \cup v_h$;
12:      end if
13:    end for
14:    Compute the change of the modularity $\Delta Q_{total} = Q_{new} - Q$;
15:    if $\Delta Q_{total} > 0$ then
16:      $Q = Q_{new}$, $n_{change} = n_{change} + 1$;
17:    end if
18:  until $\Delta Q_{total} \leq 0$
19:  Nodes in the same community form new meta-nodes;
20:  $t_N$ is equal to the current number of communities;
21:  Recompute the weighted matrix $A \in (0,1)^{t_N \times t_N}$;
22: until $n_{change} = 0$.

Figure 4.2: Algorithm 4: community detection of weighted networks

Low resolution indicates that the number of nodes (meta-nodes) is small. The first level of partitioning (before the formation of meta-nodes) has the highest resolution. The modularity is always computed with respect to the initial graph topology; in this way, the two-step iterative procedure will not be trapped at a local maximum. This algorithm is summarized in Figure 4.2.

4.3.2 Algorithm for community detection for multiple subjects

One of the remaining challenging problems for the application of this algorithm is the group analysis when information from multiple subjects needs to be merged. As we mentioned in the background section, the standard approach to group analysis is based on either averaging the data or averaging the detected community structure from each subject. In this subsection, we propose a community detection algorithm for multiple subjects by integrating the information from each subject at the algorithm level, which can take both the inter-subject variability and commonality into account.
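Before turning to the details of the multi-subject extension, the local move underlying Algorithm 4 can be sketched as follows. The code is only an illustration (Python/NumPy; the function names are ours, not the thesis implementation): it evaluates the gain of Eq. (4.4) for each candidate community and picks the best positive move for one node, with the gain summed over subjects so that a single network is simply the case L = 1.

```python
import numpy as np

def modularity_gain(A, labels, i, c_new, W, s_in, s_out):
    """Gain of Eq. (4.4) when node i is moved from its community to c_new.

    A[i, j] is the directed weight i -> j; labels holds the current community
    of every node; W, s_in and s_out are precomputed once from A.
    """
    labels = np.asarray(labels)

    def attachment(nodes):
        # summed (observed - expected) weight between node i and `nodes`,
        # counted in both directions as in Eq. (4.4)
        return sum(A[i, p] - s_out[i] * s_in[p] / W
                   + A[p, i] - s_in[i] * s_out[p] / W for p in nodes) / W

    others = np.arange(len(labels)) != i
    to_new = attachment(np.flatnonzero((labels == c_new) & others))
    to_old = attachment(np.flatnonzero((labels == labels[i]) & others))
    return to_new - to_old

def best_move(subject_matrices, labels, i):
    """One local step of the greedy optimization; the gain is summed over the
    subjects' matrices (Section 4.3.2), so L = 1 gives the single-subject case."""
    labels = np.asarray(labels)
    stats = [(A, A.sum(), A.sum(axis=0), A.sum(axis=1)) for A in subject_matrices]
    candidates = set(labels.tolist()) - {labels[i]}
    if not candidates:
        return labels[i], 0.0
    gains = {c: sum(modularity_gain(A, labels, i, c, W, s_in, s_out)
                    for A, W, s_in, s_out in stats)
             for c in candidates}
    c_best, g_best = max(gains.items(), key=lambda kv: kv[1])
    return (c_best, g_best) if g_best > 0 else (labels[i], 0.0)
```

In the full algorithm this step is repeated over all nodes until no positive gain remains, followed by the meta-node aggregation of Eq. (4.5).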
To be specific, for the algorithm proposed above, we only take into account the change of modularity for moving one node to the community of another node for each subject. However, this move changes the modularity of all subjects. Therefore, we compute the change in modularity $\Delta Q^k_{gen_j}$ for subject k when assigning node i to the communities of all other nodes j, where $j = 1, \cdots, N$, $j \neq i$, and $k = 1, \cdots, L$ with L being the number of subjects, and try to maximize a common modularity function, i.e., the sum of the gain in modularity over all subjects, $\Delta Q_{gen_j} = \sum_{k=1}^{L} \Delta Q^k_{gen_j}$. In this way, the effect of outliers is directly decreased at the algorithm level. The details are shown in Figure 4.3.

Require: Weighted adjacency matrices $A_i \in (0,1)^{N \times N}$, $i = 1, \cdots, L$ with L being the number of subjects, nodes $1, \cdots, N$, initial community structure $C = \{\{1\}, \cdots, \{N\}\}$, $t_N = N$.
Ensure: M communities.
1: Compute the modularity $Q_i$ of subject i, $i = 1, \cdots, L$;
2: The modularity of the group is $Q = \sum_{i=1}^{L} Q_i$;
3: repeat
4:   $\Delta Q_{total} = 0$, $n_{change} = 0$;
5:   repeat
6:     for h = 1 to $t_N$ do
7:       for j = 1 to $t_N$ do
8:         Compute the change of modularity $\Delta Q^i_{gen_j}$ when node h is assigned to $C_j$ for subject i, $i = 1, \cdots, L$;
9:         The change for the whole group is $\Delta Q_{gen_j} = \sum_{i=1}^{L} \Delta Q^i_{gen_j}$;
10:      end for
11:      $j^* = \arg\max_j \Delta Q_{gen_j}$;
12:      if $\Delta Q_{gen_{j^*}} > 0$ then
13:        $C_{j^*} = C_{j^*} \cup v_h$;
14:      end if
15:    end for
16:    Compute the modularity of the whole group $Q_{new} = \sum_{i=1}^{L} Q_i^{new}$, where $Q_i^{new}$ is the new modularity of subject i;
17:    Compute the change of the modularity $\Delta Q_{total} = Q_{new} - Q$;
18:    if $\Delta Q_{total} > 0$ then
19:      $Q = Q_{new}$, $n_{change} = n_{change} + 1$;
20:    end if
21:  until $\Delta Q_{total} \leq 0$
22:  Nodes in the same community form new meta-nodes;
23:  $t_N$ is equal to the current number of communities;
24:  Recompute the weighted matrix $A_i \in (0,1)^{t_N \times t_N}$ of subject i, $i = 1, \cdots, L$;
25: until $n_{change} = 0$.

Figure 4.3: Algorithm 5: community detection of multiple weighted networks

4.4 Results

In this section, we first illustrate the importance of edge direction information for revealing the real structure of a directed network by a simulated synthetic network. We then test the effectiveness of the proposed community detection algorithm for group analysis on both synthetic networks and real EEG data.

4.4.1 Directed vs. undirected networks

Most of the existing community detection algorithms are intended for the analysis of undirected and binary networks. However, many networks of interest, such as biological networks, are directed. One approach that has been commonly employed for community detection in directed networks is to directly apply the algorithms designed for undirected networks without considering the edge direction information [101]. To illustrate the importance of edge direction information for community detection, we generate a simulated directed network and employ two strategies to detect the community structure of this network. The first approach is to apply our proposed algorithm to the association matrix of the directed network A directly, and the second approach is to ignore the edge direction information and apply the original Blondel's algorithm to the association matrix of the undirected network $\frac{1}{2}(A + A^T)$, where $A^T$ is the transpose of matrix A. The simulated network consists of 24 nodes and 2 clusters, with each cluster having 12 nodes. Nodes 1 to 12 are in the same cluster, while the rest are in the other cluster.
Each entry of the association matrix A is uniformly distributed between [0, 1], which resembles the normalized DI value, and A is not symmetric. The mean of the connectivity strength in each cluster is 0.5. The mean inter-cluster connectivity strength from cluster 1 to 2 is 0.9, and from 2 to 1 it is 0.1. The standard deviation of the connectivity strengths in each cluster is 0.2. The community detection results are evaluated by computing the percentage of false discoveries, or false discovery rate (FDR), F,
$$F = \frac{\sum_{i,j} |O_{i,j} - M_{i,j}|}{N^2}, \qquad (4.6)$$
where N is the number of nodes and O is a binary matrix with entries equal to 1 if nodes i and j are in the same cluster. If nodes i and j are identified to be in the same cluster by the algorithm, then $M_{i,j} = 1$, and $M_{i,j} = 0$ otherwise. The community detection results are shown in Figure 4.4. We observe that the proposed algorithm can detect the community structure of the network if we use the edge direction information (Figure 4.4(b)), whereas the original algorithm designed for undirected networks fails to capture the actual community structure (Figure 4.4(c)). Without loss of generality, we generate the network 100 times and the average false discovery rates for both approaches are obtained. If we consider the edge direction, the false discovery rate is 0.0423, and it is 0.4789 otherwise. Therefore, our algorithm has good performance for community detection of directed networks and can reveal the real structural information of the directed network that conventional clustering algorithms for undirected networks cannot. In fact, when the association matrix of a network is strongly asymmetric or the total in-degree and out-degree for each node are significantly different, the difference between the community detection results of the directed and undirected representations of the network will be relatively large, and the community detection results based on representing the network with directed weighted graphs can reveal the real community structure of the network.

[Figure 4.4: Community detection of different representations of the network. Community membership matrices for (a) the actual community structure, (b) the community structure obtained from the proposed algorithm which considers edge direction information, and (c) the community structure obtained from Blondel's original algorithm which does not consider edge direction. White indicates that the corresponding node pairs are not in the same cluster (cluster N/A). Gray indicates that the corresponding node pairs are in cluster 1. Brown indicates that the corresponding node pairs are in cluster 2.]

4.4.2 Group analysis on synthetic data

In this subsection, we evaluate the performance of the proposed group analysis algorithm by applying it to a single simulated network and a group of simulated networks. In addition, we compare our proposed group analysis algorithm with alternative group analysis approaches to show the effectiveness of our method. First, we test our algorithm on a directed network with 64 nodes and 4 clusters, with each cluster having 16 nodes. Each entry of the association matrix is uniformly distributed between [0, 1], which resembles the normalized DI value. The means of the intra-cluster connectivity strength in the four clusters are 0.3, 0.5, 0.7, and 0.9, respectively. The mean of the inter-cluster connectivity is 0.15.
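The false discovery rate of Eq. (4.6) used throughout these simulations can be computed as in the brief sketch below (our own illustrative code, not the thesis implementation); it simply compares the true and the detected membership matrices entry by entry.

```python
import numpy as np

def membership_matrix(labels):
    """O[i, j] = 1 when nodes i and j carry the same community label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def false_discovery_rate(true_labels, detected_labels):
    """Fraction of node pairs on which the detected partition disagrees with
    the ground truth, as in Eq. (4.6)."""
    O = membership_matrix(true_labels)
    M = membership_matrix(detected_labels)
    return np.abs(O - M).sum() / len(true_labels) ** 2

# For the 24-node, 2-cluster example above:
truth = [0] * 12 + [1] * 12
print(false_discovery_rate(truth, truth))      # 0.0 for a perfect recovery
print(false_discovery_rate(truth, [0] * 24))   # 0.5 when everything is merged
```

Because the comparison is made between membership matrices rather than labels, the measure is invariant to a relabeling of the clusters.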
To demonstrate the robustness of the algorithm, the standard deviation of these distributions is varied from 0.1 to 0.5 with a step size of 0.1. Without loss of generality, we generate the network 50 times and the average false discovery rate F is obtained for different standard deviations. The result is shown in Figure 4.5. We observe that the false discovery rate grows with increasing standard deviation of the strength of connectivity for each edge. Even so, the maximum false discovery rate of our algorithm is 0.126, which is low and acceptable.

[Figure 4.5: Average false discovery rate of the community detection algorithm for a simulated directed weighted network.]

To test the effectiveness of the proposed group analysis method, we test Algorithm 5 (Figure 4.3) on a group of ten directed networks with the same community structure. Each network has 64 nodes and 4 clusters, with each cluster having 16 nodes. Each entry of the association matrix is uniformly distributed between [0, 1]. The means of the intra-cluster connectivity strength in the four clusters are 0.3, 0.5, 0.7, and 0.9, respectively. The mean of the inter-cluster connectivity is 0.15. For each network, the standard deviation of all the edge values is randomly chosen from [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], which leads to the variation across the ten networks. Without loss of generality, we generate 50 simulations of networks to get the average false discovery rate. In addition, we compare our method with two standard approaches, i.e., VTS (averaging the association matrix) and IS (averaging the subject-specific community structure or majority voting). The average false discovery rate F for each method is shown in Table 4.1.

Table 4.1: Average false discovery rate for group analysis methods

             Proposed Algorithm    VTS        IS (average)    IS (voting)
  FDR        0.0175                0.0225     0.6254          0.0231
  Time (s)   141.1615              13.7438    141.9871        141.9870

We observe that the proposed algorithm outperforms the existing approaches with the lowest false discovery rate, which indicates that our proposed method can provide promising results for group analysis problems. In addition, our proposed method is slightly more computationally efficient than IS but less efficient than VTS. Due to the fact that VTS is applied to the averaged association matrix, it is only applied once instead of ten times. The IS approach consumes a large amount of time in extracting the community structure for each subject before obtaining a common structure, while our proposed algorithm spends time in computing the change of modularity for all subjects when combining small clusters.

4.4.3 Group analysis on EEG Data

In this chapter, we examined EEG data from a study containing the error-related negativity (ERN) as described in Chapter 1. Previous work indicates that there is increased information flow associated with the ERN for the theta frequency band (4 − 8 Hz) and the ERN time window of 25 − 75 ms for Error responses (ERN) compared to Correct responses (CRN) [122]. The EEG data collected are preprocessed by the spherical spline current source density (CSD) waveforms to sharpen event-related potential (ERP) scalp topographies and eliminate volume conduction [73]. In addition, a bandpass filter is used to obtain signals in the theta band.
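As an illustration of the band-limiting step mentioned above (not the exact pipeline or parameters used in this study), theta-band activity can be isolated with a zero-phase Butterworth filter; the sampling rate and filter order below are assumed values.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def theta_band(eeg, fs, low=4.0, high=8.0, order=4):
    """Zero-phase band-pass filter of a (channels x samples) array to 4-8 Hz."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, eeg, axis=-1)

# Example with an assumed 512 Hz sampling rate; random data stands in for EEG.
fs = 512.0
eeg = np.random.randn(61, 2 * int(fs))   # 61 channels, 2 s of data
theta = theta_band(eeg, fs)
```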
The effective connectivity quantified by the time-lagged DI is computed over a window corresponding to the ERN response (0 − 100 ms after the response), for all trials, between each pair of 61 electrodes in the theta band. The time-lagged DI is averaged over the information flow within a 10 − 20 ms time delay [59], i.e., $d = 5, \cdots, 10$ in equation (2.13). Once the connectivity matrices for each response type and each subject are obtained, we use Algorithm 5 to identify the community structure of each response type. Since the proposed clustering approach is multi-level, we give the clustering results at all levels for each response type. The results are shown in Figure 4.6 and Figure 4.7. We observe that both the CRN and ERN show hierarchical structures. The optimal number of partition levels for the CRN is 3, while for the ERN it is 4. Since the modularity for each response type achieves its maximum at the top (final) level, we interpret the partition at the top (final) level for each response type. The third level of partition for CRN has three large clusters, i.e., frontal, parietal, and some nodes in the central regions, which indicates that the frontal and parietal regions exchange information through the central nodes. The fourth level of partition for ERN also has three large modules, i.e., the left frontal-central region, the right frontal-central-parietal region, and the parietal region, which indicates that the frontal and parietal regions work together when an error occurs, and that the left and right sides of the brain work differently. The differences between CRN and ERN also imply that ERN has more large-scale (across different regions) interactions compared to CRN, while the information flow in CRN is more local or less integrated. These results are aligned with previous work which indicates that there is increased information flow associated with ERN for the theta frequency band (4 − 8 Hz) and the ERN time window 25 − 75 ms for Error responses compared to Correct responses, in particular between mPFC and lPFC regions [74].

Though we have the community structures for both CRN and ERN at different resolutions, there are still two issues that need to be addressed. First, we need to determine whether the obtained clusters are significantly different from those from a random network. In order to address this issue, the modularity of the community structure from the actual data needs to be compared with the modularity from random networks. Second, the modularity is known to have the problem of resolution limit for weighted graphs, and it may not be the best criterion for the evaluation of the obtained community structure of a weighted network [123]. Therefore, we propose to determine the optimal level of partition for each response type in terms of information flow.

[Figure 4.6: Applying the multi-subject community detection algorithm to 10 subjects for CRN. (a) The first level of partition for CRN. (b) The second level of partition for CRN. (c) The third level of partition for CRN.]

[Figure 4.7: Applying the multi-subject community detection algorithm to 10 subjects for ERN. (a) The first level of partition for ERN. (b) The second level of partition for ERN. (c) The third level of partition for ERN. (d) The fourth level of partition for ERN.]

Table 4.2: Comparison of modularity for CRN with that for random graphs

  Modularity                       Level 1    Level 2    Level 3
  CRN                              −0.0125    0.0678     0.0992
  Averaged across random graphs    −0.0512    −0.0027    0.0151

Table 4.3: Comparison of modularity for ERN with that for random graphs

  Modularity                       Level 1    Level 2    Level 3    Level 4
  ERN                              −0.0161    0.0587     0.0813     0.087
  Averaged across random graphs    −0.0526    −0.0053    0.0130     0.0170

To address the first problem, we compute the modularity from random graphs and compare it with the modularity obtained from the CRN and ERN. For comparison, random graphs that have the same number of nodes and weighted links as the original graphs are generated. To be specific, we keep the weight of each link unchanged and randomly reassign the links [32]. In this way, the differences in modularity between the real networks and the random networks are mainly due to the structure of the network. We generate 100 sets of random graphs for each response type, which have the same levels of partition as the original one, i.e., for the CRN the number of levels is 3 and for the ERN it is 4. The modularity from the original networks of both response types and the modularity averaged across random networks at different levels of partition are summarized in Tables 4.2 and 4.3. The modularity for the original networks is greater than that of the random graphs. In addition, the null hypothesis, i.e.,
the modularity obtained from the CRN and ERN is not different from the modularity of random graphs at each level of partition, is rejected at the α = 0.001 significance level. Therefore, the clusters obtained for each response type reveal significant differences from the equivalent random networks, which indicates significant modular structures for both response types.

Modularity may sometimes fail to determine the best partition for a weighted network. Therefore, we propose to determine the best partition in terms of information flow. We choose the partition which has high information flow inside the modules and low information flow between modules to represent the community structure of the whole group. In this way, we can also reveal the communities at different resolutions in terms of information flow. Suppose we have K clusters at partition level l; the ratio between intra-cluster and inter-cluster information flow is given as follows,
$$WIFL_k = \frac{1}{N_k^2}\sum_{i,j \in C_k} TA_{i,j}; \qquad BIFL_k = \frac{1}{K-1}\sum_{p=1, p \neq k}^{K}\ \sum_{i \in C_k, j \in C_p} \frac{TA_{i,j} + TA_{j,i}}{2 \times N_k \times N_p}; \qquad R = \frac{1}{K}\sum_{k=1}^{K}\frac{WIFL_k}{BIFL_k}, \qquad (4.7)$$
where $N_k$ is the number of nodes in cluster $C_k$, $k = 1, \cdots, K$, $TA = \frac{1}{L}\sum_{m=1}^{L} A_m$ where $A_m$ is the association matrix of the mth subject in the group and L is the number of subjects in that group (in our analysis L = 10), $WIFL_k$ is the information flow within cluster k, $BIFL_k$ is the averaged inter-cluster information flow between cluster k and the other clusters, and R is the averaged ratio of intra-cluster information flow to inter-cluster information flow over clusters. We compute the information flow ratio for both the CRN and ERN response types and compare this ratio with the one obtained from a random assignment of the cluster labels to each node for each response type. To be specific, for each level of partition, we keep the number of clusters and the number of elements in each cluster unchanged, but randomly assign each node a community label and compute the information flow ratio at each level. In this way, the differences in information flow within and among clusters between the real networks and the random networks at each specific level of partition are mainly due to the level-specific structure rather than the number of clusters and the number of nodes in each cluster. For each level of partition, we repeat the random assignment 100 times and the averaged information flow ratios across these random networks are shown in Figure 4.8. We observe that at each level of partition the information flow ratio obtained from the original community structure is higher than the ratio averaged across random graphs. In addition, for each level of partition, the hypothesis that there is no difference between the information flow ratio of the original community structure and a randomly assigned community structure is rejected at the α = 0.001 significance level. Therefore, the obtained structures at each level of partition for the brain networks are significantly different from random graphs, which is aligned with the results when using modularity for significance testing. We also observe that ERN has a slightly higher information flow ratio, i.e., stronger within-cluster information exchange compared to inter-cluster information exchange, which indicates stronger local activity or functional segregation compared to CRN.

[Figure 4.8: The information flow ratio of both response types and their corresponding random networks.]
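A minimal sketch of how the ratio in Eq. (4.7) and the label-permutation null described above can be computed is given below (Python/NumPy; the function names are ours and the code is illustrative, not the thesis implementation).

```python
import numpy as np

def information_flow_ratio(TA, labels):
    """Average ratio R of within- to between-cluster information flow, Eq. (4.7).

    TA is the group-averaged association matrix; labels gives the community
    label of each node at the chosen level of partition."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    K = len(clusters)
    ratios = []
    for k in clusters:
        in_k = np.flatnonzero(labels == k)
        wifl = TA[np.ix_(in_k, in_k)].sum() / len(in_k) ** 2        # WIFL_k
        bifl = 0.0
        for p in clusters:
            if p == k:
                continue
            in_p = np.flatnonzero(labels == p)
            cross = TA[np.ix_(in_k, in_p)].sum() + TA[np.ix_(in_p, in_k)].sum()
            bifl += cross / (2.0 * len(in_k) * len(in_p))
        bifl /= (K - 1)                                             # BIFL_k
        ratios.append(wifl / bifl)
    return np.mean(ratios)                                          # R

def permutation_null(TA, labels, n_perm=100, seed=0):
    """Ratios obtained after randomly re-assigning the community labels while
    keeping the number and the sizes of the clusters fixed."""
    rng = np.random.default_rng(seed)
    return [information_flow_ratio(TA, rng.permutation(labels)) for _ in range(n_perm)]
```

Because the permutation keeps the number and sizes of the clusters fixed, differences between the observed ratio and the null distribution reflect the level-specific structure, as discussed above.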
On the other hand, for the comparison of the best representation of CRN and ERN in terms of information flow, we choose the second level of partition for both groups (Figure 4.6(b) and Figure 4.7(b)), because the information flow ratio of both response types reaches a local maximum at level 2. We observe that at this level the ERN group and the CRN group have different clusterings (Figure 4.6(b) and Figure 4.7(b)) in the frontal and central-parietal regions. The frontal and central-parietal regions around the cerebral midline are not in the same cluster for ERN, which shows the functional specialization of the frontal and central-parietal regions around the cerebral midline, whereas for CRN that specialization does not exist. On the other hand, the right lateral frontal and central-parietal regions are in the same cluster for ERN, which is contrary to the CRN. These results are aligned with the previous work in [75], which shows that error processing is controlled by the communication between the lateral prefrontal cortex and the medial prefrontal cortex.

4.5 Conclusions

In this chapter, we propose a method to identify modules in the effective brain network. In order to achieve this goal, first, we applied the directed information measure to EEG data involving a study of ERN to obtain the association matrix of the network. Directed information can effectively detect the nonlinear causal relationships between EEG signals, which is the basis for obtaining a reliable community structure. In addition, we extended a modularity based community detection algorithm proposed by Blondel et al. to weighted directed networks. Compared to current community detection methods, our proposed algorithm discovers the actual structure of a directed network by employing the edge direction information. Finally, we proposed a group analysis method to obtain a common community structure across subjects to address the problem of variability across subjects. We extended the idea of modularity optimization from a single subject to a group of subjects. This strategy decreases the effect of outliers without making any assumptions about the data. The proposed group analysis method has higher accuracy in community detection than standard approaches, such as VTS and IS. It is also applied to EEG data and is shown to discriminate between error and correct responses in terms of the community structures obtained.

The proposed algorithm is based on the optimization of modularity. However, modularity optimization encounters the problem of resolution limit, which indicates that it may miss detecting clusters whose size is comparatively small with respect to the whole graph [39]. Therefore, it would be of interest to investigate and extend methods that do not depend on modularity optimization, e.g., random walks, to find the common communities across a group of weighted directed networks. In addition, one can consider overlapping communities by extending the current framework to consider multiple community memberships. Finally, this work can be extended to dynamic networks to detect the change of modules across time and frequency [102, 124].

Chapter 5

SUMMARY AND FUTURE WORK

This thesis discusses the problem of revealing the underlying structure of complex networks from multivariate time series.
In particular, we focus on quantifying the pairwise causal relationships, inferring the topological structure, and detecting functional modules of a complex network. 5.1 Summary The first part of this thesis focuses on finding a proper measure to quantify the interaction and causality between two random processes. To achieve this goal, we tried to answer the following questions: (1) What are some suitable measures? (2) What are the problems in the implementation and application to real data with limited sample size? Motivated by these questions, we introduced directed information and illustrated its relationship with existing measures, such as Granger causality, mutual information, and transfer entropy. The implementation of DI requires large sample sizes and is very time consuming. To reduce the computational complexity of computing DI while still quantifying the causal dependencies, we derived simplified expressions of DI, i.e., the time-lagged directed information and modified directed information. We also showed the relationship between modified directed information and transfer entropy. The proposed expressions are shown to be more efficient than the original DI in terms of computational complexity and capture more causal dependencies than short time DI without considering time lags. Moreover, we compared the performance of DI with model based measures such as Granger causality on different realistic signal models, and DI is found to be applicable to a wider range of signal types. In addition, we developed a new directed information estimation method based on multi-information and provided a quantitative comparison of various DI estimation methods. 113 Quantifying the pairwise causal relationships does not reflect the true topological structure of a system. Two nodes which have a high DI value may influence each other indirectly through a third node. In the second part of our proposed work, we pointed out the drawback of only using directed information for network inference, i.e., using DI alone can not distinguish direct causality from indirect causality. Therefore, the conditional directed information is applied to address this problem. Moreover, time-lagged conditional directed information and the modified conditional directed information are proposed to reduce the computational complexity. Three algorithms are proposed for network inference and the results for both linear and nonlinear networks are shown. The results of our algorithms on simulated data show that the combination of modified DI and CDI can effectively increase the accuracy of network inference. The proposed network inference algorithms are able to detect the organization of any collection of triplets in a network, but may fail to reflect the dynamics between a large group of nodes or signals because of the limitation of computational complexity. In the last part of our proposed work, we aim to simplify the inferred networks as well as to determine the functional modules that result in the observed connectivity patterns by developing community detection algorithms for directed networks. We introduced a hierarchical community detection algorithm to discover the modules in a complex weighted directed network. In addition, we proposed a group analysis method to obtain a common community structure across subjects to address the problem of variability across subjects in neurophysiological study. We applied the proposed framework to an EEG data set collected during a study of cognitive control networks in the brain. 
In particular, we looked at a data set of subjects involved in error-processing. The proposed method is applied to both synthetic data and EEG data and is shown to discriminate between error and correct responses in terms of the community structures obtained. 114 5.2 Future work There are still remaining challenges with the application of causality measures to real multivariate data and the inference of network structure. Some of these challenges include: • The extension and evaluation of the proposed community detection algorithm. Although the proposed community detection algorithm has promising performance, there are some issues remaining to be addressed. First, both theoretical and practical development may be needed to explain which representation (directed or undirected graph) is more suitable for revealing the actual module structure of the network in different scenarios. For example, when the association matrix quantified by effective connectivity is approximately symmetric, the results of community detection based on functional and effective connectivity will be similar to each other. Second, one node can belong to more than one community, therefore, it might be interesting to extend the algorithm to uncover the overlapping community structure of a network. Third, the community structure of the brain network may change over time, therefore, community structure for a group of time-dependent, multi-scale networks are needed. Mucha et al. developed a generalized framework for network quality function that detects the community structure in time-dependent multi-scale and multiplex networks [125]. This approach may be extended to find the community structure of the effective brain network over time. • The group analysis proposed in Chapter 4 may also be applied to the network inference problem. The proposed network inference algorithm was applied to synthetic data, but when applied to the EEG data, the results are hard to interpret because of the variability across subjects. Therefore, instead of reconstructing the brain network for each subject, it is possible to reconstruct one network for the whole group. For example, we can keep the common strong connections and remove the common indirect connections for a group. In addition, the accuracy of the inference algorithm can 115 also be improved by considering more complex scenarios. Node X may influence Y through nodes Z and W and it is hard to remove the false positive connection between X and W when only computing DI(X → Y||Z) or DI(X → Y||W) , because the conditional directed information will equal to zero when considering both nodes, i.e., DI(X → Y||ZW). However, considering more nodes means higher dimensionality and computational complexity compared to only considering one node. Therefore, more reliable estimation of conditional directed information is needed. This problem can be addressed by either using parametric density models or improving existing mutual information and entropy estimators. In addition, the performance of MCDI and TLCDI also depend on the estimated order of the model. Recently, Faes et al. proposed a sequential procedure to determine the embedding dimension of multivariate series [80]. This method is based on an information-theoretic technique and shows promising performance for various signal models, which may be extended to MCDI and TLCDI computation in the future. 
116 APPENDICES 117 Appendix A CONDITIONAL DIRECTED INFORMATION IN TWO GENERAL TRIVARIATE MODELS Two general models for triplets connected in a hub and chain pattern are given respectively below: Model 1: Hub Model Xn = f (Xn−p1 , · · · , Xn−1 ) + un , Yn = g(Xn−p2 , · · · , Xn−1 , Yn−p3 , · · · , Yn−1 ) + vn , (A.1) Zn = h(Xn−p4 , · · · , Xn−1 , Zn−p5 , · · · , Zn−1 ) + wn , where X causes Y and Z (X is the hub), respectively. f (·), g(·), h(·) are three different functions, p2 (p4 ) is the maximum time lag between X and Y (X and Z), p1 , p3 , and p5 are the order of time series X, Y, and Z respectively. un , vn and wn are white Gaussian noises and are independent of the signals. Model 2: Chain Model Xn = f (Xn−p1 , · · · , Xn−1 ) + un , Yn = g(Xn−p2 , · · · , Xn , Yn−p3 , · · · , Yn−1 ) + vn , (A.2) Zn = h(Yn−p4 , · · · , Yn , Zn−p5 , · · · , Zn−1 ) + wn , where X causes Y, and Y causes Z (Y is the intermediate node). f (·), g(·), h(·) are three different functions, p2 (p4 ) is the maximum time lag between X and Y (Y and Z), p1 , p3 , and p5 are the order of time series X, Y, and Z respectively. un , vn and wn are white Gaussian noises and are independent of the signals. We compare two causally conditional directed information for both models, i.e., DI(Y N → Z N ||X N ) and DI(X N → Z N ||Y N ). 118 • For hub case: DI(Y N → Z N ||X N ) = = = N ∑ I(Y n ; Zn |Z n−1 X n ) n=1 N ∑ [H(Zn |Z n−1 X n ) − H(Zn |Y n Z n−1 X n )] n=1 N ∑ [H(h(Xn−p4 , · · · , Xn , Zn−p5 , · · · , Zn−1 ) + wn |Z n−1 X n ) n=1 − H(h(Xn−p4 , · · · , Xn , Zn−p5 , · · · , Zn−1 ) + wn |X n Z n−1 Y n )] = = N ∑ [H(wn |Z n−1 X n ) − H(wn |X n Z n−1 Y n )] n=1 N ∑ [H(wn ) − H(wn )] = 0. n=1 (A.3) where the forth equality comes from that H(Y |X1 , · · · , Xn ) = H(Z|X1 , · · · , Xn ) when ∑n Y = i=1 Xi + Z [126], and the fifth equality is due to the fact that noise wn is independent of signals. DI(X N → Z N ||Y N ) = = N ∑ I(X n ; Zn |Z n−1 Y n ) n=1 N ∑ [H(Zn |Z n−1 Y n ) − H(Zn |X n Z n−1 Y n )], (A.4) n=1 ≥ 0. where the inequality is true since conditioning reduces entropy and the equality holds only when X n and Y n are identical. Therefore, DI(X N → Z N ||Y N ) > 0 = DI(Y N → Z N ||X N ). • For chain case: 119 (A.5) DI(Y N → Z N ||X N ) = = N ∑ I(Y n ; Zn |Z n−1 X n ) n=1 N ∑ [H(Zn |Z n−1 X n ) − H(Zn |X n Z n−1 Y n )] (A.6) n=1 ≥ 0. DI(X N → Z N ||Y N ) = = = N ∑ I(X n ; Zn |Z n−1 Y n ) n=1 N ∑ [H(Zn |Z n−1 Y n ) − H(Zn |Y n Z n−1 X n )] n=1 N ∑ [H(h(Yn−p4 , · · · , Yn , Zn−p5 , · · · , Zn−1 ) + wn |Z n−1 Y n ) n=1 − H(h(Yn−p4 , · · · , Yn , Zn−p5 , · · · , Zn−1 ) + wn |Y n Z n−1 X n )] = = N ∑ [H(wn |Z n−1 Y n ) − H(wn |Y n Z n−1 X n )] n=1 N ∑ [H(wn ) − H(wn )] = 0. n=1 (A.7) Therefore, DI(Y N → Z N ||X N ) > 0 = DI(X N → Z N ||Y N ). 120 (A.8) Appendix B COMPUTATION OF TIME-LAGGED CONDITIONAL DIRECTED INFORMATION In this Appendix, we compute the time-lagged conditional directed information for both hub and chain models. Model 1: X interacts with Y and Z with different time delays, Xi = ui , Yi = bXi−m + vi , (B.1) Zi = cXi−n + wi . 
Therefore, E(Xk ) = 0, V ar(Xk ) = σ 2 , Cov(Xk Xk+1 ) = 0, E(Yk ) = 0, V ar(Yk ) = (b2 + 1)σ 2 , Cov(Yk Yk+1 ) = 0, E(Zk ) = 0, V ar(Zk ) = (c2 + 1)σ 2 , Cov(Zk Zk+1 ) = 0,   2 bσ , l = m  Cov(Xk−l Yk ) = E[Xk−l (bXk−m + vk )] =  0, otherwise    2 bσ , l = m − 1  Cov(Xk−l Yk+1 ) = E[Xk−l (bXk+1−m + vk+1 )] =  0, otherwise    2 cσ , l = m  Cov(Xk−l Zk+n−m ) = E[Xk−l (cXk+n−m−n + wk+n−m )] =  0, otherwise    2 cσ , l = m − 1  Cov(Xk−l Zk+n−m+1 ) = E[Xk−l (cXk+n−m+1−n + wk+n−m+1 )] =  0, otherwise  121   2 bσ , l = m + 1  Cov(Xk−l+1 Yk ) = E[Xk−l+1 (bXk−m + vk )] =  0, otherwise    2 bσ , l = m  Cov(Xk−l+1 Yk+1 ) = E[Xk−l (bXk+1−m + vk+1 )] =  0, otherwise    2 cσ , l = m + 1  Cov(Xk−l+1 Zk+n−m ) = E[Xk−l+1 (cXk+n−m−n + wk+n−m )] =  0, otherwise    2 cσ , l = m  Cov(Xk−l+1 Zk+n−m+1 ) = E[Xk−l+1 (cXk+n−m+1−n + wk+n−m+1 )] =  0, otherwise  Cov(Yk Zk+n−m ) = E[(bXk−m + vk )(cXk+n−m−n + wk+n−m )] = bcσ 2 , Cov(Yk Zk+n−m+1 ) = E[(bXk−m + vk )(cXk+n−m+1−n + wk+n−m+1 )] = 0, Cov(Yk+1 Zk+n−m ) = E[(bXk+1−m + vk+1 )(cXk+n−m−n + wk+n−m )] = 0, Cov(Yk+1 Zk+n−m+1 ) = E[(bXk+1−m + vk+1 )(cXk+n−m+1−n + wk+n−m+1 )] = bcσ 2 . (B.2) Then we get the covariance matrix and obtain the CDI values for different values of l. (1) For l = m, |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m Zk+n−m+1 | = σ 12 , |Xk−l Yk | = σ 4 , |Xk−l Zk+n−m | = σ 4 , |Xk−l Xk−l+1 Zk+n−m Zk+n−m+1 | = σ 8 , |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m | = σ 10 , |Xk−l Yk Zk+n−m | = σ 6 , |Xk−l Xk−l+1 Zk+n−m | = σ 6 , CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = 0. 122 (B.3) (2) For l = m − 1, |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m Zk+n−m+1 | = (b2 + c2 + 1)σ 12 , |Xk−l Yk | = (b2 + 1)σ 4 , |Xk−l Zk+n−m | = (c2 + 1)σ 4 , |Xk−l Xk−l+1 Zk+n−m Zk+n−m+1 | = (c2 + 1)σ 8 , |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m | = (b2 + c2 + 1)σ 10 , |Xk−l Yk Zk+n−m | = (b2 + c2 + 1)σ 6 , |Xk−l Xk−l+1 Zk+n−m | = (c2 + 1)σ 6 , 1 (b2 + 1)(c2 + 1) CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log . 2 b2 + c2 + 1 (3) For l = m + 1, |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m Zk+n−m+1 | = (b2 + c2 + 1)σ 12 , |Xk−l Yk | = (b2 + 1)σ 4 , |Xk−l Zk+n−m | = (c2 + 1)σ 4 , |Xk−l Xk−l+1 Zk+n−m Zk+n−m+1 | = (c2 + 1)σ 8 , |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m | = (b2 + 1)σ 10 , |Xk−l Yk Zk+n−m | = (b2 + c2 + 1)σ 6 , |Xk−l Xk−l+1 Zk+n−m | = σ 6 , CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log (b2 + 1)(c2 + 1) . b2 + c2 + 1 (B.4) (4) For l ̸= m − 1, m, m + 1, |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m Zk+n−m+1 | = (b2 + c2 + 1)2 σ 12 , |Xk−l Yk | = (b2 + 1)σ 4 , |Xk−l Zk+n−m | = (c2 + 1)σ 4 , |Xk−l Xk−l+1 Zk+n−m Zk+n−m+1 | = (c2 + 1)2 σ 8 , |Xk−l Xk−l+1 Yk Yk+1 Zk+n−m | = (b2 + 1)(b2 + c2 + 1)σ 10 , |Xk−l Yk Zk+n−m | = (b2 + c2 + 1)σ 6 , |Xk−l Xk−l+1 Zk+n−m | = (c2 + 1)σ 6 , CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log (b2 + 1)(c2 + 1) . b2 + c2 + 1 (B.5) The time-lagged conditional directed information between Y and Z given X can be 123 simplified as: CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 )  1  log (b2 +1)(c2 +1) , l = m − 1  2  (b2 +c2 +1)   = 0, l = m      log (b2 +1)(c2 +1) , l ≥ m + 1 or l < m − 1  2 2 (B.6) (b +c +1) Model 2: X causes Y, and Y causes Z, Xi = ui , Yi = bXi−m + vi , Zi = cYi−(n−m) + wi . . 
E(Xk ) = 0, V ar(Xk ) = σ 2 , Cov(Xk Xk+1 ) = 0, E(Yk ) = 0, V ar(Yk ) = (b2 + 1)σ 2 , Cov(Yk Yk+1 ) = 0, E(Zk ) = 0, V ar(Zk ) = (c2 + 1)σ 2 , Cov(Zk Zk+1 ) = 0,   2 bσ , l = m  Cov(Xk Yk+l ) = E[Xk (bXk+l−m + vk+l )] =  0, Otherwise    2 bσ , l = m − 1  Cov(Xk Yk+l+1 ) = E[Xk (bXk+l+1−m + vk+l+1 )] =  0, Otherwise  Cov(Xk Zk+m+n ) = E[Xk (cYk+n+m−n + wk+n+m )] = bcσ 2 , Cov(Xk Zk+m+n+1 ) = E[Xk (cXk+n+m+1−n + wk+n+m+1 )] = 0,   2 bσ , l = m + 1  Cov(Xk+1 Yk+l ) = E[Xk+1 (bXk+l−m + vk+l )] =  0, Otherwise    2 bσ , l = m  Cov(Xk+1 Yk+l+1 ) = E[Xk+1 (bXk+l+1−m + vk+l+1 )] =  0, Otherwise  124 (B.7)   2 c(b + 1)σ 2 , l = m  Cov(Yk+l Zk+n+m ) =  0, Otherwise    2 c(b + 1)σ 2 , l = m + 1  Cov(Yk+l Zk+n+m+1 ) =  0, Otherwise    2 c(b + 1)σ 2 , l = m − 1  Cov(Yk+l+1 Zk+n+m ) =  0, Otherwise    2 c(b + 1)σ 2 , l = m  Cov(Yk Zk+n−m+1 ) = E  0, Otherwise  (B.8) Then we get the covariance matrix and obtain the CDI values for different values of l. (1) For l = m, |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = σ 12 , |Yk+l Zk+m+n | = (b2 + 1)σ 4 , |Xk Yk+l | = σ 4 , |Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)2 σ 8 , |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m | = σ 10 , |Xk Yk+l Zk+n+m | = σ 6 , |Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)σ 6 , CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = 0. 125 (B.9) (2) For l = m − 1, |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)(c2 + 1)σ 12 , |Yk+l Zk+m+n | = (b2 + 1)(b2 c2 + c2 + 1)σ 4 , |Xk Yk+l | = (b2 + 1)σ 4 , |Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)2 (b2 c2 + c2 + 1)σ 8 , |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)σ 10 , (B.10) |Xk Yk+l Zk+n+m | = (b2 + 1)(c2 + 1)σ 6 , |Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)2 σ 6 , (b2 c2 + c2 + 1) CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log . c2 + 1 (3) For l = m + 1, |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)(c2 + 1)σ 12 , |Yk+l Zk+m+n | = (b2 + 1)(b2 c2 + c2 + 1)σ 4 , |Xk Yk+l | = (b2 + 1)σ 4 , |Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)2 (b2 c2 + c2 + 1)σ 8 , |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)(c2 + 1)σ 10 , |Xk Yk+l Zk+n+m | = (b2 + 1)(c2 + 1)σ 6 , |Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)2 (b2 c2 + c2 + 1)σ 6 , 1 (b2 c2 + c2 + 1) CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log . 2 c2 + 1 (B.11) 126 (4) For l ̸= m − 1, m, m + 1, |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)2 (c2 + 1)2 σ 12 , |Yk+l Zk+m+n | = (b2 + 1)(b2 c2 + c2 + 1)σ 4 , |Xk Yk+l | = (b2 + 1)σ 4 , |Yk+l Yk+l+1 Zk+n+m Zk+n+m+1 | = (b2 + 1)2 (b2 c2 + c2 + 1)2 σ 8 , (B.12) |Xk Xk+1 Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)2 (c2 + 1)σ 10 , |Xk Yk+l Zk+n+m | = (b2 + 1)(c2 + 1)σ 6 , |Yk+l Yk+l+1 Zk+n+m | = (b2 + 1)2 (b2 c2 + c2 + 1)σ 6 , (b2 c2 + c2 + 1) CDIk (Yk Yk+1 → Zk+n−m Zk+n−m+1 |Xk−l Xk−l+1 ) = log . c2 + 1 The time-lagged conditional directed information between X and Z given Y can be simplified as:   log b2 c2 +c2 +1 , l ≤ m − 1 or l > m + 1    (c2 +1)   DIk (Xk Xk+1 → Zk+m+n Zk+m+n+1 |Yk+l Yk+l+1 ) = 0, l = m     1  log (b2 c2 +c2 +1) , l = m + 1  2 2 (c +1) (B.13) 127 BIBLIOGRAPHY 128 BIBLIOGRAPHY [1] V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, p. P10008, 2008. [2] P. Mathai, N. C. Martins, and B. Shapiro, “On the detection of gene network interconnections using directed mutual information,” in Information Theory and Applications Workshop, 2007, pp. 274–283. [3] A. Rao, A. O. Hero III, D. J. States, and J. D. 