HIGHER-ORDER DATA REDUCTION THROUGH CLUSTERING, SUBSPACE ANALYSIS AND COMPRESSION FOR APPLICATIONS IN FUNCTIONAL CONNECTIVITY BRAIN NETWORKS

By

Alp Ozdemir

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2017

ABSTRACT

HIGHER-ORDER DATA REDUCTION THROUGH CLUSTERING, SUBSPACE ANALYSIS AND COMPRESSION FOR APPLICATIONS IN FUNCTIONAL CONNECTIVITY BRAIN NETWORKS

By

Alp Ozdemir

With the recent advances in information technology, collection and storage of higher-order datasets, such as multidimensional data across multiple modalities or variables, have become much easier and cheaper than ever before. Tensors, also known as multiway arrays, provide natural representations for higher-order datasets and make it possible to analyze them while preserving the multilinear relations in the data. These higher-order datasets usually contain a large amount of redundant information, and summarizing them in a succinct manner is essential for better inference. However, existing data reduction approaches are limited to vector-type data and cannot be applied directly to tensors without vectorization. Developing more advanced approaches to analyze tensors effectively without corrupting their intrinsic structure is an important challenge facing Big Data applications.

This thesis addresses the issue of data reduction for tensors with a particular focus on providing a better understanding of dynamic functional connectivity networks (dFCNs) of the brain. Functional connectivity describes the relationship between spatially separated neuronal groups, and analysis of dFCNs plays a key role in interpreting complex brain dynamics in different cognitive and emotional processes. Recently, graph theoretic methods have been used to characterize brain functionality, where bivariate relationships between neuronal populations are represented as graphs or networks. In this thesis, the changes in these networks across time and subjects will be studied through tensor representations.

In Chapter 2, we address a multi-graph clustering problem, which can be thought of as a tensor partitioning problem. We introduce a hierarchical consensus spectral clustering approach to identify the community structure underlying functional connectivity brain networks across subjects. New information-theoretic criteria are introduced for selecting the optimal community structure. The effectiveness of the proposed algorithms is evaluated through a set of simulations comparing with existing methods as well as on FCNs across subjects.

In Chapter 3, we address the online tensor data reduction problem from a subspace tracking perspective. We introduce a robust low-rank + sparse structure learning algorithm for tensors to separate the low-rank community structure of connectivity networks from sparse outliers. The proposed framework is used to both identify change points, where the low-rank community structure changes significantly, and summarize this community structure within each time interval.

Finally, in Chapter 4, we introduce a new multi-scale tensor decomposition technique to efficiently encode nonlinearities due to rotation or translation in tensor type data.
In particular, we develop a multi-scale higher-order singular value decomposition (MS-HoSVD) approach where a given tensor is first permuted and then partitioned into several subtensors, each of which can be represented as a low-rank tensor, increasing the efficiency of the representation. We derive a theoretical error bound for the proposed approach as well as provide an analysis of memory cost and computational complexity. The performance of the proposed approach is evaluated on both data reduction and classification of various higher-order datasets.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
    1.1  Data Reduction by Clustering
    1.2  Higher-order Data Decomposition
    1.3  Tensor Subspace Tracking
    1.4  Tensor Based Approaches in Neuroscience Applications
    1.5  Organization of the Dissertation

Chapter 2  Hierarchical Spectral Consensus Graph Clustering for Group Analysis of Functional Brain Networks
    2.1  Introduction
    2.2  Background
        2.2.1  Time-Varying Measure of Phase Synchrony
        2.2.2  Graph Theory
        2.2.3  Spectral Clustering
        2.2.4  Consensus Clustering
        2.2.5  Modularity
        2.2.6  Cohen's Kappa
    2.3  Information Theoretic Cluster Quality Measure
        2.3.1  Inter and Intra Edge Distribution
        2.3.2  Homogeneity and Completeness
    2.4  Fiedler Consensus Clustering Approach
    2.5  Results
        2.5.1  Quality versus Modularity
        2.5.2  Evaluation of FCCA for Varying Inter-cluster Strength
        2.5.3  Robustness to Outlier Graphs
        2.5.4  Detecting Overlapping Communities
        2.5.5  Community Structure of the Brain During Error-Related Negativity
    2.6  Conclusions

Chapter 3  Recursive Robust Low-rank + Sparse Structure Learning for Dynamic Tensors
    3.1  Introduction
    3.2  Background
        3.2.1  Robust Principal Component Analysis
        3.2.2  Tensor Algebra & Tensor Decompositions
        3.2.3  Time-Varying Measure of Phase Synchrony
        3.2.4  Consensus Clustering and Fiedler Consensus Clustering Approach
    3.3  Higher-order Recursive Low-Rank + Sparse Structure Learning (Ho-RLSL)
        3.3.1  Problem Statement
        3.3.2  Algorithm Description
        3.3.3  Slowly Changing Subspace & Change Points
        3.3.4  Computational Complexity
    3.4  Results
        3.4.1  Simulated Networks
        3.4.2  Effect of Network Size on Computation Time and Performance
        3.4.3  EEG Data
    3.5  Conclusions

Chapter 4  Multiscale Analysis for Higher-order Tensors
    4.1  Introduction
    4.2  Background
        4.2.1  Tensor Notation and Algebra
        4.2.2  Some Useful Facts Concerning Mode-n Products and Orthogonality
        4.2.3  Higher Order Singular Value Decomposition (HoSVD)
    4.3  Multiscale Analysis of Higher-order Datasets
        4.3.1  Memory Cost of the First Scale Decomposition
        4.3.2  Computational Complexity
        4.3.3  A Linear Algebraic Representation of the Proposed Multiscale HoSVD Approach
        4.3.4  Adaptive Pruning in Multiscale HoSVD for Improved Performance
    4.4  Data Reduction
        4.4.1  Datasets
            4.4.1.1  PIE dataset
            4.4.1.2  COIL-100 dataset
            4.4.1.3  The Cambridge Hand Gesture Dataset
        4.4.2  Data Reduction Experiments
        4.4.3  Data Reduction with Adaptive Tree Pruning
    4.5  Feature Extraction and Classification
        4.5.1  COIL-100 Image Dataset
        4.5.2  The Cambridge Hand Gesture Dataset
        4.5.3  Classification Experiments
            4.5.3.1  Training
            4.5.3.2  Testing
    4.6  Applications on fMRI
        4.6.1  Data Description and Preprocessing
        4.6.2  Results
    4.7  Conclusions

Chapter 5  Conclusions and Future Work
    5.1  Conclusions
    5.2  Future Work

APPENDIX

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Agreement table for the observers used for computing the kappa score
Table 2.2: Average Cohen's Kappa and average standard error for identifying community structure in simulated 4-community networks with varying inter-cluster strength
Table 2.3: Average Cohen's Kappa and average standard error for identifying communities in a group of networks with outliers
Table 2.4: Average Cohen's Kappa and average standard error for identifying overlapping clusters in simulated networks
Table 2.5: Mean and standard deviation of the quality metric (U) computed to quantify the consistency between the group's community structure and individual subjects' community structure for each subject, each response type and the three consensus clustering methods
Table 2.6: Computation time for community structures obtained by the FCCA, Averaging and Voting methods
Table 3.1: Average MSE over time computed for low-rank components during detected time intervals obtained by Ho-RLSL and HoSVD for modular network structure under varying noise sparsity levels
Table 3.2: Average MSE over time computed for low-rank components during detected time intervals obtained by Ho-RLSL and HoSVD for hierarchical modular network structure under varying noise sparsity levels
Table 3.3: Average MSE over time computed for low-rank components during detected time intervals obtained by Ho-RLSL and HoSVD for network structure with overlapping modules under varying noise sparsity levels
Table 3.4: Average computation time for Ho-RLSL for dynamic tensors containing 64 × 64 and 128 × 128 networks, with average MSE over time computed for low-rank components during detected time intervals
Table 3.5: Detected ERN and CRN intervals by Ho-RLSL and HoSVD
Table 4.1: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the PIE dataset
Table 4.2: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the COIL-100 dataset
Table 4.3: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the Hand Gesture dataset
Table 4.4: 1NN classification results for the COIL-100 dataset over 20 trials with N_f = 100
Table 4.5: 1NN classification results for the hand gesture dataset over 20 trials with N_f = 200
Table 4.6: Average compression ratio (mean ± st. dev.) and reconstruction error (mean ± st. dev.) obtained by MS-HoSVD, HoSVD and 4-D Wavelet over 20 subjects
Table 4.7: Comparisons of the probability of miss (P_Miss) and probability of false alarm (P_FA) obtained by MS-HoSVD, HoSVD and 4-D Wavelet

LIST OF FIGURES

Figure 2.1: An illustration of variations in homogeneity and completeness in a group of objects where the true number of communities is 3; a) High homogeneity and low completeness; b) High completeness and low homogeneity; c) High homogeneity and completeness
Figure 2.2: Average of the quality metric (U) and modularity (Q) measure corresponding to defined clustering structures in a 3-cluster network, with respect to the ratio of the number of nodes in cluster 1 to the number of nodes in the rest of the network, over 100 trials
Figure 2.3: Cluster structures obtained by FCCA: a) Error (k = 10); b) Correct (k = 7)
Figure 2.4: Cluster structures obtained by a) Averaging, Error (k = 9) and Correct (k = 5); b) Voting, Error (k = 9) and Correct (k = 5)
Figure 2.5: (a) Average phase synchrony between clusters for Error response (k = 10) and Correct response (k = 7); (b) t-values for each inter-modular bivariate pair relative to the grand mean of the inter-modular pairs of ERN and CRN conditions; and (c) corresponding p-values
Figure 2.6: Hierarchical structure obtained by FCCA for a) Error; b) Correct responses
Figure 3.1: Illustration of network structures: (a) Modular network, (b) Hierarchical modular network, (c) Overlapping modules
Figure 3.2: Average of (a) ERN and (b) CRN waveforms and the detected change points corresponding to the connectivity mode
Figure 3.3: Network structures for the low-rank components of ERN networks obtained by Ho-RLSL: (a) pre-ERN, (b) ERN, (c) post-ERN
Figure 3.4: Network structures for the low-rank components of CRN networks obtained by Ho-RLSL: (a) pre-CRN, (b) CRN, (c) post-CRN
Figure 3.5: Network structures for the low-rank components of ERN networks obtained by HoSVD: (a) pre-ERN, (b) ERN, (c) post-ERN
Figure 4.1: Sample frames from the PIE dataset corresponding to the 30th (left) and 80th (right) frames
Figure 4.2: Image samples of four different objects from the COIL-100 dataset from varying pose angles (from 0° to 240° with 60° increments)
Figure 4.3: Illustration of nine different classes in the Cambridge Hand Gesture Dataset
Figure 4.4: Compression rate versus normalized reconstruction error for MS-HoSVD (dark blue), HoSVD (light blue), H-Tucker (green) and T-Train (yellow) for a) PIE, b) COIL-100 and c) Hand Gesture datasets. Starting from the left in each of (a), (b) and (c), the first two compression rates correspond to 1-scale MS-HoSVD with τ = 0.7 and τ = 0.75 while the last two are obtained from the 2-scale approximation with τ = 0.7 and τ = 0.75, respectively. MS-HoSVD provides lower error than HoSVD, H-Tucker and T-Train
Figure 4.5: A single frame of the PIE dataset showing increasing accuracy with scale
Figure 4.6: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the PIE dataset. The top-left and top-right images are a sample frame reconstructed using only the 0th scale and using the 0th and 1st scales, respectively. The bottom-left image is a sample frame reconstructed using the 2-scale approximation with all of the subtensors, and the bottom-right image is the reconstruction from the 2-scale analysis with the pruning approach where λ = 0.25
Figure 4.7: Compression rate versus normalized reconstruction error for MS-HoSVD with adaptive pruning (dark blue), HoSVD (light blue), H-Tucker (green) and T-Train (yellow) for a) PIE, b) COIL-100 and c) Hand Gesture datasets. 2-scale MS-HoSVD tensor approximations are obtained using τ = 0.7 for each scale and varying the pruning trade-off parameter λ
Figure 4.8: Reconstructed frame samples from PIE data compressed by T-Train (top-left), H-Tucker (top-right), HoSVD (bottom-left) and pruned MS-HoSVD with 2 scales (bottom-right). The 2-scale MS-HoSVD tensor approximation is obtained using τ = 0.7 for each scale and λ = 0.25

LIST OF ALGORITHMS

Algorithm 2.1: Fiedler Consensus Clustering Algorithm
Algorithm 2.2: Fiedler Partition
Algorithm 3.1: Higher-order Recursive Low-Rank + Sparse Structure Learning
Algorithm 3.2: Delete Direction
Algorithm 3.3: Add Direction
Algorithm 4.1: Multiscale HoSVD
Algorithm 4.2: Multiscale HoSVD with Adaptive Pruning

Chapter 1

Introduction

With the recent advances in information technology, collection and storage of higher-order datasets such as multidimensional data with multiple aspects have become much easier and cheaper than ever before. These higher-order datasets bring with them the problems of interpreting the data and extracting useful information. Tensors, also known as multiway arrays, provide natural representations for higher-order datasets and enable us to analyze them by preserving multilinear relations among the different modes. Tensor type data appears in many applications, including computer vision, where grey level images and image sequences can be represented as two- and three-dimensional datasets, respectively; neuroimaging, where signals such as electroencephalogram (EEG) recordings across multiple subjects, channels and experimental conditions, as well as multimodal images, can be represented as high order tensors [1–3]; and hyperspectral imaging, where the images or videos obtained through remote sensing can be represented as 3-way and 4-way tensors [4, 5].

These higher-order tensors are often very high-dimensional and contain a large amount of redundant information [6]. Summarizing these huge datasets in a succinct manner is essential for better inference. Data reduction can be performed in several different ways. Clustering approaches provide a compact way of summarizing the data which can be processed quickly and interpreted easily [7].
On the other hand, linear or nonlinear mapping of the high-dimensional data onto lower dimensional subspaces or manifolds provides direct reduction of data dimensionality [8]. Approaches such as PCA/SVD for linear low-rank subspace approximation and manifold learning techniques for nonlinear approximation are well developed for vector type data. However, these dimensionality reduction techniques become inadequate when dealing with higher-order tensors. Applying vector based algorithms to tensors requires vectorization, which results in extremely long vectors and breaks the natural multilinear structure of the data. Analysis of these long vectors also requires massive computing power. Therefore, there is a need to develop clustering and subspace projection approaches for dimensionality reduction of tensor data.

1.1 Data Reduction by Clustering

One particular way of performing data reduction is clustering. There are a variety of clustering algorithms specific to problems from different fields, e.g., computer science, earth science, life science and economics [9–12]. The common objective of all clustering algorithms is grouping samples similar to each other based on a specified proximity measure, e.g., Euclidean distance, cosine distance or Mahalanobis distance [13]. Clustering algorithms can be broadly grouped into two categories: hierarchical algorithms and partitional algorithms [10].

Hierarchical clustering approaches construct a hierarchical structure from a distance matrix and make use of dendrograms to visualize the distances. Hierarchical algorithms are commonly implemented through either agglomerative or divisive methods. Agglomerative algorithms perform clustering by merging individual data points while divisive methods start with the entire dataset and successively divide it into subclusters [14]. In contrast to hierarchical approaches, partitional clustering algorithms directly assign samples into K clusters. Partitional algorithms can be further divided into square-error based algorithms, graph-theoretic algorithms and mixture resolving algorithms [10]. Square-error based algorithms such as K-means perform clustering based on specified cost functions which maximize inter-cluster distance and intra-cluster similarity. Mixture resolving approaches evaluate the probability distribution of each cluster and group the data points based on the distribution's parameters, which can be iteratively estimated by the expectation-maximization algorithm [15]. Graph-theoretic approaches make use of the similarity matrix formed using neighborhood information of data samples. Spectral properties carried by eigenvectors of the graph Laplacian assist in finding the optimum cut points of the graph [16].

However, all of these methods are mostly limited to clustering a single data set or unimodal data. In the case of multiway data, it may be important to find a common cluster structure across one of the modes. Consensus clustering methods address this problem by combining multiple clustering results obtained by applying various algorithms to the same dataset [17, 18]. The final result is obtained by minimizing the total dissimilarity between the target partition and each individual partition [17]. Similar to consensus clustering, multiview clustering aims to obtain a common clustering structure from multiple views of the same data [19]. Multiview clustering can equivalently be thought of as a higher-order data clustering problem, and tensor based approaches show superior performance compared to both single view and other multiview approaches [20]. More recently, the multi-way clustering problem has been defined through the use of an N-way similarity measure instead of pairwise similarity measures [21]. Using an N-way similarity measure yields an N-way super-symmetric affinity tensor, and Shashua et al. [22] showed that applying nonnegative tensor factorization to this affinity tensor yields the desired partitions. Moreover, He et al. [23] reformulated the multi-way clustering problem as a PARAFAC problem.

1.2 Higher-order Data Decomposition

Increased usage of higher-order datasets has prompted researchers to develop new multilinear analysis tools to better capture the hidden multilinear structures underlying the observed data. Advantages of using multiway analysis over two-way analysis in terms of uniqueness, robustness to noise and computational complexity have been shown in many studies [24–27]. Two of the most well-known higher-order decomposition methods are Tucker decomposition and Parallel Factor Analysis (PARAFAC).

PARAFAC, also known as canonical decomposition (CANDECOMP), is an extension of PCA to tensors and represents the tensor as a sum of rank-1 tensors. PARAFAC yields a unique decomposition and provides the only possible combination of rank-one tensors [28, 29]. For an N-way tensor X ∈ ℝ^{I_1 × I_2 × … × I_N}, the PARAFAC decomposition of X is:

    X = Σ_{r=1}^{R} λ_r a_r^(1) ∘ a_r^(2) ∘ … ∘ a_r^(N),        (1.1)

where R is a positive integer, λ_r is the weight of the rth rank-one tensor, a_r^(i) ∈ ℝ^{I_i} is the rth factor of the ith mode with unit norm, i ∈ {1, 2, ..., N} and r ∈ {1, 2, ..., R}, and "∘" denotes the outer product of vectors. The main restriction of the PARAFAC model is that the factors across different modes only interact factorwise. For example, for a 3-way tensor, the ith factor corresponding to the first mode only interacts with the ith factors of the second and third modes. However, this restriction also provides the same number of factors for each mode and yields a unique solution for the PARAFAC model [24, 30].

Tucker decomposition is a natural extension of the SVD to N-way tensors and decomposes the tensor into a core tensor multiplied by a matrix along each mode [31]. One version of Tucker decomposition, known as the HoSVD, is obtained by adding an orthogonality constraint to the component matrices. The HoSVD of an N-way tensor X ∈ ℝ^{I_1 × I_2 × … × I_N} is written as:

    X = S ×_1 U^(1) ×_2 U^(2) … ×_N U^(N)
      = Σ_{i_1=1}^{I_1} … Σ_{i_N=1}^{I_N} s_{i_1, i_2, …, i_N} c_{i_1}(U^(1)) ∘ c_{i_2}(U^(2)) ∘ … ∘ c_{i_N}(U^(N)),        (1.2)

where the matrix U^(n) contains the left singular vectors of the matrix obtained by unfolding X along the nth mode, c_i(·) denotes the ith column of its matrix argument, and the core tensor S is obtained by S = X ×_1 U^(1),⊤ ×_2 U^(2),⊤ … ×_N U^(N),⊤. In contrast to PARAFAC, Tucker models allow interactions between the factors obtained across the modes, and the core tensor encodes the strength of these interactions. However, the main drawback of Tucker decomposition is that the factors are not necessarily unique [24]. For example, the effect of rotating one of the mode matrices can be eliminated by inversely rotating the core tensor.

Both the PARAFAC and Tucker decompositions can be computed using alternating least squares (ALS) algorithms [2, 24]. ALS repeatedly estimates the component matrices one at a time while keeping the others fixed until convergence criteria are satisfied, as sketched below.
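To make the ALS procedure concrete, the following is a minimal NumPy sketch of ALS for a rank-R CP/PARAFAC model of a 3-way tensor. The helper names (`unfold`, `khatri_rao`, `cp_als`), the fixed iteration count and the random initialization are illustrative choices, not part of the cited algorithms; the weights λ_r of equation (1.1) are absorbed into the factor matrices here.

```python
import numpy as np

def unfold(X, mode):
    # Mode-n unfolding: arrange the mode-`mode` fibers as rows of a matrix.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Kronecker (Khatri-Rao) product of two factor matrices.
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, R, n_iter=100, seed=0):
    """ALS for a rank-R CP/PARAFAC model of a 3-way tensor X."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
    for _ in range(n_iter):
        # Update each factor in turn while the other two are held fixed:
        # each update is an ordinary linear least squares problem.
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C
```

In practice a convergence criterion on the reconstruction error would replace the fixed iteration count, and the factors would be normalized to unit norm with the scales collected in λ_r.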
However, ALS based algorithms do not guarantee convergence to the global optimum, and changing the starting points of the algorithm may yield different solutions. To accelerate the convergence of ALS, line search techniques and gradient based approaches can be integrated into ALS algorithms [32–34]. Similar to PARAFAC, Tucker decomposition can also be performed by ALS algorithms, and the closed form solution is based on computing the left singular vectors of the matrix obtained by unfolding the tensor along the related mode [2]. To speed up this process, [35] proposed to compute only the leading singular vectors, which yields a truncated version of the HoSVD. However, the truncated HoSVD does not provide the optimal solution for the specified rank. [36] proposed an iterative technique known as higher order orthogonal iteration (HOOI) and showed that the truncated decomposition obtained by HOOI provides a better fit to the original tensor.

Recently, alternatives to these major tensor decomposition techniques have also been proposed [30, 37–43]. De Lathauwer [37] proposed block-term decomposition (BTD), which unifies HoSVD and PARAFAC, and defined the essential conditions for uniqueness. Block-term decomposition factorizes the tensor as a sum of several tensors of the same size but with a lower multilinear rank than the original tensor. BTD is a more generalized version of the PARAFAC decomposition and allows us to model more complex data structures. However, the main challenge with the block-term decomposition is how to choose the appropriate block rank.

When the order of the tensor increases, computation and storage of the core tensor obtained by Tucker decomposition becomes more costly. To reduce the memory requirements as well as the computational complexity, various tensor decomposition techniques including hierarchical Tucker decomposition and tensor train decomposition have been proposed [31, 39, 40]. Hierarchical Tucker Decomposition (H-Tucker) recursively splits the modes based on a hierarchy and applies the SVD; the resulting binary tree contains a subset of the modes at each node [39]. Alternatively, Tensor-Train Decomposition (T-Train) has been proposed to compress large tensor data into smaller core tensors obtained by a sequence of QR and SVD decompositions of matrices [40].

Real-valued datasets may exhibit different characteristics such as nonnegativity, sparsity and symmetry, and researchers have proposed various tensor decomposition techniques incorporating these constraints. For example, Brachat et al. decompose symmetric tensors as the sum of rank-one symmetric tensors [41]. Real data containing frequency counts, pixel intensities and spectra are nonnegative, and a nonnegativity constraint can also be imposed on the tensor factorization [30]. Moreover, Tichavsky et al. [38] extended joint block diagonalization of nonsymmetric matrices to 3-way N × N × N tensors by decomposing the tensor into a block diagonal core tensor multiplied by factor matrices along each mode. More recently, Allen et al. [42] took the sparsity of the tensors into account and proposed sparse tensor decompositions (Sparse-HoSVD, Sparse-PARAFAC), where Sparse-PARAFAC regularizes the factors with an l1-norm penalty and Sparse-HoSVD performs sparse PCA on the matricized tensors.
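To close this section, the truncated HoSVD discussed above reduces to a sequence of matrix SVDs on the tensor unfoldings. Below is a minimal NumPy sketch of equation (1.2); the helper names and the user-supplied multilinear ranks `ranks` are illustrative assumptions.

```python
import numpy as np

def unfold(X, mode):
    # Mode-n unfolding of a tensor into a matrix.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_dot(X, U, mode):
    # Mode-n product X x_n U, via unfolding, multiplying and refolding.
    shape = list(X.shape)
    shape[mode] = U.shape[0]
    Y = U @ unfold(X, mode)
    Y = Y.reshape([shape[mode]] + shape[:mode] + shape[mode + 1:])
    return np.moveaxis(Y, 0, mode)

def hosvd(X, ranks):
    """Truncated HoSVD: factor matrices from the leading left singular
    vectors of each unfolding; core via mode products with the transposes."""
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    S = X
    for n, Un in enumerate(U):
        S = mode_dot(S, Un.T, n)   # S = X x_1 U1^T x_2 U2^T ... x_N UN^T
    return S, U

# Reconstruction: X_hat = S x_1 U1 x_2 U2 ... x_N UN via repeated mode_dot.
```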
1.3 Tensor Subspace Tracking

High-dimensional datasets often lie near a low dimensional subspace, and a massive number of subspace learning algorithms have been used in data reduction and compression, including reconstructive methods (principal component analysis (PCA), independent component analysis (ICA), nonnegative matrix factorization (NMF) [44–46]) and discriminative approaches (linear discriminant analysis (LDA), canonical correlation analysis (CCA) [47, 48]). To better deal with corrupted data and missing entries, researchers have focused on more robust subspace estimation techniques, and the problem of separating a sparse matrix and a low-rank matrix from their sum has received a lot of attention. The final goal usually is to either find the column span of the low-rank matrix or the support of the sparse one. There has been a large amount of recent work on batch methods for low-rank + sparse recovery and its various extensions, such as Principal Component Pursuit, Outlier Pursuit and Low-Leverage Decomposition [49–62]. However, the subspace that dynamic signals and streaming datasets lie in tends to evolve in time, and these batch algorithms are ill-suited to capture the changing dynamics of the subspace. To address this issue, a considerable number of online subspace learning algorithms have been proposed, e.g., SVD based approaches (incremental SVD, incremental PCA [63, 64]), recursive least squares based approaches (PAST, PETRELS [65, 66]) and robust approaches (GRASTA, REPROCS, online RPCA [67–70]). However, all of these approaches are designed for vector type datasets, and applying them to higher-order datasets requires vectorization. Vectorizing large datasets yields very long vectors and makes the optimization steps of these algorithms highly costly. The vectorization process also destroys the multi-dimensional couplings that the higher-order data contains.

To better preserve the multilinear structure of dynamic higher-order datasets, tensor-based approaches have recently been proposed to track dynamic tensor subspaces, such as dynamic tensor analysis, streaming tensor analysis and window based tensor analysis [71, 72]. However, these approaches provide computationally efficient frameworks for the analysis of streaming datasets by recursively updating subspace information and do not address the robustness of the subspace estimates. Goldfarb and Qin extended robust PCA to tensors (HoRPCA) by solving the low-rank + sparse recovery problem for general higher order tensors [73]. However, this method is computationally expensive and does not update the subspaces online. Li et al. [74] presented a robust subspace learning algorithm (RTSL) that incrementally updates the tensor subspace. Moreover, Nion et al. [75] proposed two adaptive approaches to track the PARAFAC decomposition of 3-way tensors. These approaches update the PARAFAC decomposition at every time point based on simultaneous diagonalization or minimization of a weighted least squares criterion. More recently, Mardani et al. [76] proposed an online subspace learning method based on nuclear norm minimization and extended this approach to matrices and higher order datasets. The extension of this algorithm to tensors takes advantage of the parallel factor analysis (PARAFAC) model to minimize tensor rank and considers temporal information as one of the tensor modes. Similar to [76], OLSTEC, proposed in [77], also tracks the subspace of partially observed higher-order data by making use of the PARAFAC decomposition.
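To illustrate the low-rank + sparse separation idea at the heart of this section, the following is a heuristic NumPy sketch that alternates singular value thresholding for the low-rank component with entrywise soft thresholding for the sparse component. The thresholds, iteration count and function names are illustrative simplifications; this is not the exact algorithm of any one cited method.

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: shrink singular values toward zero.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt

def soft(M, lam):
    # Entrywise soft thresholding.
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0)

def lowrank_plus_sparse(M, tau=None, lam=None, n_iter=200):
    """Heuristic alternating scheme for M ~ L (low rank) + S (sparse)."""
    m, n = M.shape
    tau = tau if tau is not None else 0.1 * np.linalg.norm(M, 2)
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S, tau)           # update the low-rank component
        S = soft(M - L, lam * tau)    # update the sparse component
    return L, S
```

Batch schemes of this kind operate on the full data matrix at once; the online and tensor-based methods surveyed above replace the repeated full SVD with incremental subspace updates and multilinear models.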
1.4 Tensor Based Approaches in Neuroscience Applications

Undoubtedly, the brain is one of the most complex biological systems in existence, which has motivated researchers to study it through different imaging modalities including EEG, MEG and MRI. All of these modalities provide either multichannel signals or multidimensional images, which are most naturally represented by tensors. Recently, tensor based approaches have been widely used for the analysis of these datasets [2]. For instance, Morup et al. [1] proposed to use the PARAFAC model to decompose event-related EEG data into channel × frequency × time components and identified the components corresponding to a visual event related potential (ERP) paradigm. In order to avoid degeneracies due to time shifts that occur in EEG and fMRI signals, and to better model the delays occurring in neuro-physiological systems, the shifted PARAFAC model has been proposed [78].

Tensor decomposition has also been used in BCI applications [79–84]. Lee et al. [79] presented non-negative tensor factorization (NTF) by adding a nonnegativity constraint to the PARAFAC decomposition and showed that the hidden multiway patterns found by NTF successfully classify EEG signals. Similarly, generalized tensor discriminant analysis (GTDA) [85] provides discriminative multilinear subspace information for single trial EEG classification [80, 81]. Likewise, slice oriented decomposition (SOD), developed for 3-way tensors, decomposes the tensor as an outer product of slice matrices, and features extracted by SOD have been employed in BCI applications [82]. Tucker models and their extensions have also been used in BCI studies to extract features for the classification of EEG signals [83, 84]. Moreover, brain source localization studies take advantage of tensor based methods [86–89]. These studies applied PARAFAC decomposition to the time-frequency-electrode tensor constructed from EEG data and showed that some of the PARAFAC components are related to artifacts while others are associated with the origins of epileptic seizures.

More recently, tensor based approaches have been used in fMRI studies for several different applications. For example, Barnathan et al. [90] proposed a hybrid algorithm combining tensor decompositions with the wavelet transform and applied it to fMRI data to cluster motor tasks. They also showed that this hybrid method outperforms voxelwise analysis and methods depending only on the wavelet transform or tensor decompositions in terms of space, time and accuracy. An extension of ICA for tensors was proposed in [91] and has been used to obtain more accurate activation maps for multi-subject and multi-session fMRI analysis with increased robustness against deviations from model assumptions. In addition, EEG-fMRI fusion applications have taken advantage of tensor decomposition, since multilinear approaches better reflect the intrinsic structure of multimodal multiway neuroimaging data [92, 93]. Tensor based approaches have also been used for compression of EEG signals and regression analysis of fMRI data [94–96].

In most of the studies mentioned above, the tensors are constructed from either the time-frequency distribution of the signals across channels, the time series across subjects for multichannel EEG recordings, or fMRI voxel intensity values. However, there is less work on employing tensors to represent the multivariate relationships among the different channels or voxels. Functional connectivity is defined as the statistical dependency between spatially remote neurophysiological events and is most commonly represented through the use of graph theoretic tools such as the adjacency/connectivity matrix. Recently, functional connectivity has been used to understand the coordinated and integrated activity of the human brain [97, 98], and it has been shown that synchronization between different brain regions plays an important role in different cognitive and emotional processes [99, 100] as well as in various neurological and psychiatric disorders [101–104]. More recently, tools from graph theory have been employed to analyze the functional connectivity of the brain by associating nodes with distinct brain regions and edges with pairwise interactions between them [105, 106]. Moreover, it has been shown that functional connectivity networks change dynamically on short time scales and exhibit task-related patterns [107–111]. Tensors provide a natural tool to represent these dynamic functional connectivity networks constructed across subjects, time, frequency and experimental conditions. In recent work, tensor decomposition methods such as HoSVD have been used to summarize these high order datasets and to identify a small number of network states [112, 113].

1.5 Organization of the Dissertation

This thesis makes fundamental contributions to data reduction of tensor type data with a particular focus on providing a better understanding of dynamic functional connectivity networks. In Chapter 2, we approach the problem of data reduction through clustering. In particular, we focus on obtaining a common community structure across multiple connectivity networks, which can be thought of as partitioning a 3-way tensor. Therefore, unlike classical graph clustering, this is a multi-graph clustering problem where the tensor corresponds to functional connectivity brain networks collected in time across multiple subjects. In order to understand the organization of functional connectivity networks, it is important to determine the community structure underlying these complex networks. Moreover, the study of functional networks is confounded by the fact that most neurophysiological studies consist of data collected from multiple subjects; thus, it is important to identify communities representative of all subjects. In Chapter 2, we propose a hierarchical consensus spectral clustering approach to address these problems. Furthermore, new information-theoretic criteria are introduced for selecting the optimal community structure. The proposed framework is applied to electroencephalogram (EEG) data collected during a study of error-related negativity (ERN) to better understand the community structure of functional networks involved in cognitive control.

The approach presented in Chapter 2 reduces the connectivity data across all subjects and within a given time interval into a single cluster structure. In Chapter 3, we propose an alternative way to reduce this high dimensional data through linear subspace estimation and update methods. In this approach, the dynamics of the connectivity networks are taken into account such that the data reduction is done across subjects for time intervals determined by the subspace tracking approach. Recent years have seen a growth of methods for subspace tracking of vector type data. The main contribution of this chapter is that we introduce a tensor based approach for tracking dynamic functional connectivity networks.
The proposed framework introduces a robust low-rank + sparse structure learning algorithm for tensors to separate the low-rank community structure of connectivity networks from sparse outliers. This framework is used to both identify change points, where the low-rank community structure of the FCN changes significantly, and summarize this community structure within each time interval. It is applied to the study of cognitive control from electroencephalogram (EEG) data during a Flanker task.

In Chapter 4, we address the issue of data reduction through tensor decomposition. To this aim, we propose a novel multiscale analysis technique to efficiently encode nonlinearities in tensor type data. The proposed method constructs data-dependent multiscale dictionaries to better represent the data and consists of two major steps: 1) constructing a tree structure by decomposing the tensor into a collection of permuted subtensors, and 2) constructing multiscale dictionaries by applying the HoSVD to each subtensor. We introduce different variations of the proposed MS-HoSVD method, including single-scale and multi-scale decompositions along with an adaptive pruning method. We derive a theoretical error bound for the proposed approach as well as provide an analysis of memory cost and computational complexity. Finally, we apply the proposed algorithm to data reduction of real datasets to illustrate the improvement in compression performance compared to the HoSVD, T-Train and H-Tucker decompositions. In addition, we show how the features obtained from the multiscale representation provide advantages over regular HoSVD and T-Train features for classifying tensors containing nonlinearities such as rotation or translation.

Chapter 2

Hierarchical Spectral Consensus Graph Clustering for Group Analysis of Functional Brain Networks

2.1 Introduction

Functional connectivity is defined as the statistical dependency between spatially remote neurophysiological events [97] and is the key to understanding how the coordinated and integrated activity of the human brain takes place [98]. In recent years, many studies have suggested synchronization of neuronal oscillations as one plausible mechanism in the interaction of spatially distributed neural populations [114]. Moreover, it has been shown that synchronization between different brain regions plays an important role in different cognitive and emotional processes [99, 100] as well as in various neurological and psychiatric disorders [101–104]. Synchronization refers to interdependencies among the activities of different neuronal assemblies and requires a focus on the temporal dynamics of neural networks in the millisecond range. Therefore, neuroimaging techniques with high temporal resolution, such as electroencephalography (EEG) [101, 115] and magnetoencephalography (MEG) [116], are the most appropriate tools.

Although phase synchrony is successful at quantifying pairwise interactions, it cannot completely describe the complex relationship between the function and organization of the brain. Recently, research in the area of complex networks, in particular graph theoretic methods, has been used to characterize the relationship between the topology and the function of the brain [117–120]. The bivariate relationships between neuronal populations are represented as graphs where the nodes correspond to the individual sites and the edges to the strength of the interaction quantified by functional connectivity measures.

The conventional approach to functional connectivity graph analysis extracts topological metrics either on the entire graph, i.e. global metrics, or at each node, i.e. local metrics. At the large topological scale, the small-world organization, whereby both integration (relatively high global efficiency/low path length) and segregation (relatively high local efficiency/clustering coefficient) of information between brain regions are supported, has been investigated thoroughly [121, 122]. The small-world model has also been shown to be significantly altered in various brain disorders and pathologies such as schizophrenia [123], autism [124], spinal cord injuries [125] and Alzheimer's disease [126]. At the local scale, the centrality or the degree of individual nodes can be computed and used to characterize brain graph reorganization during different tasks and events.

Although the global and local indices summarize key aspects of the connectivity networks, they do not provide any information about the intermediate scale of network organization, which is more accurately described by the community structure of the network [127, 128]. A community in a graph is defined as a densely connected set of nodes with sparse connections between communities in the network. It is hypothesized that the community structure of complex biological networks is indicative of robustness [127] and contributes to functionality [129] by compartmentalizing specific functions within certain cortical regions without perturbing the rest of the network [130]. Intra-cluster associations are thought to describe the segregation of information processing, while the inter-cluster associations testify to the integration of information processing across distant brain regions [131–133].

Identification of communities in functional connectivity graphs was originally addressed using methods like Principal Component Analysis (PCA) [134] and Independent Component Analysis (ICA) [135], which place non-physiological constraints, such as orthogonality and independence, on the obtained components. Recently, methods from spectral graph clustering have been used to detect communities [136, 137] by mapping the functional connections to a multidimensional subspace defined by a set of eigenvectors. However, these methods require a priori knowledge about the number of clusters and do not reveal a hierarchical decomposition of the network. Meunier and others [138–140] argue that most complex networks, including functional connectivity networks, possess a multi-scale community characteristic, i.e., are hierarchically decomposable into a finite number of modular levels. Therefore, a hierarchical decomposition of functional connectivity graphs is a more natural representation than conventional clustering approaches for community detection in brain networks.

A key challenge in identifying the community structure of brain networks is determining a common structure across multiple subjects. Current work either focuses on obtaining the community structure for the average connectivity network or on analyzing each subject individually and obtaining a common community structure using consensus clustering techniques [141]. Averaging neglects the variance across subjects and can be influenced by outliers. Consensus clustering, also known as clustering ensembles, yields a stable and robust final clustering that is in agreement with the individual clusterings through a consensus function [18, 142].
Therefore, in this chapter we will introduce a hierarchical consensus based approach in which the best community structure is identified by combining information shared across multiple subjects.

In this chapter, we first quantify functional connectivity using a new time-varying measure of phase synchrony and apply it to multichannel EEG data to quantify pairwise synchrony. The resulting connectivity matrices are treated as weighted undirected graphs representing each subject. We then introduce a new hierarchical graph partitioning method based on spectral graph theory, in particular the Fiedler bi-partitioning method [143]. This partitioning method is combined with two novel information theoretic criteria, homogeneity and completeness, to introduce a non-greedy consensus based hierarchical algorithm, the Fiedler Consensus Clustering Algorithm (FCCA), that is designed to reveal multiple levels of community organization common across subjects. Next, an information-theoretic quality measure is introduced to identify the optimal community structure. Finally, the proposed approach is applied to EEG data collected during a study of cognitive control in the brain based on the error-related negativity to test the approach on a known biological signal.

2.2 Background

2.2.1 Time-Varying Measure of Phase Synchrony

Phase synchronization within different frequency bands across the brain has been shown to be a plausible mechanism explaining neuronal integration [114, 144]. Two commonly used measures for quantifying time-varying phase synchrony are based on the Hilbert transform and the complex wavelet transform [145–147]. It has been observed that the two approaches are similar in their results, with the wavelet based methods giving higher resolution phase synchrony estimates over time and frequency, especially at the low frequency range [145]. Although the wavelet based phase synchrony estimates address the issue of non-stationarity, they suffer from a resolution tradeoff, i.e. the frequency resolution is high at low frequencies and low at high frequencies. For this reason, there is a need for high time-frequency resolution phase distributions that can better track dynamic changes in phase synchrony. In the proposed work, pairwise functional connectivity will be quantified using a recently introduced time-frequency phase estimation method based on the Reduced Interference Rihaczek distribution (RID-Rihaczek) [148, 149].

The Reduced Interference Rihaczek distribution (RID-Rihaczek) is given by:

    C(t, ω) = ∬ exp(−(θτ)²/σ) exp(jθτ/2) A(θ, τ) e^{−j(θt + τω)} dτ dθ,        (2.1)

where exp(−(θτ)²/σ) is the Choi-Williams kernel used to filter out the cross-terms, A(θ, τ) = ∫ x(u + τ/2) x*(u − τ/2) e^{jθu} du is the ambiguity function of the signal, and exp(jθτ/2) is the kernel corresponding to the Rihaczek distribution [150]. The phase difference between two signals based on this complex distribution is computed as

    Φ_12(t, ω) = arg[ C_1(t, ω) C_2^*(t, ω) / ( |C_1(t, ω)| |C_2(t, ω)| ) ],        (2.2)

where C_1(t, ω) and C_2(t, ω) refer to the complex energy distributions of the two signals x_1(t) and x_2(t), respectively, and a synchrony measure quantifying the intertrial variability of the phase differences, the phase locking value (PLV), is defined as

    PLV(t, ω) = (1/N) | Σ_{k=1}^{N} exp( jΦ_12^k(t, ω) ) |,        (2.3)

where N is the number of trials and Φ_12^k(t, ω) is the time-varying phase estimate between two signals recorded at different electrodes for the kth trial. If the phase difference varies little across the trials, the PLV is close to 1.
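A minimal sketch of how equations (2.2) and (2.3) can be computed, assuming the complex time-frequency distributions of the two signals are already available for each trial as arrays of shape (trials, time, frequency); the variable names are illustrative.

```python
import numpy as np

def plv(C1, C2):
    """Phase locking value from two complex TF distributions,
    each of shape (n_trials, n_time, n_freq)."""
    # Eq. (2.2): time-varying phase difference for each trial.
    cross = C1 * np.conj(C2)
    phase_diff = np.angle(cross / (np.abs(C1) * np.abs(C2) + 1e-12))
    # Eq. (2.3): intertrial consistency of the phase differences.
    return np.abs(np.exp(1j * phase_diff).mean(axis=0))
```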
Compared to existing synchrony measures, in our previous work we have shown that the RID-Rihaczek based phase synchrony measure is more robust to noise, has uniformly better time-frequency resolution with less bias, and performs better at detecting actual synchrony within a group of oscillators [148].

2.2.2 Graph Theory

Recent developments in the quantitative analysis of complex networks, based largely on graph theory, have been rapidly translated to studies of brain network organization [105, 151]. In this approach, the different regions of the brain correspond to the nodes of the network and the pairwise functional connectivity corresponds to the edges of the network. An undirected, connected, weighted graph, G = (V, E, W), consisting of a finite set of N nodes, V = {v_i | i ∈ {1, 2, ..., N}}, a set of edges, E, associated with each node pair, and a weighted adjacency matrix W, can be used to represent these functional connectivity networks. For a binary graph, w_ij ∈ {0, 1}, and for a weighted graph, w_ij ∈ [0, 1]. In an undirected graph, the edge weights are represented by a symmetric weighted adjacency matrix W = [w_ij], where i, j ∈ {1, 2, ..., N}. The sum of all elements along the ith row of W, d_i = Σ_{j=1}^{N} w_ij, is the degree of node v_i. When m graphs representing the same network are available, their adjacency matrices are represented as the set W = {W^(r)}, where r ∈ {1, 2, ..., m}.

2.2.3 Spectral Clustering

A commonly used approach to identifying the community structure within graphs is spectral clustering, thanks to its simple implementation and promising performance. Given a weighted and undirected graph G, the spectrum of the graph is represented by the eigenvalues and eigenvectors of the graph Laplacian matrix L = D − W, where W is the adjacency matrix and D is the degree matrix containing the degrees of the nodes along the diagonal [152, 153]. Different versions of the Laplacian matrix, i.e. the symmetric normalized and the random walk normalized versions, have been used, leading to different versions of the spectral clustering algorithm. In this study, we use the symmetric version of the normalized Laplacian matrix, defined as L = D^{−1/2}(D − W)D^{−1/2}, which yields more robust clustering solutions [136].

Since the normalized Laplacian matrix is square, symmetric and positive semi-definite, its eigenvectors and eigenvalues are described by the equation Lu_i = λ_i u_i, the eigenvectors {u_1, u_2, ..., u_N} are orthonormal, and the eigenvalues {λ_1, λ_2, ..., λ_N} are real and nonnegative. The spectral clustering algorithm finds the spectrum of G through the eigendecomposition of its Laplacian matrix and embeds the original vertices of G into a low dimensional spectral domain formed by the graph spectrum. Typically, a subset of eigenvectors, {u_1, u_2, ..., u_k} where k ≪ N, is extracted, and an optimization technique is iteratively applied to find cluster centers within the data using algorithms such as k-means, fuzzy k-means [154], generalized synchronization cluster analysis [137], the Ng-Jordan-Weiss algorithm (NJW) [155], or power iteration clustering (PIC) [156]. This transformation enhances the intrinsic relationships among the original vertices, leading to improved cluster identification in the new low dimensional space [136, 137, 152, 157].

An alternative to spectral clustering is to evaluate only one eigenvector for the purpose of bi-partitioning, i.e. identifying a minimal cut of the graph. This eliminates the problem of searching for the optimal set of eigenvectors.
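The k-way spectral clustering recipe described above reduces to an eigendecomposition followed by a standard clustering step. A minimal sketch, assuming scikit-learn's KMeans for the final step and a user-supplied number of clusters k; the row normalization follows the common NJW-style recipe rather than any one cited algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster an N x N symmetric weighted adjacency matrix into k groups
    using the symmetric normalized Laplacian (assumes a connected graph)."""
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = D_isqrt @ (np.diag(d) - W) @ D_isqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    U = eigvecs[:, :k]                         # embed nodes in k smallest eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```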
According to Holzrichter et al. [158], the optimal minimal cut of a graph is defined by the eigenvector, u_2, associated with the second smallest non-zero eigenvalue, λ_2, of the Laplacian matrix. This eigenvector is referred to as the Fiedler vector, u_F, and defines a set of two clusters {C_1, C_2} where

    v_i ∈ C_1 if u_F(i) ≥ 0,
    v_i ∈ C_2 if u_F(i) < 0.        (2.4)

The Fiedler partition can be iteratively applied to each successive partition in order to achieve a clustering with k > 2. In this study, the partitioning of a graph using the Fiedler vector will be referred to as FiedlerPartition(G), which partitions the nodes of graph G into two clusters, G_1 and G_2, such that G_1 ∪ G_2 = V. Partitioning the graph according to the Fiedler vector generates a community structure in which the intra-cluster nodal relationships are maximally 'strong'. Repeating this partitioning process on the subsequent subclusters reveals a hierarchical configuration of the network structure.

2.2.4 Consensus Clustering

In many clustering problems, it is common to apply different algorithms to the same data and then use a consensus method to combine the results [159]. In this chapter, a similar framework for obtaining a common community structure from multiple graphs is proposed. Three popular consensus clustering methods are consensus averaging, majority voting and the hypergraph partitioning algorithm (HGPA). The first approach averages the m adjacency matrices to obtain Ŵ = [ŵ_ij], where

    ŵ_ij = (1/m) Σ_{r=1}^{m} w_ij^(r).        (2.5)

This approach is computationally efficient but loses the inter-subject variability. The second commonly used approach for obtaining a common community structure across multiple graphs is to identify the community structure of each individual graph and then combine the information across the multiple community structures to identify a global community structure. The combination of community structures across multiple graphs or clustering solutions has been accomplished through different functions such as majority voting [160], mixture-model approaches [161] and disagreement minimization methods [142].

Finally, the HyperGraph-Partitioning Algorithm (HGPA) is used to extract a common clustering structure [18, 162]. HGPA treats each cluster across all base clusterings as a hyperedge within a single global graph. The algorithm is a multilevel graph partitioning system which partitions this graph in three steps: 1) compress the graph by collapsing hyper-edges, 2) partition the compressed graph using a minimum cut objective function, and 3) decompress the partitions and repeat the process. HGPA has a computational complexity of O(kNh), where h is the number of hyperedges. However, its overall complexity depends on the total complexity of the clustering algorithms used to obtain the base clusterings. HGPA has the disadvantage of generating clusters of approximately equal sizes, even though in real networks equally sized clusters are unlikely.

2.2.5 Modularity

The most commonly used cluster quality measure, modularity [163], compares a community structure to the expected community structure of a random graph, such that there exists a high number of edges within clusters and a low number of edges between clusters. Modularity for a weighted graph is defined as

    Q = (1/2z) Σ_{ij} ( w_ij − d_i d_j / (2z) ) σ_ij,        (2.6)

where z is the sum of all edge weights in the graph and σ_ij = 1 if v_i and v_j are in the same cluster and 0 otherwise.
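Equation (2.6) translates directly into a few lines of code. A minimal sketch, where `labels[i]`, holding the cluster index of node v_i, is an illustrative input convention:

```python
import numpy as np

def modularity(W, labels):
    """Weighted modularity Q of eq. (2.6) for adjacency W and a labeling."""
    labels = np.asarray(labels)
    d = W.sum(axis=1)          # weighted node degrees
    z = W.sum() / 2.0          # sum of all edge weights (each counted once)
    same = labels[:, None] == labels[None, :]    # sigma_ij
    return float(((W - np.outer(d, d) / (2 * z)) * same).sum() / (2 * z))
```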
2.2.4 Consensus Clustering

In many clustering problems, it is common to apply different algorithms to the same data and then use a consensus method to combine the results [159]. In this chapter, a similar framework for obtaining a common community structure from multiple graphs is proposed. Three popular consensus clustering methods are consensus averaging, majority voting and the hypergraph partitioning algorithm (HGPA). The first approach averages the $m$ adjacency matrices to obtain $\hat{\mathbf{W}} = [\hat{w}_{ij}]$, where

$$ \hat{w}_{ij} = \frac{1}{m} \sum_{r=1}^{m} w_{ij}^{(r)}. \qquad (2.5) $$

This approach is computationally efficient but loses the inter-subject variability. The second commonly used approach for obtaining a common community structure across multiple graphs is to identify the community structure of each individual graph and then combine the information across the multiple community structures to identify a global community structure. The combination of community structures across multiple graphs or clustering solutions has been accomplished through different functions such as majority voting [160], mixture-model approaches [161], and disagreement minimization methods [142].

Finally, the HyperGraph-Partitioning Algorithm (HGPA) is used to extract a common clustering structure [18, 162]. HGPA treats each cluster across all base clusterings as a hyperedge within a single global graph. The algorithm is a multilevel graph partitioning system which partitions this graph in three steps: 1) compress the graph by collapsing hyperedges, 2) partition the compressed graph using a minimum cut objective function, and 3) decompress the partitions and repeat the process. HGPA has a computational complexity of $O(kNh)$, where $h$ is the number of hyperedges. However, its overall complexity depends on the total complexity of the clustering algorithms used to obtain the base clusterings. HGPA has the disadvantage of generating clusters of approximately equal sizes, even though in real networks equally sized clusters are unlikely.

2.2.5 Modularity

The most commonly used cluster quality measure, modularity [163], compares a community structure to the expected community structure of a random graph, such that there exists a high number of edges within clusters and a low number of edges between clusters. Modularity for a weighted graph is defined as

$$ Q = \frac{1}{2z} \sum_{ij} \left( w_{ij} - \frac{d_i d_j}{2z} \right) \sigma_{ij}, \qquad (2.6) $$

where $z$ is the sum of all edge weights in the graph and $\sigma_{ij} = 1$ if $v_i$ and $v_j$ are in the same cluster and $0$ otherwise. Unfortunately, this definition of modularity does not always attain its highest value for the true community structure [164] and can reveal a suboptimal structure due to the simplistic random model, computed through $\frac{d_i d_j}{2z}$ in the modularity equation [165].

2.2.6 Cohen's Kappa

One of the most commonly used measures to quantify the quality of an observed clustering with respect to the true structure is Cohen's Kappa. Cohen's Kappa [166] is a measure of agreement between two observers and is defined as

$$ \kappa = \frac{p_o - p_e}{1 - p_e}, \qquad (2.7) $$

where $p_o$ is the probability of observed agreement and $p_e$ is the probability of expected agreement. Cohen's Kappa can also be used to quantify the agreement between the ground truth clustering map ($A$) and the clustering map ($B$) obtained from a clustering algorithm, as in Table 2.1. $A_{i,j}$ is equal to 1 if nodes $i$ and $j$ are assigned to the same cluster and 0 otherwise; similarly, $B_{i,j}$ is equal to 1 if nodes $i$ and $j$ are assigned to the same cluster and 0 otherwise. In Table 2.1, $a$ is the number of node pairs which are correctly identified as being in the same cluster, $b$ is the number of node pairs which are falsely identified as being in the same cluster, $c$ is the number of node pairs which are falsely identified as not being in the same cluster, and $d$ is the number of node pairs which are correctly identified as not being in the same cluster. Based on this table, $p_o$ and $p_e$ can be computed as:

$$ p_o = \frac{a + d}{a + b + c + d}, \qquad p_e = \frac{f_1 f_2 + g_1 g_2}{(a + b + c + d)^2}. \qquad (2.8) $$

The standard error of the kappa statistic is also known and is defined as [167]:

$$ SE(\kappa) = \sqrt{\frac{p_o (1 - p_o)}{(a + b + c + d)(1 - p_e)^2}}. \qquad (2.9) $$

Table 2.1: Agreement table for the observers used for computing the kappa score

                                 Observer 1 (Ground Truth)
  Observer 2 (Measurement)       A_ij = 1         A_ij = 0         Total
  B_ij = 1                       a                b                f2 = a + b
  B_ij = 0                       c                d                g2 = c + d
  Total                          f1 = a + c       g1 = b + d

2.3 Information Theoretic Cluster Quality Measure

One problem with hierarchical clustering algorithms is determining the optimal number of clusters. In this section, we introduce a new measure to quantify the quality of the resulting clusters in the absence of 'ground truth' information, i.e. knowledge about the actual cluster structure.

2.3.1 Inter and Intra Edge Distribution

By the definition of a cluster, the pairwise connections within a cluster must be stronger than the inter-cluster connections. In this study, we propose measures that evaluate the quality of a particular clustering structure based on the distributions of the inter- and intra-cluster edges across $m$ graphs. These distributions will be defined similarly to probability mass functions (pmfs). Prior to defining the pmfs of intra-cluster and inter-cluster edges, a function that maps the continuous edge values to a discrete alphabet is defined as $f: \mathbf{W}_i^{(r)} \to \mathbf{S}_i^{(r)}$, where $\mathbf{W}_i^{(r)} = \{w_{ij}^{(r)} \in [0,1]\}$ refers to the $i$th row of the $r$th adjacency matrix, $\mathbf{S}_i^{(r)} = \{s_{ij}^{(r)} \in \{1, 2, \ldots, N\}\}$, and $r \in \{1, 2, \ldots, m\}$. The elements of each row of the connectivity matrix across subjects are mapped to discrete integer values between 1 and $N$ to eliminate the variation of edge strengths across subjects and to extract only relational information about the pairwise edge strengths. We propose to use the rank function for this mapping, such that the node pair with the largest edge weight is assigned 1 and the weakest node pair is assigned $N$.
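As an illustration, the row-wise rank mapping can be sketched as follows; scipy.stats.rankdata stands in for the rank function described above, and the treatment of ties and of the zero-weight self-connections follows simple assumed conventions rather than the exact implementation used in this study.

```python
import numpy as np
from scipy.stats import rankdata

def rank_map(W):
    """Map each row of an adjacency matrix to ranks in {1, ..., N}.

    The strongest edge in a row receives rank 1 and the weakest rank N,
    discarding absolute edge strengths and keeping only relational order.
    """
    N = W.shape[0]
    S = np.zeros_like(W, dtype=int)
    for i in range(N):
        # rankdata assigns 1 to the smallest value, so invert the ranks
        # to give the largest weight rank 1 and the smallest weight rank N
        S[i] = N + 1 - rankdata(W[i], method='ordinal').astype(int)
    return S

# Example: rank matrix for one subject's 63 x 63 connectivity matrix
W = np.random.rand(63, 63)
W = (W + W.T) / 2          # symmetrize
S = rank_map(W)            # entries in {1, ..., 63}
```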
When a particular cluster set $C = \{c_1, c_2, \ldots, c_k\}$ is identified, the probability mass function of intra-cluster ranks for a particular cluster, $c_t$, is defined as

$$ P_{c_t}^{intra}(\beta) = \frac{F_{c_t}^{intra}(\beta)}{\sum_{\beta=1}^{N} F_{c_t}^{intra}(\beta)}, \qquad (2.10) $$

where

$$ F_{c_t}^{intra}(\beta) = \sum_{r=1}^{m} \sum_{i,j} \delta(s_{ij}^{r}, \beta) \;\Big|\; v_i, v_j \in c_t, \qquad (2.11) $$

$t \in \{1, 2, \ldots, k\}$, $\beta \in \{1, 2, \ldots, N\}$, and

$$ \delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise.} \end{cases} $$

This function computes the frequency with which node pairs with varying strengths of connectivity are assigned to the same community. Similarly, the probability mass function of inter-cluster ranks for cluster $c_t$ is defined as

$$ P_{c_t}^{inter}(\beta) = \frac{F_{c_t}^{inter}(\beta)}{\sum_{\beta=1}^{N} F_{c_t}^{inter}(\beta)}, \qquad (2.12) $$

where

$$ F_{c_t}^{inter}(\beta) = \sum_{r=1}^{m} \sum_{i,j} \delta(s_{ij}^{r}, \beta) \;\Big|\; v_i \in c_t;\; v_j \notin c_t. \qquad (2.13) $$

This function computes the frequency with which node pairs with varying strengths of connectivity are assigned to different communities.

2.3.2 Homogeneity and Completeness

The next step is to quantify homogeneity and completeness, two principal characteristics which determine the quality of a clustering. A homogeneous cluster contains only data points which belong to the same class, while a complete cluster contains all possible data points within the sample space (Fig. 2.1).

Figure 2.1: An illustration of variations in homogeneity and completeness in a group of objects where the true number of communities is 3; a) High homogeneity and low completeness; b) High completeness and low homogeneity; c) High homogeneity and completeness.

Similar measures, such as the F-measure [168] and the V-measure [169], have been used in the literature to quantify the quality of a cluster. Both measures, however, require a priori knowledge of the class labels. In most cases, this 'ground truth' is unknown and therefore alternative measures of cluster accuracy are needed. In this study, we propose new homogeneity and completeness measures which depend on the edge strengths, and we quantify the quality of a clustering structure using the harmonic mean of the homogeneity and completeness measures, similar to the V-measure.

We use the observation that homogeneity is inversely related to the variance of edge ranks within a cluster. If a particular cluster is homogeneous, then we would expect the pairwise connection strengths among the members of that cluster to be close to each other, implying that the ranks of the weights have small variance. We propose to quantify this variation through a measure of normalized entropy, such that the lowest homogeneity score is obtained for a cluster with a uniform distribution of ranks, i.e. large variation among the edge weights, and the maximum homogeneity score is obtained for a cluster containing only one rank. Therefore, a normalized entropy measure of the cluster's intra-cluster rank distribution is indicative of homogeneity. However, as the cluster size gets smaller, the intra-cluster rank distribution naturally becomes more concentrated, thus increasing the homogeneity. To account for this, we introduce a normalization term in the definition of homogeneity as follows:

$$ H_{c_t} = \alpha_{c_t} \left( 1 - \frac{H(P_{c_t}^{intra})}{\log_2 N} \right), $$

where $\alpha_{c_t} = \frac{|c_t|}{N}$; this measure is always between 0 and 1.
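A minimal sketch of the intra-cluster rank pmf (Eq. 2.10) and the normalized-entropy homogeneity score follows. The function names are ours, and the rank matrices are assumed to come from the rank mapping sketched in Section 2.3.1.

```python
import numpy as np

def intra_rank_pmf(S_list, cluster, N):
    """Pmf of intra-cluster ranks for one cluster across m graphs (Eq. 2.10).

    S_list  : list of (N, N) integer rank matrices, one per graph.
    cluster : array of node indices belonging to cluster c_t.
    """
    counts = np.zeros(N)
    for S in S_list:
        sub = S[np.ix_(cluster, cluster)]        # ranks of intra-cluster pairs
        for beta in range(1, N + 1):
            counts[beta - 1] += np.sum(sub == beta)
    return counts / counts.sum()

def homogeneity(S_list, cluster, N):
    """Normalized-entropy homogeneity score H_{c_t} in [0, 1]."""
    p = intra_rank_pmf(S_list, cluster, N)
    p = p[p > 0]                                 # convention: 0 log 0 = 0
    H = -np.sum(p * np.log2(p))                  # entropy of the rank pmf
    alpha = len(cluster) / N                     # cluster-size normalization
    return alpha * (1 - H / np.log2(N))
```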
Similarly, we define a metric using the relative entropy between the inter-cluster and intra-cluster rank distributions to quantify completeness: for a cluster to be complete, the rank distributions of its inter-edges and intra-edges are expected to differ from each other. The similarity between two distributions is commonly quantified using divergence measures. In this study, we propose to use the Jensen-Shannon divergence [170], since it is symmetric and bounded. With respect to completeness, a divergence approaching 1 is synonymous with increased completeness. Completeness of a cluster is therefore defined as $C_{c_t} = JS(P_{c_t}^{intra}, P_{c_t}^{inter})$, where

$$ JS(p, q) = \frac{1}{2} \left( \sum_i p(i) \log_2 \frac{p(i)}{0.5\,(p(i) + q(i))} + \sum_i q(i) \log_2 \frac{q(i)}{0.5\,(p(i) + q(i))} \right). $$

Average completeness, $\hat{C} = \frac{1}{k}\sum_{t=1}^{k} C_{c_t}$, and average homogeneity, $\hat{H} = \frac{1}{k}\sum_{t=1}^{k} H_{c_t}$, are computed across all clusters. Similar to the balanced F-score, our quality measure $U$ for the final clustering structure is defined as the harmonic mean of the average completeness and the average homogeneity:

$$ U = \frac{2}{\frac{1}{\hat{C}} + \frac{1}{\hat{H}}}. \qquad (2.14) $$

2.4 Fiedler Consensus Clustering Approach

Since the literature has consistently demonstrated that the Fiedler vector is highly effective in partitioning graphs [171, 172], in this study we use the Fiedler vector to perform consensus clustering across multiple weighted graphs. The original connectivity matrices are bi-partitioned into two clusters using the Fiedler partitioning method. This results in a cluster matrix for the $r$th subject, $T^r$, such that

$$ T^r(i, j) = \begin{cases} 1 & \text{if nodes } v_i, v_j \text{ are in the same cluster} \\ 0 & \text{otherwise,} \end{cases} $$

where $r \in \{1, 2, \ldots, m\}$. In order to find the common community structure across multiple graphs, we introduce a co-occurrence matrix $P$, where $P(i, j) = \frac{1}{m}\sum_{r=1}^{m} T^r(i, j)$ and $P(i, j) \in [0, 1]$. $P(i, j)$ is the probability that a pair of nodes are members of the same cluster across the multiple graphs. The adjacency matrix reflects the strength of a direct relationship between a node pair, whereas $P$ reflects the likelihood that a pair of nodes are in the same cluster across all subjects.

The Laplacian matrix of $P$ is computed and its Fiedler vector is found to form a bi-partition of $P$ into a community structure composed of clusters $c_1$ and $c_{-1}$. Since $P$ represents the probability that a node pair should be clustered together, the Fiedler partition of $P$ represents the community structure common to all graphs. The initial partition set, $C = \{c_1, c_{-1}\}$, contains 2 clusters, but if $k > 2$ is desired, the process can be repeated by selecting a cluster in $C$ to partition. In this case, $c_1$ or $c_{-1}$ is selected based on the $\zeta$ values of each cluster.

Algorithm 2.1 Fiedler Consensus Clustering Algorithm
1: Input: $m$ $N$-dimensional graphs, $\mathcal{G} = \{G_1, G_2, \ldots, G_m\}$ with vertices $V = \{v_1, v_2, \ldots, v_N\}$ and edges $E^r = \{w_{ij}^r : v_i, v_j \in V\}$ such that $G_r = (V, E^r)$ and $r \in \{1, 2, \ldots, m\}$.
2: Input: Number of clusters, $k$.
3: Output: $k$ clusters $C = \{c_1, c_2, \ldots, c_k\}$ where $c_j \subset V$.
4: $C \leftarrow \emptyset$
5: for $t = 2$ to $k$ do
6:   $P' = \mathbf{0}_{|V| \times |V|}$
7:   for $r = 1$ to $m$ do
8:     Extract the submatrix $\hat{G}_r \subset G_r$ such that $\hat{G}_r = (V, E^r)$
9:     $(V_1, V_2) = $ FiedlerPartition($\hat{G}_r$)
10:    $P'(i, j) = P'(i, j) + T^r(i, j)$, where $T^r(i, j) = 1$ if nodes $v_i, v_j \in V_1$ or $v_i, v_j \in V_2$, and $0$ otherwise
11:  end for
12:  $P = P'/m$
13:  $(V_1, V_2) = $ FiedlerPartition($P$)
14:  $C \leftarrow C \cup \{V_1, V_2\}$
15:  if $t < k$ then
16:    score $= \mathbf{0}_{t \times 1}$
17:    for $\gamma = 0 : 0.1 : 1$ do
18:      $\Lambda = [\zeta_1(\gamma), \zeta_2(\gamma), \ldots, \zeta_t(\gamma)]$
19:      $j = \arg\min_q \Lambda(q)$, $q \in \{1, 2, \ldots, t\}$
20:      score$(j) = $ score$(j) + 1$
21:    end for
22:    $V \leftarrow \{c_i \mid i = \arg\max_q \text{score}(q),\; q \in \{1, 2, \ldots, t\}\}$
23:    $E^r = \{w_{ij}^r : v_i, v_j \in V\}$
24:    $C \leftarrow C \setminus \{c_i\}$
25:  end if
26: end for
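To illustrate the consensus step, the following sketch builds the co-occurrence matrix P from the individual Fiedler bipartitions and then bipartitions P itself. It reuses the fiedler_partition function sketched in Section 2.2.3 (passed in as an argument) and is a simplified single-level version of Algorithm 2.1, not the full hierarchical procedure; zeroing the diagonal of P before treating it as an adjacency matrix is our assumption.

```python
import numpy as np

def consensus_bipartition(W_list, fiedler_partition):
    """One level of Fiedler consensus clustering over m weighted graphs.

    W_list : list of (N, N) adjacency matrices, one per subject.
    Returns the common bipartition obtained from the co-occurrence matrix P.
    """
    N = W_list[0].shape[0]
    P = np.zeros((N, N))
    for W in W_list:
        C1, _ = fiedler_partition(W)                  # subject-level bipartition
        T = np.equal.outer(C1, C1).astype(float)      # T(i,j) = 1 if co-clustered
        P += T
    P /= len(W_list)                                  # co-occurrence probabilities
    np.fill_diagonal(P, 0)                            # treat P as a weighted adjacency matrix
    return fiedler_partition(P)                       # community structure common to all graphs
```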
At each step of partitioning, a weighted sum of homogeneity and completeness is computed for each cluster to select the cluster to be partitioned at the next level:

$$ \zeta_{C_t} = \gamma\, \hat{C}_{C_t} + (1 - \gamma)\, \hat{H}_{C_t}, \qquad (2.15) $$

where $\gamma \in [0, 1]$. For each cluster, a set of $\zeta$ values is created by choosing a range of $\gamma \in [0, 1]$. The cluster whose $\zeta$ values are lower than the others' for a majority of the $\gamma$ values is selected for partitioning.

Algorithm 2.2 Fiedler Partition
1: Input: graph $G = (V, E)$.
2: Output: Vertex sets $V_1$ and $V_2$.
3: Compute the normalized Laplacian matrix, $\mathbf{L}$, of $G$.
4: Compute the $|V|$ eigenvectors, $\mathbf{u}$, and eigenvalues, $\lambda$, of $\mathbf{L}$.
5: Order the eigenvalues in ascending order: $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_{|V|}$.
6: $\mathbf{u}_F = \mathbf{u}_i$ where $i = \arg\min_q \{\lambda_q \mid \lambda_q \neq 0\}$.
7: Sort the elements of $\mathbf{u}_F$ and find $\mathbf{u}_F(i)$ which has the maximum gap with its ensuing element.
8: for $j = 1$ to $|V|$ do
9:   $v_j \in V_1$ if $\mathbf{u}_F(j) \le \mathbf{u}_F(i)$; $v_j \in V_2$ if $\mathbf{u}_F(j) > \mathbf{u}_F(i)$
10: end for

Next, sub-matrices are extracted from the original connectivity matrices such that they only contain the nodes of the chosen cluster. These sub-matrices are used to derive a new sub-co-occurrence matrix, $P_y$, where $y = 1$ if cluster $c_1$ was selected and $y = -1$ if cluster $c_{-1}$ was selected. The Fiedler partition then yields two new clusters, $c_1^y$ and $c_{-1}^y$. The final cluster set, $C = \{c_{-y}, c_1^y, c_{-1}^y\}$, is a concatenation of the two new clusters with the original cluster that was not chosen for bi-partitioning. Algorithm 2.1 describes this process for obtaining community structures for a given number of clusters.

2.5 Results

In this section, we evaluate the effectiveness of the proposed Fiedler Consensus Clustering Algorithm (FCCA) for revealing the hierarchical community structure across multiple graphs. The optimal community structure is identified by maximizing the quality measure $U$, which is the harmonic mean of the homogeneity $\hat{H}$ and completeness $\hat{C}$ scores defined in Section 2.3. First, we compare the traditional modularity measure against the proposed quality measure for determining the optimal number of clusters for the FCCA. Then, we compare the FCCA to other consensus clustering approaches, including averaging, voting and HGPA, for different types of network structure: varying inter-cluster strengths, outliers within a group, and overlapping clusters. Finally, the proposed clustering algorithm and quality measure are used to identify the community structure which best describes the multivariate relationships across multiple subjects from connectivity graphs obtained from EEG data. For the evaluation of computational complexity, we note that all data analysis was performed on a 2.4 GHz Intel Core i5 processor running Windows 7.

2.5.1 Quality versus Modularity

Simulated networks consisting of 63 nodes and composed of 3 equal-sized clusters were generated 100 times to evaluate the performance of the proposed quality measure, $U$, against the modularity metric for determining the true community structure. The weights of the intra-cluster edges were drawn from a Gaussian distribution truncated to $[0, 1]$ with mean $\mu_{intra} = 0.6$ and standard deviation $\sigma_{intra} = 0.1$. Similarly, the inter-cluster edge weights were drawn from a truncated Gaussian distribution with mean $\mu_{inter} = 0.1$ and standard deviation $\sigma_{inter} = 0.2$. To compare the robustness of the quality metric with the modularity metric for identifying unequal-sized clusters, the number of nodes in the first cluster ($c_1$) was gradually increased from 21 to 49 while the sizes of the other two clusters ($c_{21}$ and $c_{22}$) were decreased from 21 to 7.
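For reproducibility, a sketch of the simulated-network generator is given below; scipy.stats.truncnorm is used here as one way to draw the truncated Gaussian edge weights, and the helper names and exact parameterization are assumptions rather than the code used to produce the reported results.

```python
import numpy as np
from scipy.stats import truncnorm

def trunc_gauss(mu, sigma, size):
    """Draw samples from a Gaussian truncated to [0, 1]."""
    a, b = (0 - mu) / sigma, (1 - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size)

def simulate_network(sizes, mu_intra=0.6, s_intra=0.1, mu_inter=0.1, s_inter=0.2):
    """Block-structured weighted adjacency matrix with the given cluster sizes."""
    N = sum(sizes)
    W = trunc_gauss(mu_inter, s_inter, (N, N))        # inter-cluster background
    start = 0
    for n in sizes:                                   # overwrite diagonal blocks
        W[start:start + n, start:start + n] = trunc_gauss(mu_intra, s_intra, (n, n))
        start += n
    W = np.triu(W, 1)
    return W + W.T                                    # symmetric, zero diagonal

# e.g. the unequal-size setting with clusters of 35, 14 and 14 nodes
W = simulate_network([35, 14, 14])
```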
In order to evaluate the performance of the proposed quality measure and the standard modularity metric, all four possible partitionings of a 3-cluster network are considered, i.e. the three 2-cluster structures ($c_1$ and $c_{21}$ merged vs. $c_{22}$; $c_1$ and $c_{22}$ merged vs. $c_{21}$; $c_{21}$ and $c_{22}$ merged vs. $c_1$) and the true 3-cluster structure. As seen in Fig. 2.2, the proposed quality metric always attains its highest value for the true community structure, thus successfully identifying the correct community structure for each test condition, while modularity tends to merge small clusters.

Figure 2.2: Average of the quality metric ($U$) and the modularity measure ($Q$) corresponding to the defined clustering structures in a 3-cluster network, plotted against the ratio of the number of nodes in cluster 1 to the number of nodes in the rest of the network, over 100 trials. (a) Quality; (b) Modularity.

2.5.2 Evaluation of FCCA for Varying Inter-cluster Strength

Simulated networks consisting of 64 nodes and composed of 4 equal-sized communities were generated 100 times to evaluate the performance of the FCCA for varying inter-cluster edge strength, i.e. varying noise levels in the community structure. The weights of the intra-cluster edges were drawn from a Gaussian distribution truncated to $[0, 1]$ with mean $\mu_{intra} = 0.8$ and standard deviation $\sigma_{intra} = 0.1$. The weights of the inter-cluster edges were drawn from a truncated Gaussian distribution with standard deviation $\sigma_{inter} = 0.2$; to evaluate the algorithms under different inter-cluster connectivity strengths, $\mu_{inter}$ was increased gradually from 0.4 to 0.7. Using the proposed Fiedler Consensus Algorithm, the averaging method, the voting method, and HGPA, the networks were evaluated for $2 \le k \le 10$ communities. The best community structure was selected using the proposed quality measure, $U$. The accuracy of the resulting structure was quantified by Cohen's Kappa statistic, which computes the agreement with the true community structure. Overall success was determined by computing the average Kappa value over 100 trials.

As shown in Table 2.2, the FCCA and the averaging method are more robust than the other algorithms to varying inter-cluster edge strengths; these two algorithms accurately identified almost all clusters in all cases for $\mu_{inter} \le 0.7$. Moreover, the FCCA is computationally more efficient than the voting and HGPA approaches.

Table 2.2: Average Cohen's Kappa and average standard error for identifying community structure in simulated 4-community networks with varying inter-cluster strength

  Method                µ_inter = 0.4          0.5                0.6                0.7
  FCCA    κ             0.9992 ± 1.4352e-05    1 ± 0              1 ± 0              0.9690 ± 0.0040
          time (sec.)   1.1106 ± 0.1415        1.2224 ± 0.2651    1.2709 ± 0.1516    1.4901 ± 0.4153
  Ave.    κ             1 ± 0                  1 ± 0              1 ± 0              0.9921 ± 0.0014
          time (sec.)   0.1763 ± 0.0219        0.1991 ± 0.0299    0.2131 ± 0.0223    0.2100 ± 0.0347
  Voting  κ             0.9968 ± 1.9523e-04    1 ± 0              0.9905 ± 0.0011    0.8938 ± 0.0073
          time (sec.)   9.1640 ± 1.2854        11.1266 ± 1.0163   14.3203 ± 2.1477   41.6633 ± 4.2908
  HGPA    κ             0.8385 ± 0.0092        0.8576 ± 0.0080    0.9276 ± 0.0050    0.8101 ± 0.0107
          time (sec.)   1.6020 ± 0.3354        1.2207 ± 0.1861    1.2711 ± 0.1441    2.0018 ± 0.4222

2.5.3 Robustness to Outlier Graphs

In many real-world settings, the community structure across a population may not always be the same, i.e. there may be outliers in the group. In this subsection, we generate simulations to evaluate the performance of the different clustering methods in the presence of outlier graphs. Networks consisting of 64 nodes were generated 100 times using two different community structures. The majority of the networks had a 3-cluster structure in which nodes 1 to 16 formed the first cluster, nodes 17 to 48 formed the second cluster, and the last 16 nodes formed the third cluster. The elements of the connectivity matrices ranged between 0 and 1. The weights of the intra-cluster edges were drawn from a truncated Gaussian distribution with mean $\mu_{intra} = 0.6$ and standard deviation $\sigma_{intra} = 0.1$; similarly, the inter-cluster edge weights were drawn from a truncated Gaussian distribution with mean $\mu_{inter} = 0.3$ and standard deviation $\sigma_{inter} = 0.2$. The outlier networks were constructed to have 2 communities, in which nodes 1 to 32 formed the first cluster and nodes 33 to 64 formed the second cluster. Their edge weights were drawn from truncated Gaussian distributions, with the intra-cluster edges having mean $\mu_{intra} = 0.8$ and standard deviation $\sigma_{intra} = 0.1$, while the inter-cluster edge weights had mean $\mu_{inter} = 0.1$ and standard deviation $\sigma_{inter} = 0.2$. To evaluate the robustness of the algorithms to outliers, the ratio of outlier networks to the whole group was increased gradually from 15% to 30%. As in the previous section, all networks were evaluated for $2 \le k \le 10$ communities and the best community structure was selected by choosing the $k$ which maximizes the quality measure $U$. Accuracy of the different clustering algorithms was quantified by computing the average Kappa value across 100 trials. As shown in Table 2.3, the FCCA and the voting approach are more accurate at identifying the true community structure in the presence of outlier graphs. Although the voting approach is more robust against outliers, its high computational complexity makes the FCCA a useful alternative.

Table 2.3: Average Cohen's Kappa and average standard error for identifying communities in a group of networks with outliers

  Method                Outlier rate = 15%   20%                25%                30%
  FCCA    κ             1 ± 0                1 ± 0              1 ± 0              0.7143 ± 0.0118
          time (sec.)   0.9672 ± 0.1819      0.9168 ± 0.1163    0.9429 ± 0.1245    1.0095 ± 0.1127
  Ave.    κ             1 ± 0                0.7143 ± 0.0118    0.7143 ± 0.0118    0.7143 ± 0.0118
          time (sec.)   0.1716 ± 0.0313      0.1739 ± 0.0254    0.1771 ± 0.0281    0.1627 ± 0.0206
  Voting  κ             1 ± 0                1 ± 0              1 ± 0              0.9304 ± 0.0031
          time (sec.)   9.0646 ± 0.9299      8.9715 ± 0.5380    9.0036 ± 0.7468    10.4778 ± 2.7145
  HGPA    κ             0.5177 ± 0.0149      0.5164 ± 0.0150    0.4848 ± 0.0154    0.4554 ± 0.0158
          time (sec.)   1.6271 ± 0.1875      1.5554 ± 0.1626    1.2809 ± 0.1719    1.2043 ± 0.0924

2.5.4 Detecting Overlapping Communities

Communities in a network may not always be distinctly separable from each other and may have an overlapping structure. In order to evaluate the performance of the different clustering algorithms in the case of overlapping communities, networks consisting of 64 nodes and composed of 4 equal-sized communities were generated 100 times. The weights of the intra-cluster edges were drawn from a truncated Gaussian distribution with mean $\mu_{intra} = 0.8$ and standard deviation $\sigma_{intra} = 0.1$; similarly, the inter-cluster edge weights were drawn from a truncated Gaussian distribution with mean $\mu_{inter} = 0.1$ and standard deviation $\sigma_{inter} = 0.2$. To provide overlap between communities, a subset of the inter-cluster edges between each pair of communities was drawn from the same distribution as the intra-cluster edges. The number of strong inter-cluster edges was increased gradually from 75 to 125. As in the previous simulations, all networks were evaluated for $2 \le k \le 10$ communities and the optimal $k$ was selected by maximizing the quality measure. Overall success was determined by computing the average Kappa value over 100 trials. As seen in Table 2.4, the Fiedler Consensus Clustering Algorithm is more robust to overlapping communities in the network than the other methods.

Table 2.4: Average Cohen's Kappa and average standard error for identifying overlapping clusters in simulated networks

  Method                # of inter-cluster edges = 75    100                125
  FCCA    κ             0.9964 ± 3.9437e-04              0.9887 ± 0.0013    0.9539 ± 0.0048
          time (sec.)   1.3631 ± 0.3329                  1.2449 ± 0.2767    2.5533 ± 1.0191
  Ave.    κ             0.9754 ± 0.0025                  0.9424 ± 0.0052    0.8825 ± 0.0080
          time (sec.)   0.2186 ± 0.0560                  0.2214 ± 0.0577    0.3817 ± 0.1246
  Voting  κ             0.9938 ± 6.1236e-04              0.9694 ± 0.0022    0.9444 ± 0.0047
          time (sec.)   11.8611 ± 1.9694                 15.0078 ± 2.8988   26.5411 ± 6.1363
  HGPA    κ             0.8682 ± 0.0074                  0.8374 ± 0.0094    0.7732 ± 0.0117
          time (sec.)   1.5070 ± 0.2811                  1.5480 ± 0.2081    1.7893 ± 0.2913

2.5.5 Community Structure of the Brain During Error-Related Negativity

The time-varying phase synchrony measure is applied to a set of EEG data containing the error-related negativity (ERN) [173, 174]. The ERN is an event-related potential that occurs following performance errors in a speeded reaction time task. Previously reported EEG data [175] from 63 channels (10/20 system) were utilized. This included 91 undergraduate students (34 male) from the University of Minnesota (one of the original 92 participants was dropped due to artifacts rendering computation of the time-frequency phase synchrony (TFPS) values problematic). Full methodological details of the recording are available in the previous report [175]. The task was a common speeded-response letter (H/S) flanker task, and error and correct response-locked trials from each subject were utilized. A random subset of correct trials was selected to equate the number of error and correct trials for each participant. The EEG data were pre-processed using spherical spline current source density (CSD) waveforms to sharpen event-related potential (ERP) scalp topographies and reduce volume conduction [176]. The CSD has fewer assumptions than many inverse transforms, attenuates volume conduction, and represents independent sources near the cortical surface [177]. Our previous work indicates that there is increased phase synchrony associated with the ERN in the theta frequency band (4-7 Hz) and the ERN time window (25-75 ms) for Error responses compared to Correct responses [148]. For each subject and response type, the pairwise average phase locking value within the ERN time window and theta frequency band was computed across trials using the PLV measure defined in Section 2.2.1, yielding a $63 \times 63$ connectivity matrix indicating the average synchrony between brain regions.
First, the proposed Fiedler consensus clustering approach and the averaging and voting methods were applied to the sets of Error and Correct data in order to identify an optimal community structure. Hierarchical decompositions of the networks were evaluated for $2 \le k \le 15$ and the optimal $k$ was selected by maximizing the quality metric $U$. The clustering results can be seen in Figs. 2.3 (FCCA) and 2.4 (averaging and voting methods). For the FCCA, Error responses were best represented by a structure composed of 10 communities (Fig. 2.3a) and Correct responses by 7 communities (Fig. 2.3b), while the averaging and voting methods identify 9 communities for Error responses and 5 communities for Correct responses (Fig. 2.4). As can be seen, the averaging method yields two large clusters for both error and correct conditions, unable to discriminate between error and correct responses or to resolve the different subnetworks. Similarly, the voting method yields one large cluster for both error and correct responses with a few small frontal and lateral subnetworks. The proposed method, on the other hand, provides a more detailed view of the network, separating the medial and lateral clusters from each other.

Figure 2.3: Cluster structures obtained by FCCA: a) Error (k = 10); b) Correct (k = 7).

Figure 2.4: Cluster structures obtained by a) Averaging, Error (k = 9) and Correct (k = 5); b) Voting, Error (k = 9) and Correct (k = 5).

The clusters obtained from the three consensus clustering approaches are evaluated on each subject's network to quantify the agreement between the common cluster structure and each subject's connectivity graph. This agreement is quantified through the quality metric, $U$. As seen in Table 2.5, the community structure obtained by the FCCA is more appropriate for each subject and yields a statistically significantly higher score for both the Error and Correct conditions ($p < 0.025$).

Table 2.5: Mean and standard deviation of the quality metric ($U$) computed to quantify the consistency between the group's community structure and individual subjects' community structures, for each response type and the three consensus clustering methods

  Method       Error               Correct
  Averaging    0.0483 ± 0.0031     0.0454 ± 0.0050
  Voting       0.0524 ± 0.0040     0.0426 ± 0.0054
  FCCA         0.0567 ± 0.0054     0.0504 ± 0.0042

Table 2.6: Computation time (sec.) for community structures obtained by the FCCA, Averaging and Voting methods

  Method       Error        Correct
  Averaging    2.7104       2.3326
  Voting       198.1623     212.5764
  FCCA         11.8576      13.7596

From Fig. 2.3, we can see that the clusters identified by the FCCA are more segregated and differentiated for Error relative to Correct responses. For example, in the Correct condition one large cluster (1) accounts for the majority of prefrontal and motor regions, with a small cluster (5) consistent with separable activity in left-PFC regions. For Errors, on the other hand, separable clusters are apparent relative to left (1) and right (2) motor areas, and left (4, 8) and right (5) lateral-PFC regions (consistent with a priori hypotheses). Interestingly, one cluster in the Error condition (6) and one in the Correct condition (4) center on medial-frontal sites including FCz and Cz, consistent with the time-domain ERN and correct-related negativity (CRN) component topographies, respectively. Activity in parietal-occipital regions was characterized by similar clusters for both the Correct (2) and Error conditions (3, 7).

An overall statistical assessment of the inter-modular relationships revealed that the grand mean was significantly greater for the Error relative to the Correct condition ($t(90) = 2.16$, $p < .033$), while the same comparison for intra-modular pairs was not significant ($t(90) < .5$). This provides support for the inference of increased functional connectivity related to error processing relative to correct responses. Next, to provide detailed information about these relationships, average intra-modular and inter-modular synchronies were computed for the Correct and Error communities (presented in Fig. 2.5). These maps illustrate the amount of integration between different clusters. To provide a statistical assessment of the inter-modular relationships within the Error and Correct conditions, t-tests were performed for each inter-modular bivariate pair relative to the grand mean of the inter-modular pairs. The resulting t-values and Bonferroni-corrected p-values are presented in Figs. 2.5b and 2.5c, respectively.

While it was not appropriate to directly compare Error-Correct differences between individual clusters (as they were derived from separate cluster analyses), several observations about the individual clusters within the Error or Correct conditions provide interesting information at this level of analysis. First, the motor-related clusters in the Error condition (1, 2) were significantly more related to each other than the across-cluster average. Next, the occipital clusters in both the Error (3, 7) and Correct (2) conditions evidenced decreases relative to the mean, suggesting a decrease in connectivity with visual processing areas during the ERN and CRN. For lateral-PFC regions, the smaller left-lateralized clusters (8 and 5, respectively, for the error and correct conditions) were associated with significantly increased connectivity with prefrontal areas and decreased connectivity with parietal-occipital areas. Another a priori effect of interest was that both the Error and Correct medial-frontal clusters (6 and 4, respectively) showed significant increases with the left lateral-PFC clusters (4 and 5, respectively).

Fig. 2.6 illustrates the hierarchical structure obtained from the FCCA for both response types. For the error response, the initial partition yields one large frontal and one parietal cluster; further decomposition provides the detailed construction of the frontal cluster. Similarly, for the correct response, the initial partition yields one frontal and one parietal partition, with the subsequent partitions decomposing the frontal cluster into smaller sub-networks.
2.6 Conclusions

In this chapter, we proposed a new graph-theoretic community detection approach to provide a detailed view of the organizational structure underlying functional brain connectivity networks obtained from EEG recordings across multiple subjects. The main contributions include the hierarchical implementation of Fiedler vector based graph clustering, the introduction of an accurate and computationally efficient consensus clustering approach, the introduction of a new information-theoretic cluster quality measure, $U$, and a detailed study of the brain network involved in error processing.

First, the well-known Fiedler vector based graph bi-partitioning method was implemented to obtain a hierarchical decomposition of the functional connectivity networks. This hierarchical implementation is supported by previous work that suggests a hierarchical structure for functional connectivity networks [138-140]. Second, the proposed partitioning approach was modified to account for multiple subjects by first obtaining an initial bipartition of each subject's connectivity network and then iteratively partitioning the co-occurrence matrix across subjects. As shown through simulations, the FCCA is computationally more efficient than voting and is more accurate than averaging in the case of outliers and overlapping community structures. Moreover, the application of the FCCA to EEG data produced clusters consistent with published work [178, 179], whereas the voting and averaging methods failed to partition the frontal cluster into physiologically meaningful lateral and medial frontal communities. Finally, a new cluster quality measure, $U$, based on optimizing the tradeoff between maximizing the divergence between clusters and minimizing the entropy of individual clusters, was introduced to select the optimal number of clusters. This measure provides an alternative to the standard modularity measure, which is known to fail for unequal cluster sizes and weighted networks [180]. Future work will consider exploring single [181] and distributed dipole [182] source solutions to the inverse problem for extending this approach to the source domain.

Figure 2.5: (a) Average phase synchrony between clusters for the Error response (k = 10) and the Correct response (k = 7); (b) t-values for each inter-modular bivariate pair relative to the grand mean of the inter-modular pairs for the ERN and CRN conditions; (c) corresponding p-values.

Figure 2.6: Hierarchical structures obtained by FCCA for a) Error and b) Correct responses.

Chapter 3

Recursive Robust Low-rank + Sparse Structure Learning for Dynamic Tensors

3.1 Introduction

Advanced functional imaging techniques such as EEG and functional magnetic resonance imaging (fMRI) have enabled the study of the neuronal mechanisms underlying cognition in detail. These studies have revealed that transient synchronization between spatially distributed neural populations, referred to as functional connectivity (FC), underlies human cognition, perception and emotion [114, 183, 184].
Recently, tools from graph theory have been employed to analyze the functional connectivity of the brain by associating nodes with distinct brain regions and edges with pairwise interactions between them [105, 106]. Most of the current work on functional connectivity network analysis focuses on static networks, where the networks correspond to average activity over a time and frequency window of interest. However, recent studies have shown that functional connectivity networks change dynamically on short time scales and exhibit task-related patterns [107-111]. This continuous formation and destruction of functional connectivity also controls the emergence of unified neural processes in cognition, perception and memory [111, 185, 186].

To better understand brain dynamics, early studies focused on extracting graph-theoretic measures across time for a time-varying analysis of FC graphs [185, 187, 188]. For example, Valencia et al. [189] showed how the small-world structure of FC networks evolves during a visual stimulus. Similarly, Fallani et al. [190] presented a graph-theoretical approach for FC networks to identify persistent edges during a motor task. Recently, dynamic FC network (dFCN) tracking approaches have been combined with network state estimation techniques. Allen et al. [191] assume that the FC network at each time point is in a distinct network state, where the network states are determined through k-means clustering of dFCNs across time and subjects from resting-state fMRI data. Similarly, Dimitriadis et al. [192] introduced FC microstates inspired by the EEG microstate literature [193]. In [194], the network states are obtained through clustering and Markov modelling to identify both FC-states and the transitions between them. Similarly, in [195], the network states are identified by evolutionary clustering applied to FC edge time series.

An alternative approach to dFCN analysis assumes that network states are made up of multiple building blocks. Leonardi et al. [196] propose a principal component analysis (PCA) based approach to reveal intrinsic FC patterns, termed eigenconnectivities, and describe the FC matrices as weighted sums of eigenconnectivities. Similarly, different PCA-based approaches have been used to identify the dynamics of both resting-state [197] and task-based EEG [198]. More recently, [199] compared clustering and SVD based approaches for identifying task-related and resting-state FC dynamics and showed that FC patterns obtained from an SVD-based approach better represent the task-based dynamics, while patterns obtained from a clustering-based approach are more suitable for identifying resting-state FC dynamics.

In addition to approaches focused on unraveling the network states of dFCNs, recently introduced methods have also focused on the detection of change points where the network structure changes considerably. Cribben et al. [200] presented a data-driven technique which first detects the temporal change points by a greedy partitioning scheme and then estimates a connectivity graph for the FC patterns in each temporal partition. Zhang et al. [201] presented a dynamic Bayesian variable partition model which simultaneously identifies the state transitions and learns significant FC patterns in each state. Similarly, Ou et al. [202] proposed a Bayesian model to detect change points from functional brain interactions and evaluated the network structure using nonnegative matrix factorization within each temporal segment.
However, all of these approaches make use of the time-series data directly and are not based on dFCNs constructed from the data. Moreover, these approaches assume a multivariate Gaussian model for the underlying time series and require individual analysis of each subject before inferring the group's network structure. Finally, these methods are computationally expensive, as they rely either on greedy search or on probabilistic metrics.

In this chapter, we propose a tensor-based representation of dFCNs for tracking and summarizing the functional connectivity across time and subjects from task-based EEG data. First, we introduce a tensor subspace analysis method for robust low-rank + sparse structure recovery. This is motivated by the fact that FCNs are known to have a modular structure, which translates to a low-rank connectivity matrix [151]. In the case of dFCNs, these low-rank structures can be assumed to change slowly in time, similar to EEG microstates [193], which can be described as a slowly changing subspace. Conventional subspace analysis methods such as PCA and SVD cannot deal with higher-order data, and existing tensor decomposition methods such as the higher-order SVD (HoSVD) and parallel factor analysis (PARAFAC) are not robust to sparse outliers or noise in the data [44, 73, 203, 204]. The proposed recursive framework separates the low-rank part of the data from sparse noise components by identifying change points and updating the estimates of the low-rank subspaces. The identified change points, corresponding to subspace changes along the connectivity mode of the tensor, are used to define time intervals of interest. The low-rank tensor within each time interval is then summarized through a recently introduced multiple-network clustering approach known as the Fiedler Consensus Clustering Approach (FCCA) [205]. Finally, the proposed framework is applied to dFCNs constructed from EEG data collected during a study of error-related negativity.

The proposed framework offers three major contributions to the literature on both online tensor subspace tracking and dFCN analysis. First, unlike most of the current work, which focuses on the dynamics of resting-state fMRI networks, we focus on the dynamics of task-related networks from EEG data. As EEG data has high temporal resolution and the data considered in this chapter is response-locked, determining the actual time points at which significant changes to the network structure occur is highly relevant. The proposed framework offers a way to determine time intervals during which the FCN has a common quasi-stationary pattern across time and subjects, i.e. a slowly changing subspace structure similar to microstates [193]. Second, most of the current work reduces the high dimensionality of the dFCNs by vectorizing the connectivity matrices into long vectors before identifying the network states; this does not preserve the topological structure of the network. In the proposed work, we address this problem by keeping the network structure of the FCNs intact through tensor representations, which also allows us to capture the variability common to all subjects across time. Finally, the proposed low-rank plus sparse structure learning algorithm for tensors offers a novel way of recovering a low-rank subspace estimate along each mode of the data, where the rank is defined through the Tucker rank. This rank definition is directly related to the modular structure of FCNs.
The proposed approach separates the low-rank part of the data from sparse noise components along time and then summarizes the network within each time interval by clustering the extracted low-rank networks within that interval. This yields better structural information, as it is equivalent to denoising the networks.

3.2 Background

3.2.1 Robust Principal Component Analysis

High dimensional data mostly lies in a lower dimensional subspace, and principal component analysis (PCA) is the most widely used technique for identifying this lower dimensional subspace. Recently, PCA has been used for identifying network states from dynamic functional connectivity networks [196, 198]. However, it is known that PCA suffers from non-Gaussian corruptions and may find a completely wrong principal subspace in the presence of even a few outliers. These drawbacks have forced researchers to develop more robust subspace estimation techniques, which is a significantly more difficult problem than standard PCA [206, 207].

Since the recent work by Candes et al. and Chandrasekharan et al. [208, 209], the general problem of separating a sparse matrix and a low-rank matrix from their sum has received a lot of attention. The final goal usually is to either find the column span of the low-rank matrix or the support of the sparse one. This is now commonly referred to as the "low-rank + sparse recovery" problem. There has been a large amount of recent work on batch methods for low-rank + sparse recovery and its various extensions, including Principal Component Pursuit, Outlier Pursuit and Low-Leverage Decomposition [49-62, 209]. One of the well-known robust PCA methods is principal component pursuit (PCP), which assumes that the data matrix $M$ is the sum of a low-rank part $L$ and a sparse noise or outlier part $S$ [208]:

$$ M = L + S. \qquad (3.1) $$

It was shown that $L$ can be efficiently estimated by solving the following optimization problem:

$$ \min_{L, S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad L + S = M, \qquad (3.2) $$

where $\lambda$ is the regularization parameter and $\|L\|_* = \sum_{i=1}^{r} \sigma_i(L)$ denotes the nuclear norm, with $\sigma_i$ the first $r$ singular values of the matrix $L$. This problem has been solved using convex optimization approaches, e.g. the augmented Lagrange multiplier algorithm [208] and the accelerated proximal gradient approach [210]. In order to reduce the computational complexity and to achieve online subspace tracking, various approaches have also been proposed to solve the RPCA problem, e.g. GRASTA, PETRELS and REPROCS [63, 66, 67, 70, 204, 211-214]. These approaches first identify the subspace in which the low-rank data lies, and then recover incoming low-rank measurement vectors from missing entries by exploiting this subspace information.
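As a concrete reference point, a minimal sketch of PCP solved with the inexact augmented Lagrange multiplier (ALM) method is given below. The step size mu, the stopping tolerance, and the default lambda = 1/sqrt(max(m, n)) follow common conventions in the RPCA literature and are assumptions here, not the exact solver settings used in the cited works.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def pcp(M, lam=None, mu=None, max_iter=500, tol=1e-7):
    """Principal Component Pursuit via inexact ALM: M = L + S (Eqs. 3.1-3.2)."""
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or m * n / (4 * np.abs(M).sum())       # common heuristic step size
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)              # nuclear-norm step
        S = soft_threshold(M - L + Y / mu, lam / mu)   # l1 step
        R = M - L - S                                  # constraint residual
        Y += mu * R                                    # dual update
        if np.linalg.norm(R) / max(np.linalg.norm(M), 1e-12) < tol:
            break
    return L, S
```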
A recently introduced algorithm, REPROCS, recursively separates the low-rank part from sparse noise as follows. Let $M_t \in \mathbb{R}^{n \times 1}$ be a time series of measurement vectors written as $M_t = L_t + S_t$, where $L_t$ is the low-rank part, which lies in a subspace spanned by $P_t$, and $S_t$ is the sparse noise vector. Let $\hat{P}_t$ be an accurate estimate of the $r$-dimensional basis $P_t$ at time $t$ and let $\hat{P}_{t,\perp}$ be the orthogonal complement of $\hat{P}_t$. Let $\alpha_t := \hat{P}_t^\top L_t$ be the projection of $L_t$ onto $\hat{P}_t$ and $\beta_t := (\hat{P}_{t,\perp})^\top L_t$ be the projection of $L_t$ onto $\hat{P}_{t,\perp}$. Then $M_t$ can be rewritten as $M_t = \hat{P}_t \alpha_t + \hat{P}_{t,\perp} \beta_t + S_t$. REPROCS first projects the measurement vector onto $\hat{P}_{t,\perp}$ to approximately nullify the low-rank part $L_t$: defining $y_t := (\hat{P}_{t,\perp})^\top M_t$, we have $y_t = (\hat{P}_{t,\perp})^\top S_t + \beta_t$, and the dimension of the projected data vector reduces to $n - r$. Since projecting $M_t$ onto $\hat{P}_{t,\perp}$ nullifies the contribution of $L_t$, $\beta_t$ can be interpreted as small noise. Therefore, solving for the $n$-dimensional $S_t$ from the $(n-r)$-dimensional $y_t$ becomes a traditional sparse recovery problem. Once $\hat{S}_t$ is recovered, $L_t$ can be estimated as $\hat{L}_t = M_t - \hat{S}_t$. The performance of this algorithm depends strongly on the correctness of the estimated low-rank subspace and on the slowly changing subspace assumption [67, 204, 213, 214]. However, this method is limited to vector-type measurements and cannot be applied directly to higher-order datasets such as tensors. In this chapter, we present an extension of REPROCS to tensor-type data.

Recently, tensor-based approaches have been proposed to track dynamic tensor subspaces, such as dynamic tensor analysis, streaming tensor analysis and window-based tensor analysis [71, 72]. However, these approaches provide computationally efficient frameworks for the analysis of streaming datasets by recursively updating subspace information, and do not address the robustness of the subspace estimates. Goldfarb and Qin extended robust PCA to tensors (HoRPCA) by solving the low-rank + sparse recovery problem for general higher-order tensors [73]. However, this method is computationally expensive and does not update the subspaces online, i.e. it would not be useful for tracking subspace changes across time. Li et al. [74] presented a robust subspace learning algorithm (RTSL) that incrementally updates the tensor subspace. Moreover, Nion et al. [75] proposed two adaptive approaches to track the PARAFAC decomposition of 3-way tensors; these approaches update the PARAFAC decomposition at every time point based on simultaneous diagonalization or minimization of a weighted least squares criterion. More recently, Mardani et al. [76] proposed an online subspace learning method based on nuclear norm minimization and extended this approach from matrices to higher-order datasets. The extension of this algorithm to tensors takes advantage of the PARAFAC model to minimize the tensor rank and considers temporal information as one of the tensor modes. Similar to [76], OLSTEC, proposed in [77], also tracks the subspace of partially observed higher-order data using the PARAFAC decomposition.

3.2.2 Tensor Algebra & Tensor Decompositions

An order-$N$ tensor is denoted as $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, where $x_{i_1, i_2, \ldots, i_N}$ corresponds to the $(i_1, i_2, \ldots, i_N)$th element of the tensor $\mathcal{X}$. Vectors obtained by fixing all indices of the tensor except the one that corresponds to the $n$th mode are called mode-$n$ fibers.

Mode-n product: The mode-$n$ product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \ldots \times I_n \times \ldots \times I_N}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted as $\mathcal{Y} = \mathcal{X} \times_n U$, with

$$ (\mathcal{Y})_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n = 1}^{I_n} x_{i_1, \ldots, i_n, \ldots, i_N} \, u_{j, i_n}, $$

and is of size $I_1 \times \ldots \times I_{n-1} \times J \times I_{n+1} \times \ldots \times I_N$.

Tensor matricization: The process of reordering the elements of a tensor into a matrix is known as matricization or unfolding. The mode-$n$ matricization of a tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times \ldots \times I_n \times \ldots \times I_N}$ is denoted as $Y_{(n)} \in \mathbb{R}^{I_n \times \prod_{i \in \{1, \ldots, N\} \setminus \{n\}} I_i}$ and is obtained by arranging the mode-$n$ fibers to be the columns of the resulting matrix. Unfolding the tensor $\mathcal{Y} = \mathcal{X} \times_1 U_1 \times_2 U_2 \ldots \times_N U_N$ along mode $n$ is equivalent to

$$ Y_{(n)} = U_n X_{(n)} (U_N \otimes \ldots \otimes U_{n+1} \otimes U_{n-1} \otimes \ldots \otimes U_1)^\top, $$

where $\otimes$ is the matrix Kronecker product.

The n-rank: Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ be an $N$-way tensor; the $n$-rank of $\mathcal{X}$ is the collection of the ranks of its mode matricizations $X_{(n)}$ and is denoted as:

$$ \text{rank}_n(\mathcal{X}) = \left\{ \text{rank}(X_{(1)}), \text{rank}(X_{(2)}), \ldots, \text{rank}(X_{(N)}) \right\}, \qquad (3.3) $$

where $n = 1, 2, \ldots, N$.
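Since both operations are used repeatedly in what follows, here is a minimal NumPy sketch of mode-n matricization and the mode-n product; the helper names unfold, fold and mode_n_product are ours, and libraries such as TensorLy provide equivalent routines.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: arrange mode-n fibers as columns of a matrix."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild the tensor of the given shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(X, U, n):
    """Mode-n product X x_n U, implemented via unfolding: Y_(n) = U X_(n)."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, tuple(shape))

# Example: multiplying a 3-way tensor along its second mode
X = np.random.rand(4, 5, 6)
U = np.random.rand(3, 5)
Y = mode_n_product(X, U, 1)      # Y has shape (4, 3, 6)
```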
Any tensor X RI1 I2 ...IN can be decomposed as mode products of a core tensor S 51 P P RI1I2...IN and N mode matrices Upnq P RInIn [215] [35].  S 1 Up1q 2 Up2q... N UpN q, X X  °I1 i1 1 ... °IN iN 1 si1 ,i2 ,...,iN ci1 pUp1qq  ...  ciN pUpN qq, (3.4) where the matrix Upnq contains the left singular vectors of Xpnq and cn pq is nth column vector and S is obtained by S 3.2.3  X 1 UJp1q 2 UJp2q... N UJpN q. Time-Varying Measure of Phase Synchrony In this chapter, pairwise functional connectivity will be quantified using a recently introduced timefrequency phase estimation method based on Reduced Interference Rihaczek distribution (RIDRihaczek) [148, 149]. Phase synchronization within different frequency bands across the brain has been shown to be a plausible mechanism explaining neuronal integration [114, 144]. Reduced Interference Rihaczek Distribution (RID-Rihaczek) is given by: C pt, ω q   ¼ exp pθτ q2 σ  θτ exp j Apθ, τ qej pθt τ ω q dτ dθ, 2 (3.5) where expppθτ q2 {σ q is the Choi-Williams kernel used to filter out the cross-terms, Apθ, τ q ³  xpu τ2 qx pu  τ2 qejθu du is the ambiguity function of the signal xptq and exppjθτ {2q is the kernel corresponding to the Rihaczek distribution [150]. The phase difference between two signals, xi and xj , based on this complex distribution is computed as Φij pt, ω q  arg  Ci pt, ω qCj pt, ω q |Cipt, ωq||Cj pt, ωq| 52  , (3.6) where Ci pt, ω q and Cj pt, ω q refer to the complex energy distributions of the two signals xi ptq and xj ptq respectively. A synchrony measure quantifying the intertrial variability of the phase differences, phase locking value (PLV), is defined as P LVi,j pt, ω q   1   κ κ̧ k 1    k exp jΦij t, ω  ,  p p qq (3.7) where κ is the number of trials and Φkij pt, ω q is the time-varying phase estimate between two signals recorded at electrodes i and j for the kth trial. If the phase difference varies little across the trials, PLV is close to 1. Compared to the existing synchrony measures, RID-Rihaczek based phase synchrony measure is more robust to noise, and has uniformly high time-frequency resolution with less bias [148]. In this chapter, we construct the functional connectivity matrices at each time point and for each subject as: Gi,j ptq  ωb ¸ 1 ωb  ωa ωω a P LVi,j pt, ω q, (3.8) where the entries of the connectivity networks are computed as the average synchrony between pairs of nodes at time t averaged over a frequency band of interest which is the theta band. Once the individual connectivity matrices are constructed, a three-way tensor Xt P RN N S is formed at each time point across all subjects as: Xt pi, j, sq  Gsi,j ptq where i, j (3.9) P t1, 2, .., N u correspond to the nodes or brain regions in the network and s P t1, 2, .., S u is the subject. 53 3.2.4 Consensus Clustering and Fiedler Consensus Clustering Approach In neuroscience problems, it is desirable to find a common network structure across subjects performing the same task or in the same population [216, 217]. Recently, we have introduced Fiedler Consensus Clustering Algorithm (FCCA) to address this issue to obtain a common community structure across multiple weighted graphs, where the graphs correspond to individual functional connectivity networks discussed in Section 3.2.3. [205]. One of the most common ways to partition a graph is spectral clustering. 
3.2.4 Consensus Clustering and the Fiedler Consensus Clustering Approach

In neuroscience problems, it is desirable to find a common network structure across subjects performing the same task or belonging to the same population [216, 217]. Recently, we introduced the Fiedler Consensus Clustering Algorithm (FCCA) to address this issue, i.e. to obtain a common community structure across multiple weighted graphs, where the graphs correspond to the individual functional connectivity networks discussed in Section 3.2.3 [205].

One of the most common ways to partition a graph is spectral clustering. Spectral clustering generally uses the eigenvectors of the Laplacian matrix computed as $L = D - A$, where $A$ is the adjacency matrix of the graph and $D$ is the degree matrix containing the degrees of the nodes along its diagonal, with $D(i, i) = \sum_{j=1, j \neq i}^{N} A(i, j)$. The eigenvector corresponding to the second smallest non-zero eigenvalue of the Laplacian matrix provides the optimal minimal cut of a graph, and this eigenvector is referred to as the Fiedler vector [158].

In FCCA, the original connectivity matrices are bi-partitioned into two clusters using the Fiedler partitioning method. This results in a cluster matrix for the $r$th network, $T_r$, such that

$$ T_r(i, j) = \begin{cases} 1 & \text{if nodes } v_i, v_j \text{ are in the same cluster} \\ 0 & \text{otherwise,} \end{cases} \qquad (3.10) $$

where $r \in \{1, 2, \ldots, m\}$ and $m$ is the number of networks. In order to find the common community structure across multiple graphs, we introduce a co-occurrence matrix $W$, where $W(i, j) = \frac{1}{m}\sum_{r=1}^{m} T_r(i, j)$ and $W(i, j) \in [0, 1]$. $W(i, j)$ is the probability that a pair of nodes are members of the same cluster across the multiple graphs. The adjacency matrix reflects the strength of a direct relationship between a node pair, whereas $W$ reflects the likelihood that a pair of nodes are in the same cluster across all subjects.

The Laplacian matrix of $W$ is computed and its Fiedler vector is found to form a bi-partition of $W$ into a community structure composed of clusters $c_1$ and $c_{-1}$. Since $W$ represents the probability that a node pair should be clustered together, the Fiedler partition of $W$ represents the community structure common to all graphs. The initial partition set, $C = \{c_1, c_{-1}\}$, contains 2 clusters, but if $k > 2$ is desired, the process can be repeated by selecting a cluster in $C$ to partition. In this case, $c_1$ or $c_{-1}$ is selected based on the quality score of each cluster (see [205] for more details on the particular quality score used in FCCA). Next, sub-matrices are extracted from the original connectivity matrices such that they only contain the nodes of the chosen cluster. These sub-matrices are used to derive a new sub-co-occurrence matrix, $W_y$, where $y = 1$ if cluster $c_1$ was selected and $y = -1$ if cluster $c_{-1}$ was selected. The Fiedler partitioning is performed on the selected cluster to obtain two new clusters, $c_1^y$ and $c_{-1}^y$. The final cluster set, $C = \{c_{-y}, c_1^y, c_{-1}^y\}$, is a concatenation of the two new clusters with the original cluster that was not chosen for bi-partitioning. This method can be iterated until an optimal quality score or a desired number of clusters is achieved.

3.3 Higher-order Recursive Low-Rank + Sparse Structure Learning (Ho-RLSL)

3.3.1 Problem Statement

In this chapter, we represent dynamic functional connectivity networks across subjects as a three-way dynamic tensor $\mathcal{M}_t \in \mathbb{R}^{N_1 \times N_2 \times N_3}$. We assume that this tensor has a low-rank structure $\mathcal{L}_t$, with $\text{rank}(L_t^{(i)}) \ll \min(N_i, \prod_{k=1, k \neq i}^{3} N_k)$ along the connectivity and subject modes corresponding to the modular network structure, plus some sparse outlier connections $\mathcal{S}_t$:

$$ \mathcal{M}_t = \mathcal{L}_t + \mathcal{S}_t. \qquad (3.11) $$

Our goal is to separate the low-rank tensor $\mathcal{L}_t$ from its noisy version $\mathcal{M}_t$. This goal leads to the following optimization problem, in which the Tucker rank of $\mathcal{L}_t$ is minimized while simultaneously minimizing the $\ell_1$-norm of the noise part $\mathcal{S}_t$ [73]:

$$ \min_{\mathcal{L}_t, \mathcal{S}_t} \; \|\mathcal{L}_t\|_* + \lambda \|\mathcal{S}_t\|_1 \quad \text{s.t.} \quad \mathcal{L}_t + \mathcal{S}_t = \mathcal{M}_t, \qquad (3.12) $$

where $\|\mathcal{L}_t\|_*$ is the nuclear norm of the low-rank tensor.
Since minimizing the nuclear norm is an NP-hard problem, HoRPCA, presented in [73], replaces it with its convex surrogate, the sum of the nuclear norms of the mode-$i$ unfoldings: $\sum_{i=1}^{3} \|L_t^{(i)}\|_*$. One way to solve this optimization problem and obtain the low-rank and sparse parts of $\mathcal{M}_t$ is to use HoRPCA as presented in [31]. However, there are two main drawbacks to applying HoRPCA to streaming or time-varying tensor data. First, it is very time consuming to compute HoRPCA in batch mode, since all of the high-dimensional data needs to be stored and processed. Second, HoRPCA yields different low-rank subspace information at each time point, and subspace tracking, as desired in this chapter, requires additional metrics to compare the subspaces across time.

To improve computational efficiency and to better capture the evolving dynamics of the data, we propose to adapt and extend the projection-based subspace update approach outlined in REPROCS. Thus, the optimization problem in equation (3.12) is solved in two steps. First, instead of determining the low-rank subspace at each time point, we update the subspace across time by minimizing the Tucker norm, i.e. the nuclear norm of each unfolding. As shown in [208], the minimizer of the nuclear norm can be obtained through singular value thresholding (SVT); as Algorithm 3.1 will show, we use this approach to determine the low-rank subspace. In the second step, we project the observed tensor, $\mathcal{M}_t$, onto a subspace orthogonal to the estimated low-rank subspace and transform equation (3.11) into a sparse recovery in noise problem, as illustrated in equations (3.14) and (3.15), to obtain $\mathcal{S}_t$.
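The first step can be illustrated with a small sketch that estimates a truncated orthonormal basis along each mode of a tensor by thresholding the singular values of its unfoldings; the threshold sigma_min and the reuse of the unfold helper from the sketch in Section 3.2.2 are illustrative assumptions, not the exact routine used in Algorithm 3.1.

```python
import numpy as np

def mode_subspaces(X, sigma_min, unfold):
    """Estimate a truncated orthonormal basis P^(i) for each tensor mode.

    Keeps the left singular vectors of each mode-i unfolding whose singular
    values exceed sigma_min, mirroring the truncation used to initialize
    the low-rank subspace estimates.
    """
    bases = []
    for i in range(X.ndim):
        U, s, _ = np.linalg.svd(unfold(X, i), full_matrices=False)
        bases.append(U[:, s > sigma_min])   # N_i x r_i truncated basis
    return bases
```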
Since $\|\phi_t^{(i)} P_t^{(i)}\|_2$ is small, the projection of $\mathcal{L}_t$ onto the $\phi_t^{(i)}$'s yields a small $\|\beta_t\|_F$ (see Appendix). Notice that, although the projection matrices $\phi_t^{(i)}$ are of size $N_i \times N_i$, they have rank $N_i - \mathrm{rank}(\hat{P}_t^{(i)})$. Therefore, obtaining $\mathcal{S}_t$ from $\mathcal{Y}_t$ can be posed as a sparse recovery problem in small noise. Since the $\hat{P}_t^{(i)}$'s are dense and the restricted isometry constants (RIC) of the measurement matrices $\phi_t^{(i)}$ are small [214], we can accurately recover $\mathcal{S}_t$ from $\mathcal{Y}_t$ by solving the following problem:

$$\hat{\mathcal{S}}_t = \arg\min \|\mathcal{S}_t\|_1 \quad \text{s.t.} \quad \|\mathcal{Y}_t - \mathcal{S}_t \times_1 \phi_t^{(1)} \times_2 \phi_t^{(2)} \times_3 \phi_t^{(3)}\|_F \leq \epsilon. \qquad (3.15)$$

To recover $\mathcal{S}_t$ from $\mathcal{Y}_t$, we apply a serial recovery procedure for compressed tensors, known as generalized tensor compressive sensing - serial (GTCS-S), presented in [218]. This algorithm repeatedly unfolds the compressed tensor along one of its modes and applies $\ell_1$ optimization to recover its columns. Once $\hat{\mathcal{S}}_t$ is recovered, $\mathcal{L}_t$ can be estimated as $\hat{\mathcal{L}}_t = \mathcal{M}_t - \hat{\mathcal{S}}_t$.

Algorithm 3.1 Higher-order Recursive Low-Rank + Sparse Structure Learning
1: Input: $\mathcal{M}_t$, initial bases $\hat{P}_0^{(i)}$
2: Output: $\hat{\mathcal{L}}_t$, $\hat{\mathcal{S}}_t$, change points $t_j$
3: for $t > 0$ do
4:   for $i = 1:3$ do
5:     $\phi_t^{(i)} = I - \hat{P}_{t-1}^{(i)} (\hat{P}_{t-1}^{(i)})^\top$
6:   end for
7:   $\mathcal{Y}_t = \mathcal{M}_t \times_1 \phi_t^{(1)} \times_2 \phi_t^{(2)} \times_3 \phi_t^{(3)}$
8:   Recover $\hat{\mathcal{S}}_t$ from $\mathcal{Y}_t$ using the GTCS-S algorithm [218]
9:   Estimate $\hat{\mathcal{L}}_t \leftarrow \mathcal{M}_t - \hat{\mathcal{S}}_t$
10:  if $\mathrm{mod}(t - t_{j-1}, \alpha) = 0$ then
11:    for $i = 1:3$ do
12:      $D^{(i)} = [\hat{L}_{t_j+(k-1)\alpha}^{(i)} \; \cdots \; \hat{L}_{t_j+k\alpha-1}^{(i)}]$
13:      $\hat{P}_{(t)}^{(i)} = \mathrm{deleteDirection}(D^{(i)}, \hat{P}_{(t-1)}^{(i)})$
14:      $\hat{P}_{(t)}^{(i)} = \mathrm{addDirection}(D^{(i)}, \hat{P}_{(t)}^{(i)})$
15:    end for
16:    if $\hat{P}_{(t)}^{(1)} \neq \hat{P}_{(t-1)}^{(1)}$ or $\hat{P}_{(t)}^{(2)} \neq \hat{P}_{(t-1)}^{(2)}$ or $\hat{P}_{(t)}^{(3)} \neq \hat{P}_{(t-1)}^{(3)}$ then
17:      $j \leftarrow j + 1$, $t_j \leftarrow t$
18:      $\hat{P}_{(j)}^{(i)} \leftarrow \hat{P}_{(t)}^{(i)}$ for $i = 1, 2, 3$
19:    end if
20:  else
21:    $\hat{P}_{(t)}^{(i)} \leftarrow \hat{P}_{(t-1)}^{(i)}$ for $i = 1, 2, 3$
22:  end if
23: end for

3.3.3 Slowly Changing Subspace & Change Points

The following assumptions are made to define a slowly changing subspace along each mode of the tensor:

1. Let $t_j$ denote the change points of the low-dimensional subspaces that the $L_t^{(i)}$'s lie in. Note that the subspaces along each mode can vary independently of the others, so the $t_j$'s are the collection of all change points across modes. Assume that, for $\tau$ large enough, any $\tau$-length subsequence of the $L_t^{(i)}$'s lies in a low-dimensional subspace, i.e., $\max_t \mathrm{rank}([L_{t-\tau+1}^{(i)} \ldots L_t^{(i)}]) \ll \min(\tau, N_i, \prod_{k=1, k \neq i}^{3} N_k)$.

2. $\mathcal{L}_t$ lies in a low-dimensional subspace that changes slowly along each mode, i.e., $\mathcal{L}_t = \mathcal{A}_t \times_1 P_j^{(1)} \times_2 P_j^{(2)} \times_3 P_j^{(3)}$ with $P_t^{(i)} = P_j^{(i)}$ for all $t_j \leq t \leq t_{j+1}$, $j = 1, 2, \ldots, J$, where $J$ is the maximum number of change points. $P_j^{(i)}$ is an $N_i \times r_j^{(i)}$ basis matrix with $r_j^{(i)} \ll \min(N_i, \prod_{k=1, k \neq i}^{3} N_k)$.

3. At the change points $t_j$, at least one of the $P_j^{(i)}$'s changes as $P_j^{(i)} = [P_{j-1}^{(i)}, P_{j,add}^{(i)}]$, $P_j^{(i)} = P_{j-1}^{(i)} \setminus P_{j,del}^{(i)}$, or $P_j^{(i)} = [(P_{j-1}^{(i)} \setminus P_{j,del}^{(i)}), P_{j,add}^{(i)}]$, where $P_{j,add}^{(i)}$ is an $N_i \times c_{j,add}^{(i)}$ basis matrix with $(P_{j,add}^{(i)})^\top P_{j-1}^{(i)} = 0$, i.e., the new directions added to the projection matrix are orthogonal to the previous directions, and $P_{j,del}^{(i)}$ is an $N_i \times c_{j,del}^{(i)}$ matrix of deleted basis columns.

4. There exist constants $c_{max}^{(i)}$ such that $0 \leq c_{j,add}^{(i)} \leq c_{max}^{(i)} \ll r_0^{(i)}$ and $0 \leq \sum_{i'=1}^{j} (c_{i',add}^{(i)} - c_{i',del}^{(i)}) \leq c_{dif}^{(i)}$, so that $r_j^{(i)} \leq r_0^{(i)} + c_{dif}^{(i)} =: r_{max}^{(i)}$. The number of change points satisfies $J \ll \min_i (N_i - r_0^{(i)} - c_{dif}^{(i)})/c_{max}^{(i)}$, so that $r_{max}^{(i)} + J c_{max}^{(i)} \ll N_i$. Moreover, $(\prod_{k=1, k \neq i}^{3} N_k)(t_{j+1} - t_j) \gg r_0^{(i)} + c_{dif}^{(i)}$ helps to ensure $\max_t \mathrm{rank}([L_{t-\tau+1}^{(i)} \ldots L_t^{(i)}]) \ll \min(\tau, N_i, \prod_{k=1, k \neq i}^{3} N_k)$.
5. The projection of $\mathcal{L}_t$ along the newly added directions, $\mathcal{A}_{t,add} = \mathcal{L}_t \times_1 (P_{j,add}^{(1)})^\top \times_2 (P_{j,add}^{(2)})^\top \times_3 (P_{j,add}^{(3)})^\top$, is initially small, i.e., $\max_{t_j \leq t \leq t_j + \alpha} \|\mathcal{A}_{t,add}\|_\infty \leq \gamma_{add}$ with $\gamma_{add} \ll \min(\|\mathcal{L}_t\|_F, \|\mathcal{S}_t\|_F)$, but can increase gradually.

In order to enable a more efficient online implementation, the low-rank subspaces $P_t^{(i)}$ are estimated and updated every $\alpha$ samples, where $\alpha$ is selected empirically. Similar to the projection-PCA (p-PCA) procedure used in [67], the mode-$i$ unfoldings $\hat{L}_t^{(i)}$ of the last $\alpha$ estimates $\hat{\mathcal{L}}_t$ are concatenated as $D^{(i)} = [\hat{L}_{t_j+(k-1)\alpha}^{(i)} \ldots \hat{L}_{t_j+k\alpha-1}^{(i)}]$ with $k \in \{1, 2, \ldots, K\}$, where $K$ is the maximum number of length-$\alpha$ windows, and the $D^{(i)}$'s are projected onto the subspaces orthogonal to the $\hat{P}_{(j-1)}^{(i)}$'s as $D_{proj}^{(i)} = (I - \hat{P}_{(j-1)}^{(i)} (\hat{P}_{(j-1)}^{(i)})^\top) D^{(i)}$. PCA is then applied to find the subspace that spans $D_{proj}^{(i)}$. Let $P_{j,add}^{(i)}$ be the truncated basis spanning this subspace, obtained by keeping the eigenvectors with eigenvalues greater than $\sigma_{min}$. Together, $P_{j,add}^{(i)}$ and the previous subspace estimate $\hat{P}_{(j-1)}^{(i)}$ yield the new subspace estimate $\hat{P}_{(j)}^{(i)} = [\hat{P}_{(j-1)}^{(i)}, P_{j,add}^{(i)}]$.

During the update step, some of the existing directions can also be deleted from the projection matrix by finding the ones with eigenvalues lower than $\sigma_{min}$ (see Algorithms 3.2 and 3.3). If any directions are added or deleted, this indicates a change point. It is also important to note that the tensor subspace estimation implemented in this study finds the subspace along each mode individually, without taking the other modes into account. Thus, the proposed method is not optimized in the sense of higher-order orthogonal iteration (HOOI) [36], but it offers a computationally efficient way of estimating the subspace along each mode.

Algorithm 3.2 Delete Direction
1: Input: D: data, P: input basis matrix
2: Output: Q: output basis matrix
3: $\lambda = \frac{1}{w} \mathrm{diag}((P^\top D)(P^\top D)^\top)$, where $w$ is the number of columns of D
4: i = find($\lambda < \sigma_{min}$)
5: Q = [P \ P(:, i)] (remove the columns of P indexed by i)

Algorithm 3.3 Add Direction
1: Input: D: data, P: input basis matrix
2: Output: Q: output basis matrix
3: Projection: compute $D_{proj} \leftarrow (I - P P^\top) D$
4: PCA: compute $\frac{1}{w} D_{proj} D_{proj}^\top = U \lambda U^\top$, where $w$ is the number of columns of D
5: i = find($\mathrm{diag}(\lambda) > \sigma_{min}$)
6: Q = [P, U(:, i)]
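Algorithms 3.2 and 3.3 translate directly into NumPy. The sketch below is illustrative (the helper names are ours; the thesis specifies only the pseudocode above) and uses the sample covariance of the projected data, as in the p-PCA update:

```python
import numpy as np

def delete_direction(D, P, sigma_min):
    """Algorithm 3.2: drop basis columns of P whose average energy
    in the data D falls below sigma_min."""
    w = D.shape[1]
    E = P.T @ D                        # coefficients of D in the current basis
    lam = np.diag(E @ E.T) / w         # average energy per basis direction
    return P[:, lam >= sigma_min]

def add_direction(D, P, sigma_min):
    """Algorithm 3.3: append new directions spanning the part of D
    orthogonal to the current basis P."""
    w = D.shape[1]
    D_proj = D - P @ (P.T @ D)         # (I - P P^T) D
    C = D_proj @ D_proj.T / w          # sample covariance of the residual
    lam, U = np.linalg.eigh(C)         # ascending eigenvalues
    return np.hstack([P, U[:, lam > sigma_min]])
```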
3.3.4 Computational Complexity

In this section, we compare the computational complexity of the proposed approach with that of REPROCS applied to our data in vectorized form. Let the 3-way tensor be of size $N \times N \times N$. For the time points which do not require a subspace update, the computational complexity of the proposed approach is the complexity of $\ell_1$ regularization, $O(N^3)$, multiplied by the total number of fibers to recover for each mode to obtain the sparse component, i.e., $3N^2 \cdot O(N^3)$. However, if we use REPROCS after vectorizing the data, the complexity of the same operations becomes $O((N^3)^3) = O(N^9)$. For the time points which require a basis update, there is an additional cost for the covariance matrix computation and the eigenvalue decomposition. For our approach, the covariance matrix computations for the three modes have a complexity of $3 O(\alpha N \cdot N^4) = 3 O(\alpha N^5)$ operations, while the eigenvalue decompositions cost $3 O(N^3)$. In contrast, REPROCS requires $O(\alpha (N^3)^2) = O(\alpha N^6)$ operations for the covariance matrix computation and $O((N^3)^3) = O(N^9)$ operations for the eigenvalue decomposition.

3.4 Results

3.4.1 Simulated Networks

The proposed framework is first applied to three simulated dynamic tensors $\mathcal{X}_t \in \mathbb{R}^{64 \times 64 \times 60}$ for $t \in \{1, 2, \ldots, 80\}$. For each network type, 20 simulations of the tensors are generated, where each frontal slice $\mathcal{X}_t(:,:,i)$ corresponds to a weighted, undirected network. In our experiments, the networks are generated using three well-known network models: the modular small-world network, the hierarchical modular small-world network, and the overlapping modular network (see Fig. 3.1). Distinct regions of the brain that are strongly connected within themselves are specialized for different processes; this phenomenon is known as functional segregation. The presence of these specialized neuronal groups appears as different modules in brain networks. Moreover, some of the nodes in a module may have a more specialized function, which yields a hierarchical structure in a network, while some nodes belong to multiple clusters, resulting in overlapping modules [219]. All of these network structures are illustrated in Fig. 3.1.

For the experiments with the modular small-world network model, the networks initially contain 2 equal-size modules. After $t = 20$, both modules are slowly divided into two smaller modules of 16 nodes each. After $t = 60$, the network structure evolves back to the initial structure. For the second set of experiments, with the hierarchical modular small-world network model, the networks contain 2 equal-size modules at the beginning. After $t = 20$, both modules are slowly divided into two smaller modules of 16 nodes each while establishing a hierarchical structure. After $t = 60$, the network structure evolves back to the initial structure. For the third set of experiments, with overlapping modules, the networks contain 2 equal-size non-overlapping modules at the beginning. After $t = 20$, 25% of the nodes start to belong to both modules. After $t = 60$, the network structure evolves back to the initial non-overlapping structure.

Figure 3.1: Illustration of network structures: (a) Modular network, (b) Hierarchical modular network, (c) Overlapping modules.

For all of the experiments, intra-cluster edge values were drawn from $\mathcal{N}(0.6, 0.1)$ and truncated to the interval $[0, 1]$, while the inter-cluster edge values were drawn from $\mathcal{N}(0.1, 0.1)$. Moreover, these networks were corrupted by a sparse noise matrix $E_t$ whose sparsity varies from 10% to 40%, with $e_{i,j} \sim \mathrm{beta}(4, 2)$. The proposed algorithm is applied with $\alpha = 5$, and $\sigma_{min}$ is determined as 10% of the highest singular value obtained from the initial subspace estimate along that mode using the first 5 time points. The proposed algorithm is compared to an implementation without the sparse recovery step, similar to performing a standard HoSVD at each time point. The mean squared error, which quantifies the error between the estimated and original low-rank components, is computed for both algorithms and all of the network models as

$$\mathrm{MSE} = \frac{1}{T_{end} - T_{start} + 1} \sum_{t = T_{start}}^{T_{end}} \frac{\|\mathcal{L}_t - \hat{\mathcal{L}}_t\|_F^2}{\prod_{i=1}^{3} N_i}, \qquad (3.16)$$

where $T_{start}$ and $T_{end}$ are the start and end points of the detected time interval and $N_i$ is the size of the tensor along the $i$th mode. Tables 3.1, 3.2 and 3.3 show that Ho-RLSL is more robust to sparse outliers than HoSVD, with smaller MSE values. As the sparsity level of the noise increases, the difference in performance between the two algorithms also increases.
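For concreteness, the simulated networks and their sparse corruption can be generated along the following lines (a sketch under the stated distributions; the thesis does not specify whether the second parameter of $\mathcal{N}(\cdot,\cdot)$ is the standard deviation or the variance, so the standard deviation is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def modular_network(n_nodes=64, labels=None):
    """Weighted undirected network: intra-cluster edges ~ N(0.6, 0.1),
    inter-cluster edges ~ N(0.1, 0.1), truncated to [0, 1]."""
    if labels is None:
        labels = np.repeat([0, 1], n_nodes // 2)   # two equal-size modules
    same = labels[:, None] == labels[None, :]
    A = np.where(same, rng.normal(0.6, 0.1, (n_nodes, n_nodes)),
                       rng.normal(0.1, 0.1, (n_nodes, n_nodes)))
    A = np.clip((A + A.T) / 2, 0.0, 1.0)           # symmetrize and truncate
    np.fill_diagonal(A, 0.0)
    return A

def sparse_noise(shape, sparsity=0.1):
    """Sparse outliers: a fraction `sparsity` of entries drawn from beta(4, 2)."""
    E = np.zeros(shape)
    mask = rng.random(shape) < sparsity
    E[mask] = rng.beta(4, 2, mask.sum())
    return E
```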
The complexity of the network structure, i.e., modular vs. hierarchically modular, also affects the accuracy of the algorithms: increased structural complexity results in increased error, as seen in Tables 3.1, 3.2 and 3.3.

Table 3.1: Average MSE over time computed for the low-rank components during the detected time intervals obtained by Ho-RLSL and HoSVD for the modular network structure under varying noise sparsity levels.

Noise Level   Method    Interval-1   Interval-2   Interval-3
10%           Ho-RLSL   0.0046       0.0094       0.0045
              HoSVD     0.0097       0.0144       0.0095
20%           Ho-RLSL   0.0085       0.0167       0.0084
              HoSVD     0.0233       0.0302       0.0229
30%           Ho-RLSL   0.0136       0.0260       0.0137
              HoSVD     0.0409       0.0496       0.0405
40%           Ho-RLSL   0.0215       0.0387       0.0216
              HoSVD     0.0613       0.0713       0.0607

Table 3.2: Average MSE over time computed for the low-rank components during the detected time intervals obtained by Ho-RLSL and HoSVD for the hierarchical modular network structure under varying noise sparsity levels.

Noise Level   Method    Interval-1   Interval-2   Interval-3
10%           Ho-RLSL   0.0149       0.0265       0.0137
              HoSVD     0.0202       0.0319       0.0189
20%           Ho-RLSL   0.0184       0.0333       0.0171
              HoSVD     0.0335       0.0465       0.0322
30%           Ho-RLSL   0.0240       0.0377       0.0225
              HoSVD     0.0511       0.0663       0.0497
40%           Ho-RLSL   0.0346       0.0521       0.0326
              HoSVD     0.0712       0.0875       0.0699

Table 3.3: Average MSE over time computed for the low-rank components during the detected time intervals obtained by Ho-RLSL and HoSVD for the network structure with overlapping modules under varying noise sparsity levels.

Noise Level   Method    Interval-1   Interval-2   Interval-3
10%           Ho-RLSL   0.0043       0.0088       0.0045
              HoSVD     0.0091       0.0107       0.0094
20%           Ho-RLSL   0.0080       0.0195       0.0083
              HoSVD     0.0225       0.0250       0.0228
30%           Ho-RLSL   0.0131       0.0336       0.0135
              HoSVD     0.0399       0.0435       0.0404
40%           Ho-RLSL   0.0209       0.0509       0.0215
              HoSVD     0.0602       0.0640       0.0606

3.4.2 Effect of Network Size on Computation Time and Performance

The proposed framework is applied to two simulated dynamic tensors, $\mathcal{X}_t \in \mathbb{R}^{64 \times 64 \times 60}$ and $\mathcal{X}_t \in \mathbb{R}^{128 \times 128 \times 60}$ for $t \in \{1, 2, \ldots, 80\}$, to study the effect of network size on the computation time and performance of the algorithm. 10 simulations of the tensors are generated, where each frontal slice $\mathcal{X}_t(:,:,i)$ corresponds to a weighted, undirected network and the third mode corresponds to subjects. In these experiments, the networks are generated as modular small-world networks. Initially, the networks contain 2 equal-size modules. After $t = 20$, both modules are slowly divided into two smaller modules of equal size. After $t = 60$, the network structure evolves back to the initial structure. Intra-cluster edge values were drawn from $\mathcal{N}(0.6, 0.1)$ and truncated to the interval $[0, 1]$, while inter-cluster edge values were drawn from $\mathcal{N}(0.1, 0.1)$. Moreover, these networks were corrupted by a sparse noise matrix $E_t$ with 10% sparsity and $e_{i,j} \sim \mathrm{beta}(4, 2)$. The proposed algorithm is applied with $\alpha = 5$, and $\sigma_{min}$ is determined as 10% of the highest singular value obtained from the initial subspace estimate along that mode using the first 5 time points. The mean squared error between the estimated and original low-rank components is computed for both network sizes as described in Section 3.4.1.

All of the simulations were run on a computer with an Intel(R) Core(TM) i5-2500T CPU and 4.00 GB of memory. Table 3.4 shows that doubling the number of nodes in the network increases the computation time almost three-fold. Moreover, the increased network size yields better low-rank estimation for the complex network structures.
As seen in Table 3.4, the MSE computed for the second time interval, where the networks contain more modules, decreases significantly with increased network size. This is due to the fact that, with more modules in the network, the number of nodes per module decreases, making the subspace estimation more challenging; when the number of nodes in the network increases, the subspace estimation becomes more accurate.

Table 3.4: Average computation time of Ho-RLSL for dynamic tensors containing 64 × 64 and 128 × 128 networks, with the average MSE over time computed for the low-rank components during the detected time intervals.

Tensor Size       Time (sec)      Interval-1   Interval-2   Interval-3
64 × 64 × 60      2.3916 × 10^4   0.0046       0.0090       0.0045
128 × 128 × 60    6.3397 × 10^4   0.0045       0.0065       0.0046

3.4.3 EEG Data

The proposed tensor tracking approach is applied to a set of connectivity graphs constructed from EEG data containing the error-related negativity (ERN) and the correct-related negativity (CRN). The ERN is a brain potential response that occurs following performance errors in a speeded reaction time task, usually 25-75 ms after the response [220]. Previous work [221] indicates that there is increased coordination between the lateral prefrontal cortex (lPFC) and the medial prefrontal cortex (mPFC) within the theta frequency band (4-8 Hz) and the ERN time window. EEG data from 63 channels was collected in accordance with the 10/20 system on a Neuroscan Synamps2 system (Neuroscan, Inc.), sampled at 128 Hz, from 91 subjects. Full methodological details of the recording are available in the previous report [220]. The task was a common speeded-response letter (H/S) flanker task, and error and correct response-locked trials from each subject were utilized. A random subset of correct trials was selected in order to equate the number of error and correct trials for each participant. The EEG data are pre-processed using spherical spline current source density (CSD) waveforms to sharpen event-related potential (ERP) scalp topographies and reduce volume conduction [222]. The CSD has fewer assumptions than many inverse transforms, attenuates volume conduction, and represents independent sources near the cortical surface [177]. For each subject and response type, the pairwise phase locking value in the theta frequency band was computed as described in eqn. 3.8 [148]. We constructed 3-way tensors $\mathcal{X}_t \in \mathbb{R}^{63 \times 63 \times 91}$ at each time point, for the ERN and CRN data separately, where the first and second modes represent the adjacency matrices of the connectivity graphs and the third mode corresponds to the subjects, for $t \in \{1, 2, \ldots, 256\}$.

In this section, the method described in Section 3.3 is applied to tensors constructed from the connectivity networks as described in Section 3.2.3. The connectivity networks corresponding to the first 10 time points, $t \in \{1, 2, \ldots, 10\}$, and all subjects were used in the training step to obtain the initial subspace information of the low-rank component $\mathcal{L}_t$. These 10 time points correspond to approximately 78 ms of data, and it is assumed that during this period the connectivity networks are almost stationary and the low-rank structure does not change significantly. The proposed approach was then applied to the remaining time points with $\alpha = 8$ and $\sigma_{min} = 0.11$. Since the connectivity networks constructed by the phase synchrony measure are symmetric, the basis matrices $P_t^{(1)}$ and $P_t^{(2)}$ corresponding to the first two modes are identical, i.e., $P_t^{(1)} = P_t^{(2)}$ for each time point $t$.
This is due to the fact that the unfoldings of the 3-mode tensor along modes 1 and 2 yield identical matrices with identical subspaces; it does not change the general algorithm introduced in Section 3.3.

In Fig. 3.2, the change points corresponding to the connectivity mode of $\mathcal{L}_t$ are given. It can be seen that for the ERN networks there is an interval (-109, 141) ms corresponding to the response time and the ERN response. There is also a longer interval (141, 766) ms corresponding to the Pe, the error-related positivity which usually occurs 200-500 ms after an incorrect response, following the error negativity (ERN). Similar time intervals are obtained for the CRN, the biggest difference being a longer time interval around the response time, as the physiological response for the CRN is not as pronounced as that for the ERN.

Figure 3.2: Average of (a) ERN and (b) CRN waveforms and the detected change points corresponding to the connectivity mode.

In order to better interpret the network structure corresponding to the different time intervals, FCCA, reviewed in Section 3.2.4, was applied to the sets of 63 × 63 networks obtained from the low-rank tensors $\hat{\mathcal{L}}_t \in \mathbb{R}^{63 \times 63 \times 91}$ within each time interval (pre-ERN, ERN, post-ERN, pre-CRN, CRN, post-CRN), where (# of input adjacency matrices) = (# of subjects) × (# of time points in the interval). As seen in Fig. 3.3, the network structure in the pre-ERN interval is similar to that of the post-ERN interval, while the identified modules are more segregated in the ERN interval relative to the pre- and post-ERN intervals. This is in line with previous results indicating that separable clusters are apparent relative to the left and right motor areas and the left and right lateral-PFC regions during the ERN [205]. For the CRN clusters, we observed that segregation of the lateral and central areas is quite limited, with one large fronto-central cluster present during all CRN intervals. The exceptions are small frontal and central clusters during the CRN (two similar clusters were observed during the ERN). This is consistent with the idea that medial frontal regions are activated during correct trials, but to a smaller extent than during errors.

Figure 3.3: Network structures for the low-rank components of ERN networks obtained by Ho-RLSL: (a) pre-ERN, (b) ERN, (c) post-ERN.

To show the denoising performance of the proposed method compared to HoSVD, we applied HoSVD to the EEG networks with the same $\alpha$ and $\sigma_{min}$ values used for Ho-RLSL.
First, HoSVD detected pre-ERN, ERN and post-ERN intervals very similar to those obtained from Ho-RLSL (see Table 3.5); the change point at the end of the pre-ERN interval was shifted by $\alpha$ time points.

Figure 3.4: Network structures for the low-rank components of CRN networks obtained by Ho-RLSL: (a) pre-CRN, (b) CRN, (c) post-CRN.

We then used FCCA to identify the common network structure for each interval (Fig. 3.5). As seen in Figs. 3.3 and 3.5, both HoSVD and Ho-RLSL yield the same network structure for the ERN interval, while HoSVD yields noisier networks for the pre- and post-ERN intervals. When the physiological response is strong, such as during the ERN, the network is less noisy, yielding the exact cluster structure for both Ho-RLSL and HoSVD. However, when the networks are noisier, such as in the pre-ERN interval, our algorithm provides a cleaner low-rank approximation, which yields more distinct cluster structures. For the CRN networks, Ho-RLSL and HoSVD detected very different pre-CRN, CRN and post-CRN intervals (see Table 3.5), so we cannot compare the network structures obtained from the two algorithms. In particular, HoSVD detects a very long pre-CRN interval and very short CRN and post-CRN intervals. The CRN and post-CRN intervals detected by Ho-RLSL align better with well-known ERPs such as the P300.

Table 3.5: Detected ERN and CRN intervals by Ho-RLSL and HoSVD (times in seconds relative to the response).

Interval    Ho-RLSL               HoSVD
Pre-ERN     -0.484 s to -0.109 s  -0.484 s to -0.047 s
ERN         -0.109 s to 0.141 s   -0.047 s to 0.141 s
Post-ERN    0.141 s to 0.766 s    0.141 s to 0.766 s
Pre-CRN     -0.422 s to -0.047 s  -0.734 s to -0.109 s
CRN         -0.047 s to 0.391 s   -0.109 s to 0.141 s
Post-CRN    0.391 s to 0.641 s    0.141 s to 0.391 s

3.5 Conclusions

In this chapter, we introduced a new recursive low-rank + sparse structure learning algorithm for tensor-type data to track the dynamic modular structure of functional connectivity networks constructed from EEG recordings across multiple subjects. To this aim, a recent subspace tracking approach, REPROCS, was adapted and extended to tensor-type data. This extension offers several novelties with respect to the original algorithm. In the original REPROCS, the measurements at each time point are vectors and the algorithm uses the fact that each measurement vector comes from a low-rank subspace. In our case, however, low rank corresponds to a low Tucker rank, i.e., the matricized version of the tensor along each mode is a low-rank matrix, implying it can be reconstructed through the outer products of a small number of eigenvectors. This is particularly suitable for dealing with data that has a modular structure, such as FCNs. Using REPROCS to analyze this type of dataset would require vectorization, which breaks the intrinsic low-rank structure of each measurement and increases the computational complexity.
The proposed approach yields a robust estimate of the low-rank component of a dynamic 3-way tensor at each time point and provides a recursive way to update the subspace information. It identifies the time points where the low-rank subspaces change and recursively updates these subspaces to improve the low-rank approximation to the data. The proposed approach is first applied to a set of simulated networks for performance evaluation and is then used to separate the low-rank and sparse parts of dFCNs across multiple subjects. The low-rank components of the dFCNs obtained for the detected time intervals are summarized using a recently introduced multiple graph clustering approach, FCCA. The low-rank subspace approximation to each time interval provides a denoised version of the original network, thus improving the quality of the clustering results. The results indicate a clear change in the community structure before and after the physiological response. Moreover, the detected community structures are in line with previous hypotheses regarding the networks involved in cognitive control [221].

Figure 3.5: Network structures for the low-rank components of ERN networks obtained by HoSVD: (a) pre-ERN, (b) ERN, (c) post-ERN.

One main concern about the proposed algorithm is the selection of the parameters $\alpha$ and $\sigma_{min}$. Both are data-dependent parameters, and their selection requires some a priori information. First, $\alpha$ should be smaller than the time interval between two consecutive change points in order to identify each one of them. For example, in our simulations change points appear at $t = 40$ and $t = 60$, i.e., there are 20 time points between consecutive change points. We selected $\alpha = 5$ in our simulations and identified the change points correctly. With $\alpha = 10$ we can still identify the change points. However, if we select $\alpha = 30$, we identify the first change point at $t = 60$, which corresponds to the beginning of the third interval; the subspace estimate is then based on information from the second time interval, cannot represent the third time interval well, and the algorithm fails. Conversely, if we keep $\alpha$ very small, the algorithm becomes more susceptible to instantaneous noise and produces many incorrect change points, i.e., false positives. For the real dataset, since we know something about the dynamics of EEG, we selected $\alpha$ to be at least as long as the duration of the ERN (50-75 ms), giving $\alpha = 8$. The selection of $\sigma_{min}$ is also data dependent. For example, in our simulations we selected $\sigma_{min}$ as 10% of the maximum eigenvalue. If it is selected too small, the algorithm includes many basis vectors that correspond to noise, and both the low-rank assumption and the algorithm fail.
If $\sigma_{min}$ is selected too high, the algorithm does not update the low-rank subspace to track the slow changes, and again fails. Selecting $\sigma_{min}$ for a real dataset is more difficult, because the gaps between eigenvalues may not be very clear. For example, for our EEG datasets, the first eigenvalue was much larger than the others, and a threshold around 10% of the maximum eigenvalue did not work. In this case, we chose $\sigma_{min} = 0.11$ empirically. With a lower value, the algorithm includes many basis vectors at every time window, which causes it to fail; when we increased $\sigma_{min}$, we lost many of the change points.

Future work will consider extending the proposed approach to higher-order tensors by including various experimental conditions, different frequency bands and multiple modalities. Future work will also consider extensions of linear low-rank subspace models to unions of subspaces or manifolds. Modifications to the Ho-RLSL algorithm, such as predicting the support of the sparse component [204] or using partial subspace knowledge for the low-rank component [223], will further improve its performance.

Chapter 4

Multiscale Analysis for Higher-order Tensors

4.1 Introduction

Data in the form of multidimensional arrays, also referred to as tensors, arise in a variety of applications including chemometrics, hyperspectral imaging, high-resolution video, neuroimaging, biometrics and social network analysis [4, 5, 224]. These applications produce massive amounts of data collected in various forms with multiple aspects and high dimensionality. Tensors, which are multi-dimensional generalizations of matrices, provide a useful representation for such data. A crucial step in many applications involving higher-order tensors is multiway reduction of the data, ensuring that the reduced representation retains certain characteristics. Early multiway data analysis approaches reformatted the tensor data as a matrix and resorted to methods developed for classical two-way analysis. However, one cannot discover hidden components within multiway data using conventional matrix decomposition methods, as matrix-based representations cannot capture multiway couplings and focus only on standard pairwise interactions. To this end, many different types of tensor decomposition methods have been proposed in the literature [37, 40, 225-227].

In contrast to the matrix case, where data reduction is often accomplished via low-rank representations such as the singular value decomposition (SVD), the notion of rank for higher-order tensors is not uniquely defined. The CANDECOMP/PARAFAC (CP) decomposition and the Tucker decomposition are two of the most widely used tensor decomposition methods for data reduction [30, 35]. For CP, the goal is to approximate the given tensor as a weighted sum of rank-1 tensors, where a rank-1 tensor refers to the outer product of $n$ vectors, with $n$ being the order of the tensor. The Tucker model allows for interactions between the factors from the different modes, resulting in a typically dense but small core tensor. This model also introduces the notion of the Tucker rank or $n$-rank, which refers to the $n$-tuple of ranks corresponding to the tensor unfoldings along each mode. Therefore, a low-rank approximation with the Tucker model can be obtained by projections onto low-rank factor matrices. Unlike the CP decomposition, the Tucker decomposition is in general non-unique.
To obtain a meaningful and unique representation from the Tucker decomposition, orthogonality, sparsity and non-negativity constraints are often imposed on the factors, yielding Non-Negative Tensor Factorization (NTF) and Sparse Non-Negative Tucker Decomposition [228-230]. The Tucker decomposition with orthogonality constraints on the factors is known as the Higher-Order Singular Value Decomposition (HoSVD) or Multilinear SVD [35]. The HoSVD can simply be computed by flattening the tensor along each mode and calculating the $n$-mode singular vectors corresponding to that mode.

With the emergence of multidimensional big data, classical tensor representation and decomposition methods have become inadequate, since the size of these tensors exceeds the available working memory and the processing time is very long. To address the problem of large-scale tensor decomposition, several block-wise tensor decomposition methods have been proposed [37]. The basic idea is to partition a big data tensor into smaller blocks and perform tensor-related operations block-wise using a suitable tensor format. Preliminary approaches relied on a hierarchical tree structure and reduced the storage of $d$-dimensional arrays to the storage of auxiliary three-dimensional ones, such as the tensor-train decomposition (T-Train), also known as the matrix product state (MPS) decomposition [40], and the Hierarchical Tucker Decomposition (H-Tucker) [39]. In particular, in the area of large volumetric data visualization, tensor-based multiresolution hierarchical methods such as TAMRESH have attracted attention [231]. However, all of these methods fit a low-rank model to data which lies near a linear subspace, and are thus limited to learning linear structure.

Similar to the research efforts in tensor reduction, low-dimensional subspace and manifold learning methods have also been extended for higher-order data clustering and classification applications. In early work in this area, Vasilescu and Terzopoulos [232] extended the eigenface concept to the tensorface by using the higher-order SVD, taking different modes such as expression, illumination and pose into account. Similarly, 2D-PCA for matrices has been used for feature extraction from face images without converting the images into vectors [233]. He et al. [234] extended locality preserving projections [235] to second-order tensors for face recognition. Dai and Yeung [236] presented generalized tensor embedding methods, such as extensions of local discriminant embedding [237], neighborhood preserving embedding [238], and locality preserving projection [235] to tensors. Li et al. [239] proposed a supervised manifold learning method for vector-type data which preserves local structures within each class of samples, and then extended the algorithm to tensors to provide improved performance for face and gait recognition. As with vector-type manifold learning algorithms, the aim of these methods is to find an optimal linear transformation for the tensor-type training data samples without vectorizing them, mapping the samples to a low-dimensional subspace while preserving the neighborhood information.

In this chapter, we propose a novel multi-scale analysis technique to efficiently approximate tensor-type data using locally linear low-rank approximations. The proposed method constructs data-dependent multiscale dictionaries to better represent the data.
The proposed algorithm consists of two major steps: 1) constructing a tree structure by partitioning the tensor into a collection of permuted subtensors, and 2) constructing multiscale dictionaries by applying the HoSVD to each subtensor. The major contributions of the proposed framework are threefold. First, we introduce a multi-scale tensor approximation method which allows the user to approximate a given tensor under given memory and processing power constraints, allowing flexibility in the decomposition. The computational complexity and memory requirements of the proposed method are given in comparison to the conventional HoSVD. Second, the proposed tensor partitioning method clusters the tensor data across each mode to obtain subtensors composed of similar entries. In this manner, we show through a theoretical error analysis that the resulting subtensors can approximate the locally linear structure more efficiently. Finally, we introduce different variations of the method for adaptively pruning the tree, obtaining a better trade-off between compression rate and reconstruction error. The proposed method is evaluated for two common signal processing applications: data reduction and classification. Its efficiency and accuracy are evaluated on different tensor-type datasets and compared to state-of-the-art tensor decomposition methods including the HoSVD, T-Train and H-Tucker decompositions.

Although this chapter focuses on the integration of a single existing tensor factorization technique (i.e., the HoSVD) into a clustering-enhanced multiscale approximation framework, we would like to emphasize that the ideas presented herein are significantly more general. In principle, for example, there is nothing impeding the development of multiscale variants of other tensor factorization approaches (e.g., CP, T-Train, H-Tucker, etc.) in essentially the same way. In this chapter, it is demonstrated that the use of the HoSVD as part of a multiscale approximation approach leads to improved compression and classification performance over standard HoSVD approaches. However, this chapter should additionally be considered as evidence that similar improvements are also likely possible for other tensor factorization-based compression and classification schemes, as well as for other related applications.

4.2 Background

4.2.1 Tensor Notation and Algebra

A multidimensional array with $N$ modes, $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, is called a tensor, where $x_{i_1, i_2, \ldots, i_N}$ denotes the $(i_1, i_2, \ldots, i_N)$th element of the tensor $\mathcal{X}$. The vectors in $\mathbb{R}^{I_n}$ obtained by fixing all of the indices of such a tensor $\mathcal{X}$ except the one corresponding to its $n$th mode are called its mode-$n$ fibers. Let $[N] := \{1, \ldots, N\}$ for all $N \in \mathbb{N}$. Basic tensor operations are reviewed below (see, e.g., [30], [240], [241]).

Tensor addition and multiplication by a scalar: Two tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ can be added using component-wise tensor addition. The resulting tensor $\mathcal{X} + \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ has entries $(\mathcal{X} + \mathcal{Y})_{i_1, i_2, \ldots, i_N} = x_{i_1, i_2, \ldots, i_N} + y_{i_1, i_2, \ldots, i_N}$. Similarly, given a scalar $\alpha \in \mathbb{R}$ and a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, the rescaled tensor $\alpha \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ has entries $(\alpha \mathcal{X})_{i_1, i_2, \ldots, i_N} = \alpha\, x_{i_1, i_2, \ldots, i_N}$.

Mode-$n$ products: The mode-$n$ product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \ldots \times I_n \times \ldots \times I_N}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted $\mathcal{Y} = \mathcal{X} \times_n U$, with $(\mathcal{Y})_{i_1, i_2, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n = 1}^{I_n} x_{i_1, \ldots, i_n, \ldots, i_N}\, u_{j, i_n}$. The result is of size $I_1 \times \ldots \times I_{n-1} \times J \times I_{n+1} \times \ldots \times I_N$.
The following facts about mode-$n$ products are useful (see, e.g., [30], [241]).

Lemma 1. Let $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, $\alpha, \beta \in \mathbb{R}$, and $U^{(n)}, V^{(n)} \in \mathbb{C}^{J_n \times I_n}$ for all $n \in [N]$. The following are true:

(a) $(\alpha \mathcal{X} + \beta \mathcal{Y}) \times_n U^{(n)} = \alpha \left( \mathcal{X} \times_n U^{(n)} \right) + \beta \left( \mathcal{Y} \times_n U^{(n)} \right)$.

(b) $\mathcal{X} \times_n \left( \alpha U^{(n)} + \beta V^{(n)} \right) = \alpha \left( \mathcal{X} \times_n U^{(n)} \right) + \beta \left( \mathcal{X} \times_n V^{(n)} \right)$.

(c) If $n \neq m$, then $\mathcal{X} \times_n U^{(n)} \times_m V^{(m)} = \left( \mathcal{X} \times_n U^{(n)} \right) \times_m V^{(m)} = \left( \mathcal{X} \times_m V^{(m)} \right) \times_n U^{(n)} = \mathcal{X} \times_m V^{(m)} \times_n U^{(n)}$.

(d) If $W \in \mathbb{C}^{P \times J_n}$, then $\mathcal{X} \times_n U^{(n)} \times_n W = \left( \mathcal{X} \times_n U^{(n)} \right) \times_n W = \mathcal{X} \times_n \left( W U^{(n)} \right)$.

Tensor matricization: The process of reordering the elements of a tensor into a matrix is known as matricization or unfolding. The mode-$n$ matricization of a tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is denoted $Y_{(n)} \in \mathbb{R}^{I_n \times \prod_{m \neq n} I_m}$ and is obtained by arranging $\mathcal{Y}$'s mode-$n$ fibers as the columns of the resulting matrix. Unfolding the tensor

$$\mathcal{Y} = \mathcal{X} \times_1 U^{(1)} \times_2 U^{(2)} \ldots \times_N U^{(N)} =: \mathcal{X} \prod_{n=1}^{N} \times_n U^{(n)}$$

along mode $n$ is equivalent to

$$Y_{(n)} = U^{(n)} X_{(n)} \left( U^{(N)} \otimes \ldots \otimes U^{(n+1)} \otimes U^{(n-1)} \otimes \ldots \otimes U^{(1)} \right)^\top, \qquad (4.1)$$

where $\otimes$ is the matrix Kronecker product. In particular, (4.1) implies that the matricization of $\mathcal{X} \times_n U^{(n)}$ is $\left( \mathcal{X} \times_n U^{(n)} \right)_{(n)} = U^{(n)} X_{(n)}$.¹ It is worth noting that trivial inner product preserving isomorphisms exist between a tensor space $\mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ and any of its matricized versions (i.e., mode-$n$ matricization can be viewed as an isomorphism between the original tensor vector space and its mode-$n$ matricized target vector space $\mathbb{R}^{I_n \times \prod_{m \neq n} I_m}$). In particular, the process of matricizing tensors is linear: if $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, then the mode-$n$ matricization of $\mathcal{X} + \mathcal{Y}$ is $(\mathcal{X} + \mathcal{Y})_{(n)} = X_{(n)} + Y_{(n)}$ for all modes $n \in [N]$.

¹ Simply set $U^{(m)} = I$ (the identity) for all $m \neq n$ in (4.1). This fact also follows directly from the definition of the mode-$n$ product.

Tensor rank: Unlike matrices, which have a unique definition of rank, there are multiple rank definitions for tensors, including the tensor rank and the tensor $n$-rank. The rank of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \ldots \times I_N}$ is the smallest number of rank-one tensors that form $\mathcal{X}$ as their sum. The $n$-rank of $\mathcal{X}$ is the collection of the ranks of its unfoldings $X_{(n)}$ and is denoted as

$$n\text{-rank}(\mathcal{X}) = \left( \mathrm{rank}(X_{(1)}), \mathrm{rank}(X_{(2)}), \ldots, \mathrm{rank}(X_{(N)}) \right). \qquad (4.2)$$

Tensor inner product: The inner product of two same-sized tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is the sum of the products of their elements,

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1 = 1}^{I_1} \sum_{i_2 = 1}^{I_2} \ldots \sum_{i_N = 1}^{I_N} x_{i_1, i_2, \ldots, i_N}\, y_{i_1, i_2, \ldots, i_N}. \qquad (4.3)$$

It is not difficult to see that matricization preserves Hilbert-Schmidt/Frobenius matrix inner products, i.e., that $\langle \mathcal{X}, \mathcal{Y} \rangle = \langle X_{(n)}, Y_{(n)} \rangle_F = \mathrm{Trace}\left( X_{(n)}^\top Y_{(n)} \right)$ holds for all $n \in [N]$. If $\langle \mathcal{X}, \mathcal{Y} \rangle = 0$, the tensors $\mathcal{X}$ and $\mathcal{Y}$ are orthogonal.

Tensor norm: The norm of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is the square root of the sum of the squares of all its elements,

$$\|\mathcal{X}\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle} = \sqrt{\sum_{i_1 = 1}^{I_1} \sum_{i_2 = 1}^{I_2} \ldots \sum_{i_N = 1}^{I_N} x_{i_1, i_2, \ldots, i_N}^2}. \qquad (4.4)$$

The fact that matricization preserves Frobenius matrix inner products also means that it preserves Frobenius matrix norms. As a result, $\|\mathcal{X}\| = \|X_{(n)}\|_F$ holds for all $n \in [N]$. If $\mathcal{X}$ and $\mathcal{Y}$ are orthogonal and also have unit norm (i.e., $\|\mathcal{X}\| = \|\mathcal{Y}\| = 1$), we say that they are an orthonormal pair.
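The matricization identities are easy to check numerically. The sketch below (one consistent fiber-ordering convention; the helper names are ours) verifies the special case $(\mathcal{X} \times_n U)_{(n)} = U X_{(n)}$ of (4.1) and the norm-preservation property:

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: mode-n fibers become columns."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4, 5))
U = rng.standard_normal((6, 4))              # acts on mode n = 1 (0-indexed)

# (X x_n U)_(n) = U X_(n): fold U @ unfold(X, 1) back into a tensor.
Yn = U @ unfold(X, 1)
Y = np.moveaxis(Yn.reshape(6, 3, 5), 0, 1)   # resulting shape (3, 6, 5)

# Matricization preserves the Frobenius norm: ||X|| = ||X_(n)||_F.
assert np.isclose(np.linalg.norm(X), np.linalg.norm(unfold(X, 2)))
```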
4.2.2 Some Useful Facts Concerning Mode-n Products and Orthogonality

Let $I \in \mathbb{R}^{I_n \times I_n}$ be the identity matrix. Given a (low-rank) orthogonal projection matrix $P \in \mathbb{R}^{I_n \times I_n}$, one can decompose any given tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ into two orthogonal tensors using Lemma 1(b):

$$\mathcal{X} = \mathcal{X} \times_n I = \mathcal{X} \times_n \left( (I - P) + P \right) = \mathcal{X} \times_n (I - P) + \mathcal{X} \times_n P.$$

To check that the two summands are orthogonal, one can use (4.1) to compute

$$\langle \mathcal{X} \times_n (I - P), \mathcal{X} \times_n P \rangle = \langle (I - P) X_{(n)}, P X_{(n)} \rangle_F = \mathrm{Trace}\left( X_{(n)}^\top (I - P) P X_{(n)} \right) = 0.$$

As a result, one can also verify that the Pythagorean theorem holds, i.e., that $\|\mathcal{X}\|^2 = \|\mathcal{X} \times_n P\|^2 + \|\mathcal{X} \times_n (I - P)\|^2$. If we now regard $\mathcal{X} \times_n P$ as a low-rank approximation to $\mathcal{X}$, then we can see that its approximation error $\mathcal{X} - \mathcal{X} \times_n P = \mathcal{X} \times_n (I - P)$ is orthogonal to the low-rank approximation $\mathcal{X} \times_n P$, as one would expect. Furthermore, the norm of the approximation error satisfies $\|\mathcal{X} \times_n (I - P)\|^2 = \|\mathcal{X}\|^2 - \|\mathcal{X} \times_n P\|^2$. By continuing to use similar ideas in combination with Lemma 1 for all modes, one can prove the following more general Pythagorean result (see, e.g., Theorem 5.1 in [241]).

Lemma 2. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ and let $U^{(n)} \in \mathbb{R}^{I_n \times I_n}$ be an orthogonal projection matrix for all $n \in [N]$. Then,

$$\left\| \mathcal{X} - \mathcal{X} \times_1 U^{(1)} \times_2 U^{(2)} \ldots \times_N U^{(N)} \right\|^2 = \sum_{n=1}^{N} \left\| \mathcal{X} \left( \prod_{h=1}^{n-1} \times_h U^{(h)} \right) \times_n \left( I - U^{(n)} \right) \right\|^2.$$

4.2.3 Higher Order Singular Value Decomposition (HoSVD)

Any tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ can be decomposed as mode products of a core tensor $\mathcal{C} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ with $N$ orthogonal matrices $U^{(n)} \in \mathbb{R}^{I_n \times I_n}$, each of which is composed of the left singular vectors of $X_{(n)}$ [35]:

$$\mathcal{X} = \mathcal{C} \times_1 U^{(1)} \times_2 U^{(2)} \ldots \times_N U^{(N)} = \mathcal{C} \prod_{n=1}^{N} \times_n U^{(n)}, \qquad (4.5)$$

where the core tensor $\mathcal{C}$ is computed as

$$\mathcal{C} = \mathcal{X} \times_1 \left( U^{(1)} \right)^\top \times_2 \left( U^{(2)} \right)^\top \ldots \times_N \left( U^{(N)} \right)^\top. \qquad (4.6)$$

Let $\mathcal{C}_{i_n = \alpha}$ be the subtensor of $\mathcal{C}$ obtained by fixing the $n$th index to $\alpha$. These subtensors satisfy the following properties:

- all-orthogonality: $\mathcal{C}_{i_n = \alpha}$ and $\mathcal{C}_{i_n = \beta}$ are orthogonal for all possible values of $n$, $\alpha$ and $\beta$ subject to $\alpha \neq \beta$:

$$\langle \mathcal{C}_{i_n = \alpha}, \mathcal{C}_{i_n = \beta} \rangle = 0 \quad \text{when } \alpha \neq \beta. \qquad (4.7)$$

- ordering:

$$\|\mathcal{C}_{i_n = 1}\| \geq \|\mathcal{C}_{i_n = 2}\| \geq \ldots \geq \|\mathcal{C}_{i_n = I_n}\| \geq 0 \qquad (4.8)$$

for all possible values of $n$.

4.3 Multiscale Analysis of Higher-order Datasets

In this section, we present a new tensor decomposition method, named Multiscale HoSVD (MS-HoSVD), for an $N$th order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$. The proposed method recursively applies the following two-step approach: (i) low-rank tensor approximation, followed by (ii) decomposition of the residual (original minus low-rank) tensor into subtensors.

A tensor $\mathcal{X}$ is decomposed using the HoSVD as

$$\mathcal{X} = \mathcal{C} \times_1 U^{(1)} \times_2 U^{(2)} \ldots \times_N U^{(N)}, \qquad (4.9)$$

where the $U^{(n)}$'s are the left singular vectors of the unfoldings $X_{(n)}$. The low-rank approximation of $\mathcal{X}$ is obtained as

$$\hat{\mathcal{X}}_0 = \mathcal{C}_0 \times_1 \hat{U}^{(1)} \times_2 \hat{U}^{(2)} \ldots \times_N \hat{U}^{(N)}, \qquad (4.10)$$

where the $\hat{U}^{(n)} \in \mathbb{R}^{I_n \times r_n}$ are truncated matrices obtained by keeping the first $r_n$ columns of $U^{(n)}$, and $\mathcal{C}_0 = \mathcal{X} \times_1 (\hat{U}^{(1)})^\top \times_2 (\hat{U}^{(2)})^\top \ldots \times_N (\hat{U}^{(N)})^\top$. The multilinear rank of $\hat{\mathcal{X}}_0$, $\{r_1, \ldots, r_N\}$, can either be given a priori, or an energy criterion can be used to determine the minimum number of singular values to keep along each mode:

$$r_n = \min \left\{ i : \frac{\sum_{l=1}^{i} \sigma_l^{(n)}}{\sum_{l=1}^{I_n} \sigma_l^{(n)}} > \tau \right\}, \qquad (4.11)$$

where $\sigma_l^{(n)}$ is the $l$th singular value of the unfolding $X_{(n)}$ and $\tau$ is an energy threshold. Once $\hat{\mathcal{X}}_0$ is obtained, the tensor $\mathcal{X}$ can be written as

$$\mathcal{X} = \hat{\mathcal{X}}_0 + \mathcal{W}_0, \qquad (4.12)$$

where $\mathcal{W}_0$ is the residual tensor.
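A compact implementation of the truncated HoSVD in (4.10)-(4.12), with the ranks chosen by the energy criterion (4.11), might look as follows (an illustrative sketch; unfold and mode_n_product are the helpers defined in the earlier sketches, repeated here for completeness):

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    Yn = U @ unfold(X, n)
    shape = (U.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis(Yn.reshape(shape), 0, n)

def truncated_hosvd(X, tau=0.7):
    """Keep, along each mode, the fewest singular vectors whose cumulative
    singular-value energy exceeds tau, as in (4.11)."""
    U_hats = []
    for n in range(X.ndim):
        U, s, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        r = int(np.searchsorted(np.cumsum(s) / s.sum(), tau)) + 1
        U_hats.append(U[:, :r])
    C = X
    for n, U in enumerate(U_hats):
        C = mode_n_product(C, U.T, n)    # C0 = X x_1 U1^T ... x_N UN^T
    return C, U_hats

def reconstruct(C, U_hats):
    X_hat = C
    for n, U in enumerate(U_hats):
        X_hat = mode_n_product(X_hat, U, n)
    return X_hat

# The 0th scale residual of (4.12) is W0 = X - reconstruct(*truncated_hosvd(X)).
```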
For the first-scale analysis, to better encode the details of $\mathcal{X}$, we adapt an idea similar to the one presented in [242, 243]. The 0th scale residual tensor $\mathcal{W}_0$ is first decomposed into subtensors as follows. $\mathcal{W}_0 \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is unfolded along each mode, yielding $W_{0,(n)} \in \mathbb{R}^{I_n \times \prod_{j \neq n} I_j}$, whose columns are the mode-$n$ fibers of $\mathcal{W}_0$. For each mode, the rows of $W_{0,(n)}$ are partitioned into $c_n$ non-overlapping clusters using a clustering algorithm such as local subspace analysis (LSA) [244], in order to encourage the formation of new subtensors which are intrinsically lower rank and therefore better approximated via a smaller HoSVD at the next scale.

The Cartesian product of the partitioning labels coming from the $N$ modes yields $K = \prod_{i=1}^{N} c_i$ disjoint subtensors $\mathcal{X}_{1,k}$, $k \in [K]$. Let $J_0^n$ be the index set corresponding to the $n$th mode of $\mathcal{W}_0$, with $J_0^n = [I_n]$, and let $J_{1,k}^n$ be the index set of the subtensor $\mathcal{X}_{1,k}$ for the $n$th mode, where $J_{1,k}^n \subseteq J_0^n$ for all $k \in [K]$ and $n \in [N]$. The index sets of the subtensors for the $n$th mode satisfy $\bigcup_{k=1}^{K} J_{1,k}^n = J_0^n$ for all $n \in [N]$. For example, the index set of the first subtensor $\mathcal{X}_{1,1}$ can be written as $J_{1,1}^1 \times J_{1,1}^2 \times \ldots \times J_{1,1}^N$, and the $k$th subtensor $\mathcal{X}_{1,k} \in \mathbb{R}^{|J_{1,k}^1| \times |J_{1,k}^2| \times \ldots \times |J_{1,k}^N|}$ is obtained by

$$\mathcal{X}_{1,k}(i_1, i_2, \ldots, i_N) = \mathcal{W}_0(J_{1,k}^1(i_1), J_{1,k}^2(i_2), \ldots, J_{1,k}^N(i_N)), \quad \text{i.e.,} \quad \mathcal{X}_{1,k} = \mathcal{W}_0(J_{1,k}^1 \times J_{1,k}^2 \times \ldots \times J_{1,k}^N), \qquad (4.13)$$

where $i_n \in [|J_{1,k}^n|]$. A low-rank approximation of each subtensor is obtained by applying the HoSVD:

$$\hat{\mathcal{X}}_{1,k} = \mathcal{C}_{1,k} \times_1 \hat{U}_{1,k}^{(1)} \times_2 \hat{U}_{1,k}^{(2)} \ldots \times_N \hat{U}_{1,k}^{(N)}, \qquad (4.14)$$

where $\mathcal{C}_{1,k}$ and $\hat{U}_{1,k}^{(n)} \in \mathbb{R}^{|J_{1,k}^n| \times r_{1,k}^{(n)}}$ correspond to the core tensor and the low-rank projection basis matrices of $\mathcal{X}_{1,k}$, respectively. We can then define $\hat{\mathcal{X}}_1$, the 1st scale approximation of $\mathcal{X}$, by mapping all of the subtensor approximations $\hat{\mathcal{X}}_{1,k}$ back to their original coordinates:

$$\hat{\mathcal{X}}_1(J_{1,k}^1 \times J_{1,k}^2 \times \ldots \times J_{1,k}^N) = \hat{\mathcal{X}}_{1,k}. \qquad (4.15)$$

Similarly, the 1st scale residual tensor is obtained by

$$\mathcal{W}_1(J_{1,k}^1 \times J_{1,k}^2 \times \ldots \times J_{1,k}^N) = \mathcal{W}_{1,k}, \qquad (4.16)$$

where $\mathcal{W}_{1,k} = \mathcal{X}_{1,k} - \hat{\mathcal{X}}_{1,k}$. Therefore, $\mathcal{X}$ can be rewritten as

$$\mathcal{X} = \hat{\mathcal{X}}_0 + \mathcal{W}_0 = \hat{\mathcal{X}}_0 + \hat{\mathcal{X}}_1 + \mathcal{W}_1. \qquad (4.17)$$

Continuing in this fashion, the $j$th scale approximation of $\mathcal{X}$ is obtained by partitioning the $\mathcal{W}_{j-1,k}$'s into subtensors $\mathcal{X}_{j,k}$ and fitting a low-rank model to each of them in a similar fashion. Finally, the $j$th scale decomposition of $\mathcal{X}$ can be written as

$$\mathcal{X} = \sum_{i=0}^{j} \hat{\mathcal{X}}_i + \mathcal{W}_j. \qquad (4.18)$$

Algorithm 4.1 gives the pseudocode for this approach.

Algorithm 4.1 Multiscale HoSVD
1: Input: $\mathcal{X}$: tensor, $C = (c_1, c_2, \ldots, c_N)$: the desired number of clusters for each mode, $s_H$: the highest scale of MS-HoSVD.
2: Output: $T$: tree structure containing the MS-HoSVD decomposition of $\hat{\mathcal{X}}$.
3: Create an empty tree $T$.
4: Create an empty list $L$.
5: Add the node containing $\mathcal{X} =: \mathcal{X}_{0,1}$ to $L$ with Parent$(0,1) = \emptyset$ (i.e., this is the root of the tree).
6: while $L$ is not empty do
7:   Pop a node corresponding to $\mathcal{X}_{s,t}$ (the $t$th subtensor from scale $s$) from the list $L$, where $s \in \{0, \ldots, s_H\}$ and $t \in \{1, \ldots, K^s\}$.
8:   $\{\mathcal{C}_{s,t}, \hat{U}_{s,t}^{(n)}\} \leftarrow$ truncatedHOSVD$(\mathcal{X}_{s,t})$.
9:   Add the node containing $\{\mathcal{C}_{s,t}, \hat{U}_{s,t}^{(n)}\}$ to $T$ as a child of Parent$(s,t)$.
10:  if $s < s_H$ then
11:    Compute $\mathcal{W}_{s,t} = \mathcal{X}_{s,t} - \hat{\mathcal{X}}_{s,t}$.
12:    Create $K$ subtensors $\mathcal{X}_{s+1, K(t-1)+k}$ with index sets $J_{s+1, K(t-1)+k}^n$ from $\mathcal{W}_{s,t}$, where $k \in \{1, 2, \ldots, K\}$ and $n \in \{1, 2, \ldots, N\}$.
13:    Add the $K$ nodes containing $\mathcal{X}_{s+1, K(t-1)+k}$ and $J_{s+1, K(t-1)+k}^n$ to $L$ with Parent$(s+1, K(t-1)+k) = (s, t)$.
14:  end if
15: end while
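A first-scale pass (clustering each mode of the residual and compressing every subtensor) can be sketched as follows, reusing unfold, truncated_hosvd and reconstruct from the previous sketches; k-means stands in here for the LSA clustering used in the thesis:

```python
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

def first_scale(W0, c=2, tau=0.7):
    """One MS-HoSVD scale on the residual W0: partition the rows of every
    mode-n unfolding into c clusters, then apply a truncated HoSVD to each
    of the c^N resulting subtensors (cf. (4.13)-(4.15))."""
    N = W0.ndim
    labels = [KMeans(n_clusters=c, n_init=10).fit_predict(unfold(W0, n))
              for n in range(N)]
    X1_hat = np.zeros_like(W0)
    for combo in product(range(c), repeat=N):   # Cartesian product of labels
        idx = [np.where(labels[n] == combo[n])[0] for n in range(N)]
        sub = W0[np.ix_(*idx)]                  # subtensor X_{1,k}
        C, U_hats = truncated_hosvd(sub, tau)
        X1_hat[np.ix_(*idx)] = reconstruct(C, U_hats)
    return X1_hat                               # first-scale approximation
```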
4.3.1 Memory Cost of the First Scale Decomposition

Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ be an $N$th order tensor. To simplify the notation, assume that the dimension of each mode is the same, i.e., $I_1 = I_2 = \ldots = I_N = I$. Assume $\mathcal{X}$ is approximated by the HoSVD as

$$\hat{\mathcal{X}} = \mathcal{C}_H \times_1 U_H^{(1)} \times_2 U_H^{(2)} \ldots \times_N U_H^{(N)}, \qquad (4.19)$$

with the rank of each mode matrix fixed at $\mathrm{rank}(U_H^{(i)}) = r_H$ for $i \in \{1, 2, \ldots, N\}$. Let $F(\cdot)$ be a function that quantifies the memory cost. Then the storage cost of $\mathcal{X}$ decomposed by the HoSVD is $F(\mathcal{C}_H) + \sum_{i=1}^{N} F(U_H^{(i)}) = r_H^N + N I r_H$. For multiscale analysis at scale 1, $\hat{\mathcal{X}} = \hat{\mathcal{X}}_0 + \hat{\mathcal{X}}_1$. The cost of storing $\hat{\mathcal{X}}_0$ is $F(\mathcal{C}_0) + \sum_{i=1}^{N} F(\hat{U}^{(i)}) = r_0^N + N I r_0$, where the rank of each mode matrix is fixed at $\mathrm{rank}(\hat{U}^{(i)}) = r_0$ for $i \in \{1, 2, \ldots, N\}$. The cost of storing $\hat{\mathcal{X}}_1$ is the sum of the storage costs of the $K = \prod_{i=1}^{N} c^{(i)}$ subtensors $\hat{\mathcal{X}}_{1,k}$. Assume $c^{(i)} = c$ for all $i \in \{1, 2, \ldots, N\}$, yielding $c^N$ equally sized subtensors, and that each $\hat{\mathcal{X}}_{1,k}$ is decomposed using the HoSVD as $\hat{\mathcal{X}}_{1,k} = \mathcal{C}_{1,k} \times_1 \hat{U}_{1,k}^{(1)} \times_2 \hat{U}_{1,k}^{(2)} \ldots \times_N \hat{U}_{1,k}^{(N)}$, with the rank of each mode matrix fixed at $\mathrm{rank}(\hat{U}_{1,k}^{(i)}) = r_1$ for all $i \in \{1, 2, \ldots, N\}$ and $k \in \{1, 2, \ldots, K\}$. Then the memory cost of the first scale is $\sum_{k=1}^{K} \left( F(\mathcal{C}_{1,k}) + \sum_{i=1}^{N} F(\hat{U}_{1,k}^{(i)}) \right) = c^N r_1^N + N I r_1$. Choosing $r_1 \lesssim r_0 / c$ ensures that the storage cost does not grow exponentially, so that $F(\hat{\mathcal{X}}_1) \approx F(\hat{\mathcal{X}}_0)$, and the total cost becomes approximately $2 r_0^N + 2 N I r_0$. Thus, picking $r_0 \lesssim r_H / 2$ can provide a lower storage cost for the first-scale analysis than for the HoSVD.
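The storage comparison can be made concrete with a small numerical example (illustrative arithmetic only, using the cost expressions above):

```python
# HoSVD:              F = r_H^N + N*I*r_H
# MS-HoSVD (scale 1): F = (r_0^N + N*I*r_0) + (c^N * r_1^N + N*I*r_1)
I, N, c = 100, 3, 2
r_H = 20
r_0 = r_H // 2            # r_0 = r_H / 2
r_1 = r_0 // c            # r_1 = r_0 / c

cost_hosvd = r_H**N + N * I * r_H
cost_ms = (r_0**N + N * I * r_0) + (c**N * r_1**N + N * I * r_1)
print(cost_hosvd)         # 14000
print(cost_ms)            # (1000 + 3000) + (1000 + 1500) = 6500
```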
Recall that we want to compute the HoSVD of the subtensors we form at each scale in order to create low-rank projection basis matrices along the lines of those in (4.14). Toward this end p nq ¤ pnq q  |J n | left singular vectors of the mode-n matriciza1,k ± p nq tion of each X |k , X|k pnq P RIn  mn Im , for all n P rN s. Note that X|k pnq  Rk X|k pnq pnq is a projection matrix.2 Thus, the top rpnq always holds for these matricizations since R we compute the top rk rank pRk k k n . Let left singular vectors of X|k pnq will only have nonzero entries in locations indexed by J1,k pnq P RIn rkpnq be the matrix whose columns are these top singular vectors. As a result of the pnq  Rpnq Ûpnq will hold for all n P rN s and k P rK s. preceding discussion we can see that Û Ûk k pnq P Our low rank projection matrices Qk k k RIn In used to produce low rank approximations of each subtensor X |k can now be defined as pnq : Ûpnq Ûpnq J . Qk k (4.23) k pnq  Rpnq Ûpnq holding, combined with the fact that Rpnq J  Rpnq As a consequence of Ûk k k k pnq matrix is diagonal, we have that k since each Rk   p nq p nq p nq J p nq p nq p nq pnq J  Rk Ûk Rk Ûk Qk : Ûk Ûk   p nq J p nq pnq p nq J  R Û Û R  RpnqQpnqRpnq k holds for all n k k k k k (4.24) k P rN s and k P rK s. Using (4.24) combined with the fact that Rpknq is a projection 2 Here we are implicitly using (4.1). 91 matrix we can further see that pnq pnq  Rpnq Rpnq Qpnq Rpnq  Rpnq Qpnq Rpnq  Qpnq  Rpnq Qpnq Rpnq k k k k k k k k k k k   RpnqQpnqRpnq Rpnq  QpnqRpnq R k Qk k also holds for all n P rN s and k k k k k (4.25) k P rK s. 1-scale Analysis of MS-HoSVD: Using this linear algebraic formulation we are now able to reexpress the the 1st scale approximation of X 1st scale residual tensor tensor, W1 X̂1  Ķ k 1  X |k N ¡ n 1 p nq  Qk   P RI1I2...IN , X̂1 P RI1I2...IN , as well as the P RI1I2...IN , as follows (see (4.15) – (4.17)). We have that  Ķ k 1  Ķ W0 W0 k 1  Ķ  X k 1 N ¡ n 1 N ¡ n 1 pnq pnq pUsing Lemma 1 and (4.21)q Qk R k p nq Q pUsing the properties in (4.25)q k  X̂0 N ¡ n1 p nq pUsing (4.12)q Qk (4.26) holds. Thus, we see that the residual error W1 from (4.17) satisfies X  X̂0 Ķ k 1   X  X̂0 N ¡ n 1 pnq Qk W1 . (4.27) Having derived (4.27) it behooves us to consider when using such a first scale approximation of X is actually better than, e.g., just using a standard HoSVD-based 0th scale approximation of X along the lines of (4.12). As one might expect, this depends entirely on piq how well the 1st scale partitions (i.e., the restriction matrices utilized (4.20)) are chosen, as well as on piiq how 92 well restriction matrices of the type used in (4.20) interact with the projection matrices used to create the standard HoSVD-based approximation in question. Toward understanding these two conditions better, recall that X̂0 X̂0 X P RI1I2...IN in (4.27) is defined as N ¡ n 1 Ppnq  X 1 Pp1q 2 Pp2q    N PpN q (4.28) J  p n q p n q I  I p n q p n q n n where the orthogonal projection matrices P PR are given by P  Û Û for the matrices Ûpnq P RIn rn used in (4.10). For simplicity let the ranks of the Ppnq pro- jection matrices momentarily satisfy r1 r0   maxntrankpXpnqqu). pnq in (4.23) be r Qk 1     r2 rN : r0 (i.e., let them all be rank pnq Similarly, let all the ranks, rk , of the 1st scale projection matrices for the time being. 
1-scale Analysis of MS-HoSVD: Using this linear algebraic formulation, we are now able to re-express the 1st scale approximation of $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$, $\hat{\mathcal{X}}_1$, as well as the 1st scale residual tensor, $\mathcal{W}_1$, as follows (see (4.15)-(4.17)). We have that

$$\hat{\mathcal{X}}_1 = \sum_{k=1}^{K} \mathcal{X}|_k \prod_{n=1}^{N} \times_n Q_k^{(n)} = \sum_{k=1}^{K} \mathcal{W}_0 \prod_{n=1}^{N} \times_n Q_k^{(n)} R_k^{(n)} \quad \text{(using Lemma 1 and (4.21))}$$
$$= \sum_{k=1}^{K} \mathcal{W}_0 \prod_{n=1}^{N} \times_n Q_k^{(n)} \quad \text{(using the properties in (4.25))}$$
$$= \sum_{k=1}^{K} \left( \mathcal{X} - \hat{\mathcal{X}}_0 \right) \prod_{n=1}^{N} \times_n Q_k^{(n)} \quad \text{(using (4.12))} \qquad (4.26)$$

holds. Thus, the residual error $\mathcal{W}_1$ from (4.17) satisfies

$$\mathcal{X} = \hat{\mathcal{X}}_0 + \sum_{k=1}^{K} \left( \mathcal{X} - \hat{\mathcal{X}}_0 \right) \prod_{n=1}^{N} \times_n Q_k^{(n)} + \mathcal{W}_1. \qquad (4.27)$$

Having derived (4.27), it behooves us to consider when using such a first-scale approximation of $\mathcal{X}$ is actually better than, e.g., just using a standard HoSVD-based 0th scale approximation of $\mathcal{X}$ along the lines of (4.12). As one might expect, this depends entirely on (i) how well the 1st scale partitions (i.e., the restriction matrices utilized in (4.20)) are chosen, as well as on (ii) how well restriction matrices of the type used in (4.20) interact with the projection matrices used to create the standard HoSVD-based approximation in question. Toward understanding these two conditions better, recall that $\hat{\mathcal{X}}_0 \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ in (4.27) is defined as

$$\hat{\mathcal{X}}_0 = \mathcal{X} \prod_{n=1}^{N} \times_n P^{(n)} = \mathcal{X} \times_1 P^{(1)} \times_2 P^{(2)} \cdots \times_N P^{(N)}, \qquad (4.28)$$

where the orthogonal projection matrices $P^{(n)} \in \mathbb{R}^{I_n \times I_n}$ are given by $P^{(n)} = \hat{U}^{(n)} (\hat{U}^{(n)})^\top$ for the matrices $\hat{U}^{(n)} \in \mathbb{R}^{I_n \times r_n}$ used in (4.10). For simplicity, let the ranks of the $P^{(n)}$ projection matrices momentarily satisfy $r_1 = r_2 = \ldots = r_N =: r_0$ (i.e., let them all be rank $r_0 < \max_n \{\mathrm{rank}(X_{(n)})\}$). Similarly, let all the ranks $r_k^{(n)}$ of the 1st scale projection matrices $Q_k^{(n)}$ in (4.23) be $r_1$ for the time being.

Motivated by, e.g., the memory cost analysis of Section 4.3.1 above, one can now ask when the multiscale approximation error $\|\mathcal{W}_1\|$ resulting from (4.27) will be less than a standard HoSVD-based approximation error $\|\mathcal{X} - \bar{\mathcal{X}}_0\|$, where

$$\bar{\mathcal{X}}_0 := \mathcal{X} \prod_{n=1}^{N} \times_n \bar{P}^{(n)} = \mathcal{X} \times_1 \bar{P}^{(1)} \times_2 \bar{P}^{(2)} \cdots \times_N \bar{P}^{(N)}, \qquad (4.29)$$

and each orthogonal projection matrix $\bar{P}^{(n)}$ is of rank $\bar{r}_n = r_H \geq 2 r_0 \geq r_0 + c^{N-1} r_1$ (i.e., where each $\bar{P}^{(n)}$ projects onto the top $r_H$ left singular vectors of $X_{(n)}$). In this situation, having both $\|\mathcal{W}_1\| < \|\mathcal{X} - \bar{\mathcal{X}}_0\|$ and $r_H \geq 2 r_0 \geq r_0 + c^{N-1} r_1$ hold at the same time would imply that one could achieve a smaller approximation error using MS-HoSVD than using the HoSVD, while simultaneously achieving better compression (recall Section 4.3.1). In order to help facilitate such analysis, we prove error bounds in the Appendix that are implied by the choice of a good partitioning scheme for the residual tensor $\mathcal{W}_0$ in (4.20)-(4.22).

In particular, with respect to the question of how the 1st-scale approximation error $\|\mathcal{W}_1\|$ might compare to the HoSVD-based approximation error $\|\mathcal{X} - \bar{\mathcal{X}}_0\|$, we can use the following notion of an effective partition of $\mathcal{W}_0$. The partition of $\mathcal{W}_0$ formed by the restriction matrices $R_k^{(n)}$ in (4.20)-(4.22) will be called effective if there exists another pessimistic partitioning of $\mathcal{W}_0$ via (potentially different) restriction matrices $\{\tilde{R}_k^{(n)}\}_{k=1}^{K}$, together with a bijection $f : [K] \rightarrow [K]$, such that

$$\sum_{n=1}^{N} \left\| \mathcal{X}|_k \times_n \left( I - Q_k^{(n)} \right) \right\|^2 \leq \sum_{n=1}^{N} \left\| \mathcal{W}_0 \times_n \tilde{R}_{f(k)}^{(n)} \left( I - \tilde{P}^{(n)} \right) \prod_{h \neq n} \times_h \tilde{R}_{f(k)}^{(h)} \right\|^2 \qquad (4.30)$$

holds for each $k \in [K]$. In (4.30), the $\tilde{P}^{(n)}$ are the orthogonal projection matrices obtained from the HoSVD of $\mathcal{W}_0$ with ranks $\tilde{r}_n \leq r_H$ (i.e., where each $\tilde{P}^{(n)}$ projects onto the top $\tilde{r}_n = r_H$ left singular vectors of the matricization $W_{0,(n)}$). In the Appendix, we show that (4.30) holding for $\mathcal{W}_0$ implies that the error $\|\mathcal{W}_1\|$ resulting from our 1st-scale approximation in (4.27) is less than an upper bound of the type often used for HoSVD-based approximation errors of the form $\|\mathcal{X} - \bar{\mathcal{X}}_0\|$ (see, e.g., [241]). In particular, we prove the following result.

Theorem 1. Suppose that (4.30) holds. Then the first-scale approximation error given by MS-HoSVD in (4.27) is bounded by

$$\|\mathcal{W}_1\|^2 = \left\| \mathcal{X} - \hat{\mathcal{X}}_0 - \hat{\mathcal{X}}_1 \right\|^2 \leq \sum_{n=1}^{N} \left\| \mathcal{X} \times_n \left( I - \bar{P}^{(n)} \right) \right\|^2,$$

where the $\bar{P}^{(n)}$ are low-rank projection matrices of rank $\bar{r}_n = \tilde{r}_n = r_H$ obtained from the truncated HoSVD of $\mathcal{X}$ as per (4.29).

Proof. See Appendix.

Theorem 1 implies that $\|\mathcal{W}_1\|$ may be less than $\|\mathcal{X} - \bar{\mathcal{X}}_0\|$ when (4.30) holds. It does not, however, actually prove that $\|\mathcal{W}_1\| \leq \|\mathcal{X} - \bar{\mathcal{X}}_0\|$ holds whenever (4.30) does. In fact, directly proving that $\|\mathcal{W}_1\| \leq \|\mathcal{X} - \bar{\mathcal{X}}_0\|$ whenever (4.30) holds does not appear to be easy. It also does not appear to be easy to prove the error bound in Theorem 1 without an assumption along the lines of (4.30), which simultaneously controls both (i) how well the restriction matrices utilized to partition $\mathcal{W}_0$ in (4.21) are chosen, as well as (ii) how poorly (worst case) restriction matrices interact with the projection matrices used to create standard HoSVD-based approximations of $\mathcal{W}_0$ and/or $\mathcal{X}$. The development of simpler and/or weaker conditions than (4.30) which still yield meaningful error guarantees along the lines of Theorem 1 is left as future work.

Considering condition (4.30) above, we note that experiments show it is regularly satisfied on real datasets when (i) the effective restriction matrices $\{R_k^{(n)}\}_{k=1}^{K}$ in (4.20)-(4.22) are first formed by clustering the rows of each unfolding of $\mathcal{W}_0$ using, e.g., local subspace analysis (LSA),
We refer the reader to the strong empirical performance of MS-HoSVD in Section 4.4 for additional evidence supporting the utility of (4.27) as a means of improving the compression performance of standard HoSVD-based compression techniques. In addition, we further refer the reader to Section 4.5, where it is empirically demonstrated that MS-HoSVD is also capable of selecting more informative features than HoSVD-based methods for the purposes of classification. These two facts together provide strong evidence that combining the use of clustering-enhanced multiscale approximation with existing tensor factorization techniques can lead to improved performance in multiple application domains.

4.3.4 Adaptive Pruning in Multiscale HoSVD for Improved Performance

In order to better capture the local structure of the tensor, it is important to look at higher scale decompositions. However, as the scale increases, the storage cost and computational complexity increase, making any gain in reconstruction error potentially not worth the additional memory cost. For this reason, it is important to select the subtensors adaptively at higher scales. To help avoid redundancy in the decomposition structure, we propose an adaptive pruning method across scales. In adaptive pruning, the tree is pruned by minimizing the cost function $H = \text{Error} + \lambda \cdot \text{Compression}$, similar to the rate-distortion criterion commonly used by compression algorithms, where $\lambda$ is the trade-off parameter [247]. To minimize this function we employ a greedy procedure similar to sequential forward selection [248]. First, the root node which stores $\hat{\mathcal{X}}_0$ is created, and the scale-1 subtensors $\hat{\mathcal{X}}_{1,k}$ are obtained from the 0th order residual tensor $\mathcal{W}_0$ as discussed in Section 4.3. These subtensors are stored in a list, and the subtensor which decreases the cost function the most is then added to the tree structure under its parent node. Next, the scale-2 subtensors belonging to the added node are created and added to the list. All of the scale-1 and scale-2 subtensors in the list are again evaluated to find the subtensor that minimizes the cost function. This procedure is repeated until the cost function $H$ converges or the decrease is minimal. Pseudocode for the algorithm is given in Algorithm 4.2, and a brief code sketch of the greedy selection loop follows it below. It is important to note that this algorithm is suboptimal, similar to other greedy search methods.

Algorithm 4.2 Multiscale HoSVD with Adaptive Pruning
1: Input: $\mathcal{X}$: tensor; $C = (c_1, c_2, \ldots, c_N)$: the desired number of clusters for each mode; $s_H$: the highest scale of MS-HoSVD.
2: Output: $T$: tree structure containing the MS-HoSVD decomposition of $\hat{\mathcal{X}}$.
3: Create an empty tree $T$.
4: Create an empty list $L$.
5: Add a node containing $\mathcal{X}$ to $L$.
6: while there is a node in $L$ that decreases the cost function $H(T)$ do
7:   Find the node corresponding to $\mathcal{X}_{s,t}$ (the $t$th subtensor from the $s$th scale) in the list $L$ that decreases $H$ the most, where $s \in \{0, \ldots, s_H\}$ and $t \in \{1, \ldots, K^s\}$.
8:   $\{\mathcal{C}_{s,t}, \hat{U}_{s,t}^{(n)}\} \leftarrow$ truncatedHOSVD($\mathcal{X}_{s,t}$).
9:   Add the node containing $\{\mathcal{C}_{s,t}, \hat{U}_{s,t}^{(n)}\}$ to $T$.
10:  if $s < s_H$ then
11:    Compute $\mathcal{W}_{s,t} = \mathcal{X}_{s,t} - \hat{\mathcal{X}}_{s,t}$.
12:    Create $K$ subtensors $\mathcal{X}_{s+1, K(t-1)+k}$ with index sets $J_{s+1, K(t-1)+k}^{n}$ from $\mathcal{W}_{s,t}$, where $k \in \{1, 2, \ldots, K\}$ and $n \in \{1, 2, \ldots, N\}$.
13:    Add the $K$ nodes containing $\{\mathcal{X}_{s+1, K(t-1)+k}, J_{s+1, K(t-1)+k}^{n}\}$ to $L$.
14:  end if
15: end while
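As a rough illustration of the greedy selection in Algorithm 4.2, the sketch below assumes the change in reconstruction error and in storage cost contributed by each candidate node has been precomputed as a pair (delta_error, delta_bits); the expansion of a selected node into its K children at the next scale (lines 10–14 of Algorithm 4.2) is elided for brevity, so treat this as a schematic of the loop rather than the thesis' implementation.

import numpy as np

def greedy_prune(err, bits, candidates, lam):
    # err, bits: error and storage of the root (0th scale) approximation;
    # candidates: list of (delta_error, delta_bits) pairs, one per subtensor node;
    # lam: trade-off parameter lambda in H = error + lam * compression.
    tree = []
    H = err + lam * bits
    while candidates:
        # Evaluate H for every candidate still in the list.
        scores = [err + de + lam * (bits + db) for de, db in candidates]
        best = int(np.argmin(scores))
        if scores[best] >= H:          # no remaining node decreases H: stop
            break
        de, db = candidates.pop(best)  # move the best node from the list to the tree
        tree.append((de, db))
        err, bits, H = err + de, bits + db, scores[best]
    return tree, err, bits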
4.4 Data Reduction

In this section, we demonstrate the performance of MS-HoSVD for tensor-type data representation on 3-mode and 4-mode real datasets, compared with three other tensor decompositions: HoSVD, H-Tucker and T-Train. The performance of the tensor decomposition methods is evaluated in terms of reconstruction error and compression rate. In the tables and figures below, the error rate refers to the normalized tensor approximation error $\frac{\|\mathcal{X} - \hat{\mathcal{X}}\|_F}{\|\mathcal{X}\|_F}$, and the compression rate is computed as $\frac{\#\text{ total bits to store } \hat{\mathcal{X}}}{\#\text{ total bits to store } \mathcal{X}}$. Moreover, we show the performance of the proposed adaptive tree pruning strategy for data reduction.

4.4.1 Datasets

4.4.1.1 PIE Dataset

A 3-mode tensor $\mathcal{X} \in \mathbb{R}^{244 \times 320 \times 138}$ is created from the PIE dataset [249]. The tensor contains 138 images from 6 different yaw angles and varying illumination conditions collected from a subject, where each image is converted to gray scale. Fig. 4.1 illustrates images from different frames of the PIE dataset.

Figure 4.1: Sample frames from the PIE dataset corresponding to the 30th (left) and 80th (right) frames.

4.4.1.2 COIL-100 Dataset

The COIL-100 database contains 7200 images collected from 100 objects, where the images of each object were taken at pose intervals of 5°. A 4-mode tensor $\mathcal{X} \in \mathbb{R}^{128 \times 128 \times 72 \times 100}$ is created from the COIL-100 dataset [250]. The constructed 4-mode tensor contains 72 images of size 128 × 128 for each of the 100 objects, where each image is converted to gray scale. In Fig. 4.2, sample images of four objects taken from different angles can be seen.

Figure 4.2: Image samples of four different objects from the COIL-100 dataset at varying pose angles (from 0° to 240° in 60° increments).

4.4.1.3 The Cambridge Hand Gesture Dataset

The Cambridge hand gesture database consists of 900 image sequences from nine gesture classes defined by three primitive hand shapes and three primitive motions, where each class contains 100 image sequences (5 different illuminations × 10 arbitrary motions × 2 subjects). In Fig. 4.3, sample image sequences collected for the nine hand gestures can be seen. The created 4-mode tensor $\mathcal{X} \in \mathbb{R}^{60 \times 80 \times 30 \times 900}$ contains 900 image sequences of size 60 × 80 × 30, where each image is converted to gray scale.

Figure 4.3: Illustration of the nine different classes in the Cambridge Hand Gesture Dataset.

4.4.2 Data Reduction Experiments

In this section, we evaluate the performance of MS-HoSVD for 1- and 2-scale decompositions compared to the HoSVD, H-Tucker and T-Train decompositions. In the following experiments, tensor partitioning is performed by LSA, and the cluster number along each mode is chosen as $c_i = 2$. The rank used in HoSVD is selected adaptively using the energy criterion as described in Section 4.3. In our experiments, we performed MS-HoSVD with $\tau = 0.7$ and $\tau = 0.75$, and we kept $\tau$ the same for each scale.
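The two evaluation quantities defined at the beginning of this section, together with the energy criterion used for adaptive rank selection, can be sketched as follows; these are illustrative definitions only, and the exact normalization used in the thesis' experiments may differ.

import numpy as np

def normalized_error(X, X_hat):
    # ||X - X_hat||_F / ||X||_F over the flattened tensors.
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

def energy_rank(X, n, tau):
    # Smallest rank whose leading singular values of the mode-n unfolding
    # capture at least a tau fraction of the total energy.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    s = np.linalg.svd(Xn, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, tau) + 1)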
For the same compression rates as MS-HoSVD, the reconstruction errors of the HoSVD, H-Tucker and T-Train models are computed. Fig. 4.4 explores the interplay between compression rate and approximation error for MS-HoSVD in comparison to HoSVD, H-Tucker and T-Train for the PIE, COIL-100 and hand gesture datasets. Starting from the left in Figs. 4.4(a), 4.4(b) and 4.4(c), the first two compression rates correspond to 1-scale MS-HoSVD with $\tau = 0.7$ and $\tau = 0.75$, respectively, while the last two are obtained from the 2-scale approximation with $\tau = 0.7$ and $\tau = 0.75$, respectively. As seen in Fig. 4.4, MS-HoSVD outperforms the other approaches in reducing the PIE, COIL-100 and hand gesture tensors at varying compression rates. Moreover, adding the 2nd scale increases the storage requirements while decreasing the error of MS-HoSVD. Fig. 4.5 illustrates the influence of scale on the visual quality of the reconstructed images. As expected, introducing additional finer scales into a multiscale approximation of video data improves the image detail in each frame. Moreover, the data reduction performance of T-Train is seen to be better than that of H-Tucker in most of the experiments.

Figure 4.4: Compression rate versus normalized reconstruction error for MS-HoSVD (dark blue), HoSVD (light blue), H-Tucker (green) and T-Train (yellow) for (a) PIE, (b) COIL-100 and (c) Hand Gesture datasets. Starting from the left in (a), (b) and (c), the first two compression rates correspond to 1-scale MS-HoSVD with τ = 0.7 and τ = 0.75, while the last two are obtained from the 2-scale approximation with τ = 0.7 and τ = 0.75, respectively. MS-HoSVD provides lower error than HoSVD, H-Tucker and T-Train.

Figure 4.5: A single frame of the PIE dataset showing increasing accuracy with scale (original frame; 0th scale; 0th + 1st scales; 0th + 1st + 2nd scales).

4.4.3 Data Reduction with Adaptive Tree Pruning

In this section, we evaluate the performance of adaptively pruned multiscale decompositions. In the pruning experiments, clustering is performed by LSA, and the cluster number along each mode is chosen as $c_i = 2$. The rank used in HoSVD is selected adaptively based on the energy criterion with a threshold of 0.7. A pruned version of MS-HoSVD with 2-scale analysis that minimizes the cost function $H = \text{Error} + \lambda \cdot \text{Compression}$ is implemented for the PIE, COIL-100 and Hand Gesture datasets with varying $\lambda$ values, as reported in Tables 4.1, 4.2 and 4.3. As $\lambda$ increases, reducing the compression rate becomes more important and the algorithm prunes the leaf nodes more aggressively. For example, a choice of $\lambda = 0.75$ prunes all of the nodes corresponding to the second scale subtensors for the PIE data (see Table 4.1). As can be seen from Tables 4.1, 4.2 and 4.3, the optimal tradeoff between reconstruction error and compression rate is achieved at different $\lambda$ values for different datasets.
For example, for the PIE data, increasing the $\lambda$ value does not change the reconstruction error much while improving the compression rate. For COIL-100, $\lambda = 0.75$ provides a good tradeoff between reconstruction error and compression rate, while for the hand gesture data small changes in $\lambda$ have significant effects on the pruning of the subtensors of the 2-scale decomposition. Fig. 4.6 illustrates the performance of the pruning algorithm on the PIE dataset. Applying pruning with $\lambda = 0.25$ increases the reconstruction error from 0.0276 to 0.0506 while reducing the compression rate by a factor of 4 (Table 4.1). As seen in Fig. 4.6, the 2nd scale approximation obtained by the adaptive pruning algorithm preserves most of the facial details in the image.

Table 4.1: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the PIE data.

λ      Normalized error   Compression   Scales of subtensors
0      0.0276             0.1241        0+1+2
0.22   0.0395             0.0809        0+1+2
0.25   0.0506             0.0377        0+1+2
0.30   0.0530             0.0284        0+1+2
0.75   0.0540             0.0261        0+1

Table 4.2: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the COIL-100 dataset.

λ      Normalized error   Compression   Scales of subtensors
0      0.0857             0.0863        0+1+2
0.25   0.0867             0.0840        0+1+2
0.50   0.0913             0.0734        0+1+2
0.75   0.1060             0.0526        0+1+2
0.80   0.1207             0.0347        0+1+2

Table 4.3: Reconstruction error and compression rate computed for the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the Hand Gesture dataset.

λ      Normalized error   Compression   Scales of subtensors
0      0.0691             0.1694        0+1+2
0.25   0.0869             0.1056        0+1+2
0.26   0.0913             0.0827        0+1+2
0.27   0.0946             0.0698        0+1+2
0.28   0.0999             0.0514        0+1

Figure 4.6: Sample PIE frames reconstructed from the pruned tree structure obtained by applying MS-HoSVD with 2 scales to the PIE dataset. Top-left: reconstruction using only the 0th scale; top-right: reconstruction using the 0th and 1st scales; bottom-left: reconstruction using the 2-scale approximation with all of the subtensors; bottom-right: reconstruction from the 2-scale analysis with the pruning approach, where λ = 0.25.

The performance of the pruning algorithm reported in Tables 4.1, 4.2 and 4.3 is also compared with the HoSVD, H-Tucker and T-Train decompositions in Fig. 4.7. As seen in Figs. 4.7(b) and (c), MS-HoSVD outperforms the other approaches for compressing the COIL-100 and Hand Gesture datasets at varying compression rates. However, for the PIE data, the performances of MS-HoSVD and HoSVD are very close to each other, while both approaches outperform H-Tucker and T-Train, as can be seen in Fig. 4.7(a). In Fig. 4.8, sample frames of the PIE data reconstructed by T-Train (top-left), H-Tucker (top-right), HoSVD (bottom-left) and pruned MS-HoSVD with 2 scales (bottom-right) are shown. It can easily be seen that the images reconstructed by H-Tucker and T-Train are more blurred than those obtained by HoSVD and MS-HoSVD. One can also see that the facial details captured by MS-HoSVD are clearer than those captured by HoSVD, although the performances of the two algorithms are very similar to each other. The reason MS-HoSVD captures facial details better is that the higher scale subtensors encode these details.
Figure 4.7: Compression rate versus normalized reconstruction error for MS-HoSVD with adaptive pruning (dark blue), HoSVD (light blue), H-Tucker (green) and T-Train (yellow) for (a) PIE, (b) COIL-100 and (c) Hand Gesture datasets. The 2-scale MS-HoSVD tensor approximations are obtained using τ = 0.7 for each scale and varying pruning trade-off parameter λ.

Figure 4.8: Reconstructed frame samples from the PIE data compressed by T-Train (top-left), H-Tucker (top-right), HoSVD (bottom-left) and pruned MS-HoSVD with 2 scales (bottom-right). The 2-scale MS-HoSVD tensor approximation is obtained using τ = 0.7 for each scale and λ = 0.25.

4.5 Feature Extraction and Classification

In this section, we evaluate the features extracted by MS-HoSVD for the classification of 2-mode and 3-mode tensor samples containing object images and hand gesture videos. The discrimination power and classification accuracy of the MS-HoSVD features are compared to those of the features extracted by HoSVD and T-Train.

4.5.1 COIL-100 Image Dataset

For computational efficiency, each image was downsampled to a gray-scale image of 32 × 32 pixels. The number of images per object used for training was gradually increased from 18 to 54, with the training images selected randomly. A 3-mode tensor $\mathcal{X}_{tr} \in \mathbb{R}^{32 \times 32 \times I_{tr}}$ is constructed from the training images, where $I_{tr} = 100 \times \{18, 36, 54\}$, and the rest of the images are used to create the testing tensor $\mathcal{X}_{te} \in \mathbb{R}^{32 \times 32 \times I_{te}}$, where $I_{te} = 7200 - I_{tr}$.

4.5.2 The Cambridge Hand Gesture Dataset

For computational efficiency, each image was downsampled to a gray-scale image of 30 × 40 pixels. The number of image sequences used for training was gradually increased from 25 to 75 per gesture, with the training sequences selected randomly. A 4-mode tensor $\mathcal{X}_{tr} \in \mathbb{R}^{30 \times 40 \times 30 \times I_{tr}}$ is constructed from the training image sequences, where $I_{tr} = 9 \times \{25, 50, 75\}$, and the rest of the image sequences are used to create the testing tensor $\mathcal{X}_{te} \in \mathbb{R}^{30 \times 40 \times 30 \times I_{te}}$, where $I_{te} = 900 - I_{tr}$.

4.5.3 Classification Experiments

4.5.3.1 Training

For MS-HoSVD, the training tensor $\mathcal{X}_{tr}$ is decomposed using 1-scale MS-HoSVD as follows. Tensor partitioning is performed by LSA, and the cluster numbers along the modes are chosen as $c = \{2, 3, 1\}$, yielding 6 subtensors, for the COIL-100 dataset and as $c = \{2, 2, 3, 1\}$, yielding 12 subtensors, for the hand gesture dataset. We did not partition the tensor along the last mode, which corresponds to the classes, in order to make the comparison with the other methods fair. The rank used at the 0th scale is selected adaptively using the energy criterion with $\tau = 0.7$, while a full rank decomposition is used for the 1st scale. Decomposing the training data $\mathcal{X}_{tr}$ by MS-HoSVD as
$$\mathcal{X}_{j,k}^{tr} = \mathcal{C}_{j,k}^{tr} \times_1 \hat{U}_{j,k}^{tr,(1)} \times_2 \hat{U}_{j,k}^{tr,(2)} \cdots \times_N \hat{U}_{j,k}^{tr,(N)} \qquad (4.31)$$
provides core tensors $\mathcal{C}_{j,k}^{tr}$ and factor matrices $\hat{U}_{j,k}^{tr,(i)}$ for the $j$th scale and $k$th subtensor. The feature tensor $\mathcal{S}_{j,k}^{tr}$ for the training data is then created by projecting each $\mathcal{X}_{j,k}^{tr}$ onto its first $N-1$ factor matrices $\hat{U}_{j,k}^{tr,(i)}$ as:
$$\mathcal{S}_{j,k}^{tr} = \mathcal{X}_{j,k}^{tr} \times_1 \big(\hat{U}_{j,k}^{tr,(1)}\big)^\top \times_2 \big(\hat{U}_{j,k}^{tr,(2)}\big)^\top \cdots \times_{N-1} \big(\hat{U}_{j,k}^{tr,(N-1)}\big)^\top. \qquad (4.32)$$
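A compact sketch of the feature construction in (4.31)–(4.32), together with the unfolding along the sample mode described next, is given below; the helper and variable names are ours, and the factor matrices are assumed to have orthonormal columns as produced by the HoSVD.

import numpy as np

def mode_n_product(X, M, n):
    # X x_n M: contract mode n of X against the columns of M.
    return np.moveaxis(np.tensordot(M, X, axes=(1, n)), 0, n)

def subtensor_features(X_k, factors):
    # X_k: subtensor with samples on its last mode; factors: [U^(1), ..., U^(N-1)].
    S = X_k
    for n, U in enumerate(factors):
        S = mode_n_product(S, U.T, n)  # S = X_k x_1 U1^T x_2 ... x_{N-1} U_{N-1}^T
    # Unfold along the sample mode so that every sample yields one row.
    return np.moveaxis(S, -1, 0).reshape(X_k.shape[-1], -1)

# Concatenating the rows across subtensors yields the high-dimensional vectors:
# F = np.hstack([subtensor_features(Xk, Uk) for Xk, Uk in zip(subtensors, factor_sets)])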
Unfolding the feature tensors $\mathcal{S}_{j,k}^{tr}$ along the sample mode $N$ and concatenating them with each other yields high dimensional feature vectors for the training samples. From these vectors, the $N_f$ features with the highest Fisher score [251] are selected to form the lower-dimensional feature vectors $x_{tr} \in \mathbb{R}^{N_f \times 1}$ for each training sample, where the number of features $N_f$ is determined by maximizing the discrimination score. For HoSVD and T-Train, full rank decompositions are computed and feature vectors are created by selecting the $N_f$ features with the highest Fisher score from the core tensors, as described above.

4.5.3.2 Testing

To create feature vectors from the testing samples, the testing tensor $\mathcal{X}_{te}$ is first projected onto the $\hat{U}_{j,k}^{tr,(i)}$, where $i \in [N-1]$:
$$\mathcal{S}_{j,k}^{te} = \mathcal{X}_{j,k}^{te} \times_1 \big(\hat{U}_{j,k}^{tr,(1)}\big)^\top \times_2 \big(\hat{U}_{j,k}^{tr,(2)}\big)^\top \cdots \times_{N-1} \big(\hat{U}_{j,k}^{tr,(N-1)}\big)^\top. \qquad (4.33)$$
Similar to the training step, unfolding the feature tensors $\mathcal{S}_{j,k}^{te}$ along the sample mode $N$ and concatenating them with each other yields high dimensional feature vectors for the testing samples. The features corresponding to those selected in the training step are used to form the feature vectors $x_{te} \in \mathbb{R}^{N_f \times 1}$ for the testing samples. Once the feature vectors are obtained, 1-NN with the Euclidean distance is used to classify the test samples. A similar two-step procedure is used to create testing feature vectors for HoSVD and T-Train: first, the testing tensor is projected onto the factor matrices obtained from the training data; then, $N_f$ features are selected to create the feature vectors.

As seen in Tables 4.4 and 4.5, the features obtained from MS-HoSVD have a greater Fisher score on average and classify the images better than both HoSVD and T-Train. It is also seen that the performances of HoSVD, T-Train and MS-HoSVD become closer to each other as the size of the training dataset increases, as expected. The reason behind the improved performance of MS-HoSVD is that it captures the variations and nonlinearities across the modes, such as rotation or translation, better than the other methods. In both of the datasets used in this section, the images are rotated across the different frames. Since these nonlinearities are encoded in the higher scale (1st scale) features, while the average characteristics, which are the same as those of HoSVD, are captured by the lower scale (0th scale) MS-HoSVD features, the classification performance of MS-HoSVD is slightly better than that of HoSVD. At the same time, MS-HoSVD features have a much higher Fisher score than HoSVD features on average, which indicates that classification using 1-NN with the Euclidean distance may not fully exploit this difference in discrimination power, since it treats all of the features obtained from the 0th and 1st scales equally. It is also seen that T-Train features are not as good as MS-HoSVD and HoSVD features at capturing rotations and translations in the data, and T-Train requires a larger training set to reach the performance of MS-HoSVD and HoSVD.
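For illustration, the Fisher-score ranking and 1-NN classification steps described above can be sketched as follows, assuming one common form of the Fisher score; the thesis follows the definition in [251], which may differ in normalization.

import numpy as np

def fisher_scores(F, y):
    # F: (n_samples, n_features); y: integer class labels.
    mu = F.mean(axis=0)
    num = np.zeros(F.shape[1])
    den = np.zeros(F.shape[1])
    for c in np.unique(y):
        Fc = F[y == c]
        num += len(Fc) * (Fc.mean(axis=0) - mu) ** 2   # between-class scatter
        den += len(Fc) * Fc.var(axis=0)                # within-class scatter
    return num / np.maximum(den, 1e-12)

def one_nn(F_train, y_train, F_test):
    # Euclidean 1-nearest-neighbour label for every test row.
    d2 = ((F_test[:, None, :] - F_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[np.argmin(d2, axis=1)]

# Keep the N_f highest-scoring training features, then classify the test samples:
# idx = np.argsort(fisher_scores(F_tr, y_tr))[::-1][:N_f]
# y_hat = one_nn(F_tr[:, idx], y_tr, F_te[:, idx])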
Table 4.4: 1-NN classification results for the COIL-100 dataset over 20 trials with $N_f = 100$.

Training Size   Method      Accuracy mean ± std   FS (×10³) mean ± std
25%             MS-HoSVD    93.71 ± 1.28          1.11 ± 0.38
25%             HoSVD       93.07 ± 1.33          0.25 ± 0.03
25%             T-Train     92.29 ± 2.23          0.26 ± 0.02
50%             MS-HoSVD    97.41 ± 0.69          1.01 ± 0.38
50%             HoSVD       97.10 ± 0.83          0.24 ± 0.02
50%             T-Train     96.99 ± 1.09          0.24 ± 0.02
75%             MS-HoSVD    98.38 ± 0.65          0.97 ± 0.34
75%             HoSVD       98.16 ± 0.72          0.26 ± 0.01
75%             T-Train     98.25 ± 0.41          0.25 ± 0.01

Table 4.5: 1-NN classification results for the hand gesture dataset over 20 trials with $N_f = 200$.

Training Size   Method      Accuracy mean ± std   FS (×10³) mean ± std
25%             MS-HoSVD    75.40 ± 3.87          6.37 ± 1.35
25%             HoSVD       75.01 ± 3.99          5.51 ± 0.50
25%             T-Train     69.20 ± 2.63          7.55 ± 0.32
50%             MS-HoSVD    83.86 ± 3.12          6.04 ± 1.68
50%             HoSVD       83.15 ± 2.90          4.01 ± 0.21
50%             T-Train     78.97 ± 2.25          5.64 ± 0.36
75%             MS-HoSVD    87.47 ± 2.07          5.14 ± 1.57
75%             HoSVD       86.93 ± 2.31          3.68 ± 0.17
75%             T-Train     85.64 ± 2.57          4.59 ± 0.21

4.6 Applications to fMRI

Advances in information technology are making it possible to collect increasingly massive amounts of multidimensional, multi-modal neuroimaging data such as functional magnetic resonance imaging (fMRI) data. Current fMRI datasets involve multiple variables, including multiple subjects as well as both temporal and spatial data. These high dimensional datasets pose a challenge to the signal processing community to develop data reduction methods that can exploit their rich structure and extract meaningful summarizations. In this section, we demonstrate the performance of MS-HoSVD for compressing a 4-way tensor containing fMRI volume sequences, compared with HoSVD and the 4-D Wavelet transform.

4.6.1 Data Description and Preprocessing

The data used in this section are obtained from the 1000 Functional Connectomes Project [252], which has aggregated previously collected test-retest imaging datasets from more than 36 labs around the world. The dataset acquired from this project is referred to as Bangor and contains open-eye resting state fMRI scans of 20 male participants aged between 19 and 38 (magnet: 3T, TR = 2, 34 slices, 265 time points). The data were preprocessed using the CONN functional connectivity toolbox [253]. First, structural images were co-registered to the mean functional image for each subject and normalized to MNI space. Then, slice timing correction and motion correction were performed for the functional images. The functional images were warped to the Talairach Daemon atlas [254] provided by the CONN toolbox and smoothed with a 4-mm FWHM Gaussian kernel. Confounds such as the motion parameters obtained from realignment and the BOLD signals obtained from white matter and CSF masks were regressed out, and band-pass (0.008–0.09 Hz) temporal filtering was applied to the functional images of each subject. After preprocessing, the fMRI dataset can be represented as a 4-mode tensor $\mathcal{X}_m \in \mathbb{R}^{109 \times 91 \times 91 \times 265}$ for each subject $m \in \{1, 2, \ldots, 20\}$, where the first three modes correspond to the preprocessed volume data and the fourth mode to time.

4.6.2 Results

In this section, we evaluate the performance of MS-HoSVD in comparison to HoSVD and the 4-D Wavelet transform, in terms of both compression and error rate, on the fMRI data. The tensors $\mathcal{X}_m$ obtained from preprocessing are decomposed using MS-HoSVD, HoSVD and the wavelet transform, yielding $\hat{\mathcal{X}}_m^{MS}$, $\hat{\mathcal{X}}_m^{HO}$ and $\hat{\mathcal{X}}_m^{W}$, respectively. In the following experiments, clustering is performed by local subspace analysis (LSA) [244], and the number of clusters along each mode is chosen as $c_i = 4$. The rank used in the truncated HoSVD is selected adaptively using the energy criterion. The energy criterion determines the minimum number of singular values kept during the SVD of the unfolded tensors along each mode such that the cumulative energy is above a certain threshold. For MS-HoSVD, the energy thresholds are selected as 0.7 and 0.95 for the SVDs computed at the 0th and 1st scales, respectively.
For HoSVD, the energy threshold was increased gradually from 0.990 to 0.999 with a step size of 0.0005, in order to compare the reconstruction error at similar compression ratios (experiment 1) and the compression ratio at similar error rates (experiment 2). For the 4-D Wavelet compression, a 2-scale 1-D temporal wavelet transform followed by a 2-scale 3-D spatial wavelet transform with Db3 wavelet functions was applied, and the most significant wavelet coefficients were retained so as to obtain a compression ratio close to that of MS-HoSVD. In Table 4.6, the error rate refers to the normalized tensor approximation error $\frac{\|\mathcal{X} - \hat{\mathcal{X}}\|_F}{\|\mathcal{X}\|_F}$, and the compression ratio is computed as $\frac{\#\text{ total bits to store } \mathcal{X}}{\#\text{ total bits to store } \hat{\mathcal{X}}}$. As seen in Table 4.6, MS-HoSVD provides reduced error (experiment 1) and better compression than HoSVD (experiment 2) over the 20 subjects. MS-HoSVD also provides a smaller reconstruction error than the 4-D Wavelet transform (Table 4.6).

Table 4.6: Average compression ratio (mean ± st. dev.) and reconstruction error (mean ± st. dev.) obtained by MS-HoSVD, HoSVD and 4-D Wavelet over 20 subjects.

Method          Compression Ratio    Reconstruction Error
MS-HoSVD        10.3275 ± 0.6287     0.0231 ± 0.0018
HoSVD (Exp-1)   10.3736 ± 0.5955     0.0404 ± 0.0061
HoSVD (Exp-2)   8.3068 ± 0.6427      0.0228 ± 0.0040
4D-Wavelet      10.3456 ± 0.1605     0.0493 ± 0.0026

Once the low-rank approximations are obtained, mean ROI signals $Y_m, Y_m^{MS}, Y_m^{HO}, Y_m^{W} \in \mathbb{R}^{88 \times 265}$ corresponding to $\mathcal{X}_m$, $\hat{\mathcal{X}}_m^{MS}$, $\hat{\mathcal{X}}_m^{HO}$ and $\hat{\mathcal{X}}_m^{W}$ are computed for each subject using the Talairach Daemon atlas. Connectivity networks $A_m, A_m^{MS}, A_m^{HO}, A_m^{W} \in \mathbb{R}^{88 \times 88}$ for each subject $m$ are constructed by computing the correlation coefficients between all pairs of ROIs in $Y_m$, $Y_m^{MS}$, $Y_m^{HO}$ and $Y_m^{W}$, respectively. Significant connections ($p \leq 0.01$, Bonferroni corrected) for each method were determined by performing t-tests for each edge of the connectivity matrices over subjects. Table 4.7 shows the miss and false alarm rates for the connectivity networks constructed using MS-HoSVD, HoSVD and 4-D Wavelet in comparison to the original network. As can be seen from Table 4.7, we obtain lower error rates for MS-HoSVD than for HoSVD and Wavelet in all of the experiments.

Table 4.7: Comparison of the probability of miss ($P_{Miss}$) and probability of false alarm ($P_{FA}$) obtained by MS-HoSVD, HoSVD and 4-D Wavelet.

Method          P_Miss    P_FA
MS-HoSVD        0.0020    0
HoSVD (Exp-1)   0.0033    0.0006
HoSVD (Exp-2)   0.0031    0
4D-Wavelet      0.0023    0
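The network-comparison pipeline just described can be summarized by the following sketch, which assumes the mean ROI time series have already been extracted; it forms one correlation matrix per subject and applies a one-sample t-test per edge with a Bonferroni-corrected threshold. Function names and the choice of a one-sample test against zero are our illustrative assumptions.

import numpy as np
from scipy import stats

def connectivity(Y):
    # Y: (n_rois, n_timepoints) -> (n_rois, n_rois) correlation matrix.
    return np.corrcoef(Y)

def significant_edges(A_stack, alpha=0.01):
    # A_stack: (n_subjects, n_rois, n_rois) stack of connectivity matrices.
    n_rois = A_stack.shape[1]
    n_edges = n_rois * (n_rois - 1) // 2       # unique off-diagonal edges
    _, p = stats.ttest_1samp(A_stack, 0.0, axis=0)
    sig = p <= alpha / n_edges                 # Bonferroni correction
    np.fill_diagonal(sig, False)               # self-connections are not edges
    return sig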
4.7 Conclusions

In this chapter, we proposed a new multi-scale tensor decomposition technique for better approximating the local nonlinearities in generic tensor data. The proposed approach constructs a tree structure by considering similarities along the different fibers of the tensor and decomposes the tensor into lower dimensional subtensors hierarchically. A low-rank approximation of each subtensor is then obtained by HoSVD. We also introduced a pruning strategy to find the optimal tree structure by keeping the important nodes and eliminating redundancy in the data. The proposed approach was applied to a set of 3-way and 4-way tensors to evaluate its performance in both data reduction and classification applications. Future work will consider the automatic selection of parameters such as the number of clusters and the appropriate rank along each mode. The computational efficiency of the proposed method can be improved through parallelization of the algorithm, such as parallel construction of the subtensors and parallel implementation of the HoSVD [255]. Such an efficient implementation would enable finer scale decompositions for higher order and higher dimensional tensors. The proposed algorithm currently constructs the tree structure based on decomposing the tensor using HoSVD; the same decomposition structure can also be implemented using other tensor decomposition methods, such as PARAFAC and tensor-train decompositions.

Chapter 5

Conclusions and Future Work

5.1 Conclusions

This thesis makes fundamental contributions to the data reduction of tensor-type data, with a particular focus on providing a better understanding of dynamic functional connectivity brain networks. In Chapter 2, we discuss the problem of tensor data reduction through clustering, where the tensor contains functional connectivity brain networks collected over time across multiple subjects, and we approach the problem as a multi-graph clustering problem. To determine the common community structure underlying the functional connectivity networks of all subjects, we propose a hierarchical consensus spectral clustering approach, FCCA. To obtain a hierarchical decomposition, a Fiedler vector based graph bi-partitioning method is applied first to the data and then repeatedly to its partitions. Multiple subjects are taken into account by first obtaining an initial bipartition of each subject's connectivity network and then iteratively partitioning the co-occurrence matrix across subjects. Furthermore, new information-theoretic cluster quality measures are introduced for selecting the optimal community structure as an alternative to the standard modularity measure. This measure depends on optimizing the tradeoff between maximizing the divergence between clusters and minimizing the entropy of the individual clusters. It is also shown that FCCA is computationally more efficient than standard consensus clustering approaches such as voting, and is more accurate than averaging in the presence of outliers and overlapping community structures. Moreover, the application of the proposed algorithm to functional connectivity networks constructed from EEG data during a study of the ERN produced clusters consistent with published work [178, 179], whereas voting and averaging methods failed to identify meaningful lateral and medial frontal communities.

In Chapter 3, we proposed an alternative way to reduce this high dimensional data through linear subspace estimation and update methods. In this approach, the dynamics of the connectivity networks are taken into account such that the data reduction is performed across subjects for time intervals determined by the subspace tracking approach. For this purpose, we introduced a new recursive low-rank + sparse structure learning algorithm for tensor-type data in order to track the modular structure of functional connectivity networks through EEG recordings across multiple subjects. This approach yields a robust estimate of the low-rank component of a dynamic 3-way tensor at each time point and provides a recursive way to update the subspace information. The proposed approach, Ho-RLSL, identifies the time points where the low-rank subspaces change and recursively updates these subspaces to improve the low-rank approximation of the data. The proposed approach is used to separate the low-rank and sparse parts of dFCNs across multiple subjects. The low-rank components of the dFCNs obtained for the detected time intervals are summarized using FCCA.
The low-rank subspace approximation to each time interval provides a denoised version of the original network, thus improving the quality of the clustering results. The results indicate a clear change in the community structure before and after the physiological response. Moreover, the detected community structures are in line with previous hypotheses regarding the networks involved in cognitive control [221].

In Chapter 4, we proposed a new tensor decomposition technique which better approximates the underlying nonlinear structure of tensors compared to HoSVD. The proposed method constructs data-dependent multiscale dictionaries to better represent the data. The multiscale structure of the proposed decomposition is obtained by constructing a tree structure based on the local similarities in the tensor. Based on these local similarities, the proposed method decomposes the tensor into lower dimensional subtensors hierarchically. The low-rank structure of each subtensor obtained by HoSVD provides a finer linear approximation for the points in the region of interest. Moreover, a pruning strategy which keeps the important nodes in the tree is introduced to find a suboptimal tree structure. The proposed approach is applied to 3-way and 4-way tensors containing simulated and real datasets to illustrate the improvement in compression performance compared to HoSVD and the Wavelet transform. In addition, it is shown that features obtained from the multiscale representation provide better classification performance than HoSVD features for tensors containing nonlinearities such as rotation or translation.

5.2 Future Work

There are still remaining challenges in the data reduction of high-dimensional and higher-order tensors. Some of these challenges and possible solutions include:

• Unsupervised model selection: Tensor based approaches have been widely used in a variety of fields, e.g., computer vision, data science and biomedical imaging, and identifying the most appropriate model, such as Tucker, PARAFAC or Tensor-Train, to exploit the multilinear structure of the data is one of the most important challenges. Bro et al. [256] have already addressed this issue and proposed an approach known as the core consistency diagnostic (CORCONDIA) to select between the Tucker and CP models. The core consistency metric takes a larger value if the Tucker core is close to a superdiagonal tensor, indicating that the CP model is more appropriate. However, this metric is limited to the Tucker and CP models. Therefore, there is a need to develop new metrics appropriate for other models, including Tensor-Train
and Hierarchical Tucker. Using such a metric may also enable us to select the appropriate model to use in the proposed multiscale tensor decomposition structure and may improve compression performance.

• Robust rank selection: Finding the appropriate rank for the selected model is another crucial problem for tensor based approaches, and robust rank selection algorithms directly improve the compression performance of the selected models. Several different approaches have already been proposed to identify the rank for both the CP and Tucker decompositions [257–264]. For the CP rank, Papalexakis et al. [259] proposed a fast and exact algorithm for CORCONDIA which scales with the tensor size, while [262] and [260] used Bayesian approaches. To identify the Tucker rank, Timmerman et al. [263] introduced the difference in fit (DIFFIT) procedure, which finds the combination of dimensions that gives the best fit by assigning different ranks to each mode. Similar to CP, Bayesian approaches have also been used to identify the Tucker rank [262]. Adapting these methods to Ho-RLSL to estimate the low-rank subspace dimension in a robust way may enhance the denoising performance of the proposed algorithm. Moreover, robust rank selection may also improve the compression performance of MS-HoSVD.

• Efficient algorithms: In order to overcome the curse of dimensionality emerging from higher-order data, developing scalable tools for tensor decomposition has become inevitable. Recently, many approaches have been developed to compute the Tucker and CP decompositions efficiently [255, 265–269]. Tsourakakis [265] showed that an accurate low-rank Tucker approximation of a tensor can be computed much faster from a sparsified version of the tensor. Similarly, Papalexakis et al. [267] proposed a fast and parallelizable method using random sampling techniques for speeding up the CP decomposition, which produces sparse outer product approximations for sparse tensors. Scalable and distributed implementations of the tensor mode-n product and the singular value decomposition, which are essential to both the CP and Tucker decompositions, have also been developed [246, 266]. More recently, Austin et al. [269] proposed the first distributed memory implementation of a parallel algorithm for computing a Tucker decomposition. Adapting these techniques to the implementations of the proposed Ho-RLSL and MS-HoSVD will surely reduce their time complexity.

APPENDIX

Proofs for Chapter 3

β_t Is Small

In this section, we prove that $\beta_t$ is small, so that eqn. (3.14) can be treated as a sparse recovery in noise problem. Define the subspace estimation error for the $i$th mode as
$$\mathrm{SE}\big(P^{(i)}, \hat{P}^{(i)}\big) := \big\| \big(I - \hat{P}^{(i)} \hat{P}^{(i)\top}\big) P^{(i)} \big\|_F = \epsilon_i, \qquad (1)$$
where $P^{(i)}$ and $\hat{P}^{(i)}$ are the true and estimated basis matrices of the $i$th mode, respectively. Write $P_j^{(i)} = \big[ P_{j-1}^{(i)} \;\; P_{j,new}^{(i)} \big]$, where $P_{j,new}^{(i)}$ is an $n_i \times c_{j,new}^{(i)}$ basis matrix with $\big(P_{j,new}^{(i)}\big)^\top P_{j-1}^{(i)} = 0$. The low-rank tensor at time $t$, $\mathcal{L}_t$, has components $\mathcal{A}_m$ with $m \in \{1, 2, \ldots, 8\}$, which are the projections of $\mathcal{L}_t$ onto both the previous subspaces $P_{j-1}^{(i)}$ and the new subspaces $P_{j,new}^{(i)}$. Therefore, $\mathcal{L}_t$ can be written as the sum of these components:
$$\mathcal{L}_t = \mathcal{A}_1 \times_1 P_{j-1}^{(1)} \times_2 P_{j-1}^{(2)} \times_3 P_{j-1}^{(3)} + \mathcal{A}_2 \times_1 P_{j,new}^{(1)} \times_2 P_{j-1}^{(2)} \times_3 P_{j-1}^{(3)} + \mathcal{A}_3 \times_1 P_{j-1}^{(1)} \times_2 P_{j,new}^{(2)} \times_3 P_{j-1}^{(3)}$$
$$+ \mathcal{A}_4 \times_1 P_{j-1}^{(1)} \times_2 P_{j-1}^{(2)} \times_3 P_{j,new}^{(3)} + \mathcal{A}_5 \times_1 P_{j,new}^{(1)} \times_2 P_{j,new}^{(2)} \times_3 P_{j-1}^{(3)} + \mathcal{A}_6 \times_1 P_{j,new}^{(1)} \times_2 P_{j-1}^{(2)} \times_3 P_{j,new}^{(3)}$$
$$+ \mathcal{A}_7 \times_1 P_{j-1}^{(1)} \times_2 P_{j,new}^{(2)} \times_3 P_{j,new}^{(3)} + \mathcal{A}_8 \times_1 P_{j,new}^{(1)} \times_2 P_{j,new}^{(2)} \times_3 P_{j,new}^{(3)}, \qquad (2)$$
where we refer to the old and new parts of the projection as $\mathcal{A}_{t,*} = \mathcal{A}_1$ and $\mathcal{A}_{t,new} = \mathcal{A}_8$, and
$$\mathcal{A}_1 = \mathcal{L}_t \times_1 P_{j-1}^{\top,(1)} \times_2 P_{j-1}^{\top,(2)} \times_3 P_{j-1}^{\top,(3)}, \quad \mathcal{A}_2 = \mathcal{L}_t \times_1 P_{j,new}^{\top,(1)} \times_2 P_{j-1}^{\top,(2)} \times_3 P_{j-1}^{\top,(3)}, \quad \ldots, \quad \mathcal{A}_8 = \mathcal{L}_t \times_1 P_{j,new}^{\top,(1)} \times_2 P_{j,new}^{\top,(2)} \times_3 P_{j,new}^{\top,(3)}. \qquad (3)$$

Assumptions:
1. Assume that the subspace estimation error satisfies $\| (I - \hat{P}^{(i)} \hat{P}^{(i)\top}) P^{(i)} \|_F \leq r_0^{(i)} \zeta$ for $\zeta \ll 1$.
2. Let $l_i^{(k)}$ be the $i$th column of $\mathcal{L}_{t,(k)}$ and assume that $\| l_i^{(k)} \| \leq \gamma_{*,k}$.
3. Define $\gamma_* = \min_{k \in \{1,2,3\}} \Big( \gamma_{*,k} \sqrt{\prod_{i=1, i \neq k}^{3} N_i} \Big)$ and assume that $\| \mathcal{L}_t \|_F \leq \gamma_*$.
4. Define $\gamma_{new} = \min_{k \in \{1,2,3\}} \Big( \gamma_{new,k} \sqrt{\prod_{i=1, i \neq k}^{3} N_i} \Big)$ and assume that $\gamma_{new} \ll \gamma_*$.

$\beta_t$ is defined as $\beta_t = \mathcal{L}_t \times_1 \phi_t^{(1)} \times_2 \phi_t^{(2)} \times_3 \phi_t^{(3)}$, where $\phi_t^{(i)} = I - \hat{P}_{t-1}^{(i)} \big(\hat{P}_{t-1}^{(i)}\big)^\top$, and its norm is bounded as follows:
$$\|\beta_t\|_F = \big\| \mathcal{L}_t \times_1 \phi_t^{(1)} \times_2 \phi_t^{(2)} \times_3 \phi_t^{(3)} \big\|_F \leq \epsilon_1 \epsilon_2 \epsilon_3 \|\mathcal{A}_{t,*}\|_F + \epsilon_2 \epsilon_3 \|\mathcal{A}_2\|_F + \epsilon_1 \epsilon_3 \|\mathcal{A}_3\|_F + \epsilon_1 \epsilon_2 \|\mathcal{A}_4\|_F + \epsilon_3 \|\mathcal{A}_5\|_F + \epsilon_2 \|\mathcal{A}_6\|_F + \epsilon_1 \|\mathcal{A}_7\|_F + \|\mathcal{A}_{t,new}\|_F. \qquad (4)$$

From Section 5.2, each cross term is bounded by $\gamma_*$ times the square root of the corresponding rank product:
$$\|\mathcal{A}_2\|_F \leq \gamma_* \sqrt{\mathrm{rank}\big(P_{j,new}^{(1)}\big)\,\mathrm{rank}\big(P_{j-1}^{(2)}\big)\,\mathrm{rank}\big(P_{j-1}^{(3)}\big)}, \;\; \ldots, \;\; \|\mathcal{A}_7\|_F \leq \gamma_* \sqrt{\mathrm{rank}\big(P_{j-1}^{(1)}\big)\,\mathrm{rank}\big(P_{j,new}^{(2)}\big)\,\mathrm{rank}\big(P_{j,new}^{(3)}\big)}. \qquad (5)$$

Let $\bar{N} = \max_i (N_i)$, $\bar{r}_J = \max_i \big(r_J^{(i)}\big)$ and $\bar{\gamma}_* = \max_k (\gamma_{*,k})$, where $\bar{\gamma}_* > \gamma_*$. Define $\bar{\gamma}_{new} = \max_k (\gamma_{new,k})$ and assume that $\bar{\gamma}_{new} \ll \bar{\gamma}_*$. Note that:

• First term:
$$\epsilon_1 \epsilon_2 \epsilon_3 \|\mathcal{A}_{t,*}\|_F \leq (\bar{r}_{j-1}\zeta)^3 \cdot \bar{N} \cdot \bar{\gamma}_* \leq (\bar{r}_{j-1}\zeta)^3 \cdot \bar{N} \cdot \frac{1}{\sqrt{\zeta \bar{r}_J}} \leq \bar{N} \cdot \zeta^{5/2} \cdot (\bar{r}_{j-1})^2. \qquad (6)$$

• Last term:
$$\|\mathcal{A}_{t,new}\|_F \leq \bar{N} \cdot \bar{\gamma}_{new}. \qquad (7)$$

• Cross terms:
$$\epsilon_2 \epsilon_3 \|\mathcal{A}_2\|_F \leq (\bar{r}_{j-1}\zeta)^2 \cdot \bar{\gamma}_* \sqrt{\bar{r}_{j,new}\, \bar{r}_{j-1}\, \bar{r}_{j-1}} \leq \bar{N} \cdot (\bar{r}_{j-1}\zeta)^2 \cdot \frac{1}{\sqrt{\zeta \bar{r}_J}} \cdot \bar{r}_{j-1} \sqrt{\bar{r}_{j,new}} \leq \bar{N} \cdot \zeta^{3/2} \cdot (\bar{r}_{j-1})^2 \cdot (\bar{r}_{j,new})^{1/2}, \qquad (8)$$
$$\epsilon_3 \|\mathcal{A}_5\|_F \leq (\bar{r}_{j-1}\zeta) \cdot \bar{\gamma}_* \sqrt{\bar{r}_{j,new}\, \bar{r}_{j,new}\, \bar{r}_{j-1}} \leq \bar{N} \cdot (\bar{r}_{j-1}\zeta) \cdot \frac{1}{\sqrt{\zeta \bar{r}_J}} \cdot \bar{r}_{j,new} \sqrt{\bar{r}_{j-1}} \leq \bar{N} \cdot \zeta^{1/2} \cdot (\bar{r}_{j-1})^{1/2} \cdot \bar{r}_{j,new}. \qquad (9)$$

Therefore,
$$\|\beta_t\|_F \leq \bar{N} \zeta^{5/2} (\bar{r}_{j-1})^2 + 3 \bar{N} \zeta^{3/2} (\bar{r}_{j-1})^2 (\bar{r}_{j,new})^{1/2} + 3 \bar{N} \zeta^{1/2} (\bar{r}_{j-1})^{1/2} \bar{r}_{j,new} + \bar{N} \bar{\gamma}_{new}. \qquad (10)$$

Since $\zeta$ is small and $\bar{\gamma}_*$ is large, the last term is dominant in the upper bound. Thus, $\beta_t$ can be treated as noise under the slow subspace change assumption $\|\gamma_{new}\|_F \ll \|\mathcal{S}_t\|_F$.

Upper Bound for the Frobenius Norm of the Cross Terms

The upper bound for the norm of the cross terms is derived as follows:
$$\|\mathcal{A}_{t,2}\|_F = \big\| \mathcal{L}_t \times_1 P_{j,new}^{\top,(1)} \times_2 P_{j-1}^{\top,(2)} \times_3 P_{j-1}^{\top,(3)} \big\|_F \leq \big\| P_{j,new}^{\top,(1)} \mathcal{L}_{t,(1)} \big\|_F \cdot \big\| P_{j-1}^{(3)} \otimes P_{j-1}^{(2)} \big\|_F$$
$$\leq \sqrt{r_{j,new}^{(1)}} \cdot \gamma_* \cdot \sqrt{\mathrm{tr}\Big( \big(P_{j-1}^{(3)} \otimes P_{j-1}^{(2)}\big)^\top \big(P_{j-1}^{(3)} \otimes P_{j-1}^{(2)}\big) \Big)} = \sqrt{r_{j,new}^{(1)}} \cdot \gamma_* \cdot \sqrt{\mathrm{tr}\big( P_{j-1}^{\top,(3)} P_{j-1}^{(3)} \big) \cdot \mathrm{tr}\big( P_{j-1}^{\top,(2)} P_{j-1}^{(2)} \big)} = \gamma_* \sqrt{r_{j,new}^{(1)}\, r_{j-1}^{(2)}\, r_{j-1}^{(3)}}. \qquad (11)$$

Proofs for Chapter 4

In order to facilitate error analysis for the 1-scale MS-HoSVD that is similar to the types of error analysis available for various HoSVD-based low-rank approximation strategies (see, e.g., [241]), we engage in a more in-depth discussion of condition (4.30) herein. Recall that the partition of $\mathcal{W}_0$ formed by the restriction matrices $R_k^{(n)}$ in (4.20)–(4.22) is called effective if there exists another pessimistic partitioning of $\mathcal{W}_0$ via restriction matrices $\{\tilde{R}_k^{(n)}\}_{k=1}^{K}$ together with a bijection $f : [K] \rightarrow [K]$ such that
$$\sum_{n=1}^{N} \Big\| \mathcal{X}|_k \times_n \big(I - Q_k^{(n)}\big) \Big\|^2 \leq \sum_{n=1}^{N} \Big\| \mathcal{W}_0 \times_n \big(I - \tilde{P}^{(n)}\big) \tilde{R}_{f(k)}^{(n)} \times_{h \neq n} \tilde{R}_{f(k)}^{(h)} \Big\|^2 \qquad (12)$$
holds for each $k \in [K]$. In (12) the $\{\tilde{P}^{(n)}\}$ are the orthogonal projection matrices obtained from the HoSVD of $\mathcal{W}_0$ with ranks $\tilde{r}_n \geq \bar{r}_n \geq r_n$ (i.e., where each $\tilde{P}^{(n)}$ projects onto the top $\tilde{r}_n$ left singular vectors of the matricization $\mathcal{W}_{0,(n)}$). Below we will show that (12) holding for $\mathcal{W}_0$ implies that the error $\|\mathcal{W}_1\|$ resulting from our 1st-scale approximation in (4.27) is less than an upper bound of the type given for a high-rank standard HoSVD-based approximation (4.29) in [241]. Here we will begin with a lemma that shows our subtensor-based approximation of $\mathcal{W}_0$ is accurate whenever (12) is satisfied.

Lemma 3. Let $\mathcal{W}_0 = \mathcal{X} - \hat{\mathcal{X}}_0 \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Suppose that $\{R_k^{(n)}\}$ is a collection of restriction matrices that forms an effective partition of $\mathcal{W}_0$ with respect to a pessimistic partition formed via pessimistic restriction matrices $\{\tilde{R}_k^{(n)}\}$ as per (12) above. Similarly, let the $\tilde{P}^{(n)}$ be the rank $\tilde{r}_n \geq \bar{r}_n$ (for all $n$) orthogonal projection matrices from (12) obtained via the truncated HoSVD of $\mathcal{W}_0$ as above.
Then,
$$\big\| \mathcal{W}_0 - \hat{\mathcal{X}}_1 \big\|^2 = \Big\| \mathcal{X} - \hat{\mathcal{X}}_0 - \sum_{k=1}^{K} \mathcal{X}|_k \times_{n=1}^{N} Q_k^{(n)} \Big\|^2 \leq \sum_{n=1}^{N} \Big\| \big(\mathcal{X} - \hat{\mathcal{X}}_0\big) \times_n \big(I - \tilde{P}^{(n)}\big) \Big\|^2.$$

Proof. We have that
$$\big\| \mathcal{W}_0 - \hat{\mathcal{X}}_1 \big\|^2 = \Big\| \mathcal{W}_0 - \sum_{k=1}^{K} \mathcal{W}_0 \times_{n=1}^{N} Q_k^{(n)} \Big\|^2 \quad \text{(use (4.12) and (4.26))}$$
$$= \Big\| \sum_{k=1}^{K} \mathcal{W}_0 \times_{n=1}^{N} R_k^{(n)} - \sum_{k=1}^{K} \mathcal{W}_0 \times_{n=1}^{N} Q_k^{(n)} R_k^{(n)} \Big\|^2 \quad \text{(use (4.21), (4.22), and (4.25))}$$
$$= \Big\| \sum_{k=1}^{K} \Big( \mathcal{W}_0 \times_{n=1}^{N} R_k^{(n)} - \mathcal{W}_0 \times_{n=1}^{N} Q_k^{(n)} R_k^{(n)} \Big) \Big\|^2 \quad \text{(use Lemma 1)}$$
$$= \sum_{k=1}^{K} \Big\| \mathcal{X}|_k - \mathcal{X}|_k \times_{n=1}^{N} Q_k^{(n)} \Big\|^2. \quad \text{(use Lemma 1, (4.21), and (4.25))} \qquad (13)$$

Applying Lemmas 1 and 2 to (13), we can now see that
$$\big\| \mathcal{W}_0 - \hat{\mathcal{X}}_1 \big\|^2 \leq \sum_{k=1}^{K} \sum_{n=1}^{N} \Big\| \mathcal{X}|_k \times_{h=1}^{n-1} Q_k^{(h)} \times_n \big(I - Q_k^{(n)}\big) \Big\|^2 \leq \sum_{k=1}^{K} \sum_{n=1}^{N} \Big\| \mathcal{X}|_k \times_n \big(I - Q_k^{(n)}\big) \Big\|^2$$
since the $Q_k^{(n)}$ matrices are orthogonal projections. Using assumption (12) we now get that
$$\big\| \mathcal{W}_0 - \hat{\mathcal{X}}_1 \big\|^2 \leq \sum_{k=1}^{K} \sum_{n=1}^{N} \Big\| \mathcal{W}_0 \times_n \big(I - \tilde{P}^{(n)}\big) \tilde{R}_k^{(n)} \times_{h \neq n} \tilde{R}_k^{(h)} \Big\|^2 = \sum_{n=1}^{N} \Big\| \mathcal{W}_0 \times_n \big(I - \tilde{P}^{(n)}\big) \Big\|^2,$$
where we have used the fact that the pessimistic restriction matrices $\tilde{R}_k^{(n)}$ partition $\mathcal{W}_0$ in the last line. $\square$

Lemma 3 indicates that the error in approximating $\mathcal{W}_0$ via low-rank approximations of its effective subtensors is potentially smaller than the error obtained by approximating $\mathcal{W}_0$ via (higher rank) truncated HoSVDs whenever (12) holds.¹ The following theorem shows that this good error behavior extends to the entire 1st scale approximation provided by (4.27) whenever (12) holds.

¹ That is, the upper bound on the error provided by Lemma 3 is less than or equal to the upper bound on the error for truncated HoSVDs provided by, e.g., [241] when/if (12) holds.

Theorem 2 (Restatement of Theorem 1). Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Suppose that (12) holds. Then, the first scale approximation error given by MS-HoSVD in (4.27) is bounded by
$$\|\mathcal{W}_1\|^2 = \big\| \mathcal{X} - \hat{\mathcal{X}}_0 - \hat{\mathcal{X}}_1 \big\|^2 \leq \sum_{n=1}^{N} \Big\| \mathcal{X} \times_n \big(I - \bar{P}^{(n)}\big) \Big\|^2,$$
where the $\bar{P}^{(n)}$ are low-rank projection matrices of rank $\bar{r}_n \geq r_n$ obtained from the truncated HoSVD of $\mathcal{X}$ as per (4.29).

Proof. Using (4.12) and (4.17) together with Lemma 3, we can see that
$$\|\mathcal{W}_1\|^2 = \big\| \mathcal{X} - \hat{\mathcal{X}}_0 - \hat{\mathcal{X}}_1 \big\|^2 = \big\| \mathcal{W}_0 - \hat{\mathcal{X}}_1 \big\|^2 \leq \sum_{n=1}^{N} \Big\| \big(\mathcal{X} - \hat{\mathcal{X}}_0\big) \times_n \big(I - \tilde{P}^{(n)}\big) \Big\|^2 \leq \sum_{n=1}^{N} \Big\| \big(\mathcal{X} - \hat{\mathcal{X}}_0\big) \times_n \big(I - \bar{Q}^{(n)}\big) \Big\|^2, \qquad (14)$$
where $\bar{Q}^{(n)} \in \mathbb{R}^{I_n \times I_n}$ is the orthogonal projection matrix of rank $\tilde{r}_n$ which projects onto the subspace spanned by the top $\tilde{r}_n$ left singular vectors of $\mathcal{X}_{(n)}$. Here (14) holds because the orthogonal projection matrices $\tilde{P}^{(n)}$ are chosen in (12) so that $\tilde{P}^{(n)} \mathcal{W}_{0,(n)}$ is a best possible rank $\tilde{r}_n$ approximation to $\mathcal{W}_{0,(n)}$. As a result, we have that
$$\Big\| \big(\mathcal{X} - \hat{\mathcal{X}}_0\big) \times_n \big(I - \tilde{P}^{(n)}\big) \Big\| = \Big\| \big(I - \tilde{P}^{(n)}\big) \mathcal{W}_{0,(n)} \Big\|_F \leq \Big\| \big(I - \bar{Q}^{(n)}\big) \mathcal{W}_{0,(n)} \Big\|_F = \Big\| \big(\mathcal{X} - \hat{\mathcal{X}}_0\big) \times_n \big(I - \bar{Q}^{(n)}\big) \Big\| \qquad (15)$$
must hold for each $n \in [N]$. Continuing from (14), we can use the definition of $\hat{\mathcal{X}}_0$ in (4.28) to see that
$$\|\mathcal{W}_1\|^2 \leq \sum_{n=1}^{N} \Big\| \Big( \mathcal{X} - \mathcal{X} \times_{h=1}^{N} P^{(h)} \Big) \times_n \big(I - \bar{Q}^{(n)}\big) \Big\|^2 = \sum_{n=1}^{N} \Big\| \mathcal{X} \times_n \big(I - \bar{Q}^{(n)}\big) - \Big( \mathcal{X} \times_{h=1}^{N} P^{(h)} \Big) \times_n \big(I - \bar{Q}^{(n)}\big) \Big\|^2 \qquad (16)$$
by Lemma 1. Due to the definition of $\bar{Q}^{(n)}$, together with the fact that its rank is $\tilde{r}_n \geq r_n$, we can see that $\big(I - \bar{Q}^{(n)}\big) P^{(n)} = 0$. As a consequence, Lemma 1 implies that $\big( \mathcal{X} \times_{h=1}^{N} P^{(h)} \big) \times_n \big(I - \bar{Q}^{(n)}\big) = 0$ for all $n \in [N]$. Continuing from (16), we now have that
$$\|\mathcal{W}_1\|^2 \leq \sum_{n=1}^{N} \Big\| \mathcal{X} \times_n \big(I - \bar{Q}^{(n)}\big) \Big\|^2. \qquad (17)$$
Again appealing to the definitions of both $\bar{Q}^{(n)}$ and $\bar{P}^{(n)}$ in (4.29), combined with the fact that $\tilde{r}_n \geq \bar{r}_n$, finally yields the desired result. $\square$

BIBLIOGRAPHY

[1] M. Mørup, L. K. Hansen, C. S. Herrmann, J. Parnas, and S. M. Arnfred, “Parallel factor analysis as an exploratory tool for wavelet transformed event-related eeg,” NeuroImage, vol. 29, no. 3, pp. 938–947, 2006.

[2] A. Cichocki, “Tensor decompositions: A new concept in brain data analysis?” arXiv preprint arXiv:1305.0395, 2013.

[3] D. Lahat, T. Adali, and C.
Jutten, “Multimodal data fusion: an overview of methods, challenges, and prospects,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1449–1477, 2015. [4] D. Letexier, S. Bourennane, and J. Blanc-Talon, “Nonorthogonal tensor matricization for hyperspectral image filtering,” Geoscience and Remote Sensing Letters, IEEE, vol. 5, no. 1, pp. 3–7, 2008. [5] T.-K. Kim and R. Cipolla, “Canonical correlation analysis of video volume tensors for action categorization and detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 8, pp. 1415–1428, 2009. [6] G. Shakhnarovich and B. Moghaddam, “Face recognition in subspaces,” in Handbook of Face Recognition. Springer, 2011, pp. 19–49. [7] C. C. Aggarwal and C. K. Reddy, Data clustering: algorithms and applications. Press, 2013. CRC [8] A. Cichocki, “Era of big data processing: a new approach via tensor networks and tensor decompositions,” arXiv preprint arXiv:1403.2048, 2014. [9] M. Steinbach, G. Karypis, V. Kumar et al., “A comparison of document clustering techniques,” in KDD workshop on text mining, vol. 400, no. 1. Boston, 2000, pp. 525–526. [10] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999. [11] A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering gene expression patterns,” Journal of computational biology, vol. 6, no. 3-4, pp. 281–297, 1999. 131 [12] A. C. Cameron, J. B. Gelbach, and D. L. Miller, “Robust inference with multiway clustering,” Journal of Business & Economic Statistics, 2012. [13] R. Xu, D. Wunsch et al., “Survey of clustering algorithms,” Neural Networks, IEEE Transactions on, vol. 16, no. 3, pp. 645–678, 2005. [14] D. Fasulo, “An analysis of recent work on clustering algorithms,” Department of Computer Science & Engineering, University of Washington, 1999. [15] P. Smyth, “Model selection for probabilistic clustering using cross-validated likelihood,” Statistics and computing, vol. 10, no. 1, pp. 63–72, 2000. [16] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007. [17] N. Nguyen and R. Caruana, “Consensus clusterings,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007, pp. 607–612. [18] A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” The Journal of Machine Learning Research, vol. 3, pp. 583–617, 2003. [19] S. Bickel and T. Scheffer, “Multi-view clustering.” in ICDM, vol. 4, 2004, pp. 19–26. [20] X. Liu, S. Ji, W. Glanzel, and B. De Moor, “Multiview partitioning via tensor methods,” Knowledge and Data Engineering, IEEE Transactions on, vol. 25, no. 5, pp. 1056–1069, 2013. [21] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie, “Beyond pairwise clustering,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 838–845. [22] A. Shashua, R. Zass, and T. Hazan, “Multi-way clustering using super-symmetric nonnegative tensor factorization,” in Computer Vision–ECCV 2006. Springer, 2006, pp. 595– 608. [23] Z. He, A. Cichocki, S. Xie, and K. Choi, “Detecting the number of clusters in n-way probabilistic clustering,” Pattern Analysis and Machine Intelligence, IEEE Transactions On, vol. 32, no. 11, pp. 2006–2021, 2010. [24] E. Acar and B. Yener, “Unsupervised multiway data analysis: A literature survey,” Knowledge and Data Engineering, IEEE Transactions on, vol. 21, no. 1, pp. 
6–20, 2009. 132 [25] E. Acar, S. A. Çamtepe, M. S. Krishnamoorthy, and B. Yener, “Modeling and multiway analysis of chatroom tensors,” in Intelligence and Security Informatics. Springer, 2005, pp. 256–268. [26] F. Estienne, N. Matthijs, D. Massart, P. Ricoux, and D. Leibovici, “Multi-way modelling of high-dimensionality electroencephalographic data,” Chemometrics and Intelligent Laboratory Systems, vol. 58, no. 1, pp. 59–72, 2001. [27] S. Gourvénec, I. Stanimirova, C.-A. Saby, C. Airiau, and D. Massart, “Monitoring batch processes with the statis approach,” Journal of chemometrics, vol. 19, no. 5-7, pp. 288–300, 2005. [28] J. B. Kruskal, “Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics,” Linear algebra and its applications, vol. 18, no. 2, pp. 95–138, 1977. [29] ——, “Rank, decomposition, and uniqueness for 3-way and n-way arrays,” Multiway data analysis, vol. 33, pp. 7–18, 1989. [30] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009. [31] L. Grasedyck, D. Kressner, and C. Tobler, “A literature survey of low-rank tensor approximation techniques,” GAMM-Mitteilungen, vol. 36, no. 1, pp. 53–78, 2013. [32] M. Rajih, P. Comon, and R. A. Harshman, “Enhanced line search: A novel method to accelerate parafac,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1128–1147, 2008. [33] D. Nion and L. De Lathauwer, “An enhanced line search scheme for complex-valued tensor decompositions. application in ds-cdma,” Signal Processing, vol. 88, no. 3, pp. 749–755, 2008. [34] A.-H. Phan, P. Tichavsky, and A. Cichocki, “Fast alternating ls algorithms for high order candecomp/parafac tensor factorizations,” Signal Processing, IEEE Transactions on, vol. 61, no. 19, pp. 4834–4846, 2013. [35] L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000. 133 [36] ——, “On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000. [37] L. De Lathauwer, “Decompositions of a higher-order tensor in block terms-part ii: definitions and uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1033–1066, 2008. [38] P. Tichavsky, A. H. Phan, and A. Cichocki, “Non-orthogonal tensor diagonalization, a tool for block tensor decompositions,” arXiv preprint arXiv:1402.1673, 2014. [39] L. Grasedyck, “Hierarchical singular value decomposition of tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 4, pp. 2029–2054, 2010. [40] I. V. Oseledets, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011. [41] J. Brachat, P. Comon, B. Mourrain, and E. Tsigaridas, “Symmetric tensor decomposition,” in Signal Processing Conference, 2009 17th European. IEEE, 2009, pp. 525–529. [42] G. Allen, “Sparse higher-order principal components analysis,” in International Conference on Artificial Intelligence and Statistics, 2012, pp. 27–36. [43] J. Chen and Y. Saad, “On the tensor svd and the optimal low rank orthogonal approximation of tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 4, pp. 1709– 1734, 2009. [44] I. Jolliffe, Principal component analysis. Wiley Online Library, 2002. [45] P. 