CONTRIBUTIONS TO MACHINE LEARNING IN BIOMEDICAL INFORMATICS

By

Inci Meliha Baytas

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2019

ABSTRACT

CONTRIBUTIONS TO MACHINE LEARNING IN BIOMEDICAL INFORMATICS

By Inci Meliha Baytas

With innovations in digital data acquisition devices and increased memory capacity, virtually all commercial and scientific domains have been witnessing an exponential growth in the amount of data they can collect. For instance, healthcare is experiencing a tremendous growth in digital patient information due to the high adoption rate of electronic health record systems in hospitals. The abundance of data offers many opportunities to develop robust and versatile systems, as long as the underlying salient information in the data can be captured. On the other hand, today's data, often referred to as big data, is challenging to analyze due to its large scale and high complexity. For this reason, efficient data-driven techniques are necessary to extract and utilize the valuable information in the data. The field of machine learning essentially develops such techniques to learn effective models directly from the data. Machine learning models have been successfully employed to solve complicated real-world problems. However, the big data concept has numerous properties that pose additional challenges in algorithm development. Namely, high dimensionality, class membership imbalance, non-linearity, distributed data, heterogeneity, and temporal nature are some of the big data characteristics that machine learning must address.

Biomedical informatics is an interdisciplinary domain where machine learning techniques are used to analyze electronic health records (EHRs). EHRs comprise digital patient data of various modalities and constitute an instance of big data. For this reason, analysis of digital patient data is quite challenging, although it provides a rich source for clinical research. While the scale of EHR data used in clinical research might not be huge compared to other domains, such as social media, it is still not feasible for physicians to analyze and interpret longitudinal and heterogeneous data of thousands of patients. Therefore, computational approaches and graphical tools to assist physicians in summarizing the underlying clinical patterns of the EHRs are necessary. The field of biomedical informatics employs machine learning and data mining approaches to provide the essential computational techniques to analyze and interpret complex healthcare data to assist physicians in patient diagnosis and treatment.

In this thesis, we propose and develop machine learning algorithms, motivated by prevalent biomedical informatics tasks, to analyze the EHRs. Specifically, we make the following contributions: (i) A convex sparse principal component analysis approach along with variance-reduced stochastic proximal gradient descent is proposed for the patient phenotyping task, which is defined as finding clinical representations for patient groups sharing the same set of diseases. (ii) An asynchronous distributed multi-task learning method is introduced to learn predictive models for distributed EHRs. (iii) A modified long short-term memory (LSTM) architecture is designed for the patient subtyping task, where the goal is to cluster patients based on similar progression pathways.
The proposed LSTM architecture, T-LSTM, performs a subspace decomposition on the cell memory such that the short-term effect in the previous memory is discounted based on the length of the time gap. (iv) An alternative to the T-LSTM model is proposed with a decoupled memory to capture short- and long-term changes. The proposed model, the decoupled memory gated recurrent network (DM-GRN), is designed to learn two types of memories focusing on different components of the time series data. In this study, in addition to the healthcare applications, the behavior of the proposed model is also investigated on a traffic speed prediction problem to illustrate its generalization ability. In summary, the aforementioned machine learning approaches have been developed to address complex characteristics of electronic health records in routine biomedical informatics tasks such as computational patient phenotyping and patient subtyping. The proposed models are also applicable to other domains whose data share similar characteristics with EHRs.

Copyright by
INCI MELIHA BAYTAS
2019

To my parents

ACKNOWLEDGMENTS

As I approach the end of my Ph.D., I would like to express my appreciation and gratitude to the special people who were with me on this journey.

I am so grateful to be a Ph.D. student of Prof. Anil K. Jain. Since the beginning of my Ph.D., I have always felt his endless support and encouragement. His wisdom and positive approach made me feel more confident and peaceful during the difficult times of my Ph.D. His reference opened many doors, including the faculty position I have been dreaming of. I will always be thankful to Dr. Jain for everything he has done for my training and future.

I am also grateful to have had the opportunity to work with my co-advisor Dr. Jiayu Zhou. I would like to thank him for being such a great teacher and mentor. I learned a lot about conducting research, writing papers, and presenting my work from him. I am so appreciative that both my advisors provided me with several opportunities to improve my teaching skills and made sure I had everything I needed for my future career goals.

I also would like to use this opportunity to thank Prof. Bilge Gunsel for believing in me and encouraging me. Her endless support and reference played a big role in my earning this degree today.

I would like to thank my beloved parents for encouraging me to pursue this degree. Without their love, support, understanding, trust, encouragement, and friendship, I would not be who I am today. Throughout my life, they never stopped believing in me, and this has always given me the confidence and strength to move on. They have always been a perfect role model with their hard work and dedication. They always inspire me with their tenacity and determination. I cannot thank my parents, my eternal best friends, enough for loving me, guiding me, and giving me such a warm family I can always hold on to. I also would like to thank my cousin Aydeniz for never leaving me alone and giving me her endless support, friendship, and sisterhood. She has been a tremendous support mechanism for me during my Ph.D.

Being in graduate school was very challenging from time to time, and making beautiful friendships with people who could understand me was so comforting. I would like to thank Soweon and Serhat for answering my questions before I joined the lab. It has been a pleasure to share this journey with my lab mates Debayan, Sixue, Josh, Yichun, Luan, Tarang, Charles, Lacey, Sunpreet, and Keyur.
I will miss the time I spent with my dear friends Sudipta, Sixue, Nupur, and Vidhya. A very special thanks to Debayan and Sudipta for being such fun, supportive, caring, and loving friends; their friendship supported me to the core. Last but not least, I would like to thank my lifelong friends in Turkey: Unzile, Emine, and Cigdem. After all these years, and even though we were far from each other, they never stopped being such beautiful sisters to me.

Finally, I would like to thank the CSE department: Prof. Eric Torng for his guidance on the Ph.D. program, Prof. Abdol Esfahanian for providing me with a teaching opportunity, and Brenda Hodge and Steve Smith for their help with enrollments and reimbursements. I also would like to express my gratitude to my committee members, Prof. Pang-Ning Tan, Prof. Xiaoming Liu, and Prof. Selin Aviyente, for their time and valuable feedback.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
  1.1 Biomedical Informatics
    1.1.1 Data
    1.1.2 Major Challenges
    1.1.3 Role of Machine Learning in Biomedical Informatics
  1.2 Machine Learning
    1.2.1 Problem Setting
    1.2.2 Challenges
    1.2.3 Machine Learning Solutions
  1.3 Dissertation Focus and Contributions

Chapter 2 PHENOTREE: Hierarchical Phenotyping via Sparse Principal Component Analysis
  2.1 Introduction
  2.2 Literature Review
    2.2.1 Data-Driven Phenotyping
    2.2.2 Sparse Principal Component Analysis and Stochastic Proximal Optimization
    2.2.3 Visual Analytics for EHR
  2.3 PHENOTREE
    2.3.1 Electronic Health Records
    2.3.2 Phenotyping via SPCA
    2.3.3 Stochastic Convex SPCA (Cvx-SPCA)
      2.3.3.1 Optimization Scheme
      2.3.3.2 Convergence Analysis
    2.3.4 Interactive Hierarchical Phenotyping via PHENOTREE
  2.4 Experiments
    2.4.1 Synthetic Dataset
    2.4.2 Private Electronic Health Records Dataset
    2.4.3 Diabetes Dataset
  2.5 Summary

Chapter 3 Asynchronous Distributed Multi-Task Learning
  3.1 Introduction
  3.2 Literature Review
    3.2.1 Distributed Optimization
    3.2.2 Distributed Multi-Task Learning
  3.3 Distributed Multi-Task Learning
    3.3.1 Regularized Multi-Task Learning
    3.3.2 Synchronized Multi-Task Learning (SMTL)
    3.3.3 Asynchronous Multi-Task Learning (AMTL)
      3.3.3.1 Asynchronous Updates of AMTL
      3.3.3.2 Dynamic Step Size in AMTL
  3.4 Experiments
    3.4.1 Experimental setting
    3.4.2 Comparison between AMTL and SMTL
      3.4.2.1 Synthetic Dataset Experiments
      3.4.2.2 Real World Datasets Experiments
    3.4.3 Dynamic step size
  3.5 Summary

Chapter 4 Patient Subtyping via Time-Aware LSTM Networks
  4.1 Introduction
  4.2 Literature Review
    4.2.1 RNNs for Biomedical Informatics
    4.2.2 Auto-Encoder Networks
  4.3 Time-Aware Long Short Term Memory
    4.3.1 Long Short-Term Memory (LSTM)
    4.3.2 Proposed Time-Aware LSTM (T-LSTM)
    4.3.3 Patient Subtyping with T-LSTM Auto-Encoder
  4.4 Experiments
    4.4.1 Synthetic Dataset
      4.4.1.1 Supervised Experiment
      4.4.1.2 Unsupervised Experiment
    4.4.2 Parkinson's Progression Markers Initiative (PPMI) Data
      4.4.2.1 Target Sequence Prediction
      4.4.2.2 Patient Subtyping of PPMI Data
  4.5 Summary

Chapter 5 Decoupled Memory Gated Recurrent Network
  5.1 Introduction
  5.2 Literature Review
  5.3 Methodology
    5.3.1 Memory in Recurrent Networks
      5.3.1.1 Internal Memory of Recurrent Networks
    5.3.2 External Memory Architectures
      5.3.2.1 Neural Turing Machine
      5.3.2.2 Memory Networks
      5.3.2.3 Disadvantages of External Memory
    5.3.3 Decoupled Memory Gated Recurrent Network (DM-GRN)
  5.4 Experiments
    5.4.1 Synthetic Data
      5.4.1.1 Periodic Signal
      5.4.1.2 Synthetic EHR
    5.4.2 Traffic Speed Prediction
    5.4.3 Diabetes Data
  5.5 Summary

Chapter 6 Summary and Suggestions for Future Work
  6.1 Summary
  6.2 Suggestions for Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Different domains offer different challenges pertaining to big data. Scale is only one of the factors that increase the complexity of the traditional machine learning approaches. In some applications, the amount of data used for algorithm development may not be large, but other additional challenges are introduced.

Table 2.1 We sample patients with female-, male-, child-, and old-age-specific diagnoses. These samples may overlap with each other. For instance, a patient may have dementia and a prostate diagnosis together. We did not include medical conditions that can be encountered at any age and in both genders in these groups of patients. Old patients are assumed to be above 60 years old. The age range of child patients spans infancy (birth to 1 year) to adolescence (12-18 years).

Table 2.2 EHR data features which contribute to the output dimensions after the Cvx-SPCA algorithm was applied to the whole patient population (168,431 patients). Feature descriptions were provided by the private dataset and they can also be found in [44].

Table 3.1 Computation times (sec.) of AMTL and SMTL with different network delays. The offset value of the delay for AMTL-5, 10, 30 was chosen as 5, 10, 30 seconds. The same network settings were used to compare the performance of AMTL and SMTL. AMTL performed better than SMTL for all the network settings and numbers of tasks considered here.

Table 3.2 Benchmark public datasets used in this study. Sample sizes vary per task. The third column of the table summarizes the minimum and maximum number of data points tasks might have in each dataset.

Table 3.3 Training time (sec.) comparison of AMTL and SMTL for the three public datasets.
Similar to the synthetic data experiments, AMTL generally requires smaller training time compared to SMTL under different network settings. . . . . . 79 Table 3.4 Objective values of the synthetic dataset with 5, 10, and 15 tasks. Objective values of different network settings are shown. Dynamic step size yields lower objective values at the end of the last iteration than fixed step size. This result indicates that the dynamic step size in asynchronous distributed setting is recom- mended to boost the convergence. . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Table 4.1 Supervised synthetic EHR experimental results showing the average AUC of testing on 10 different splits. Training and testing proportion was chosen as 70% to 30%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 . . . . . . . . . . xi Table 4.2 Average Rand index of k-means over 10 runs. T-LSTM auto-encoder out- performs LSTM and MF1-LSTM auto-encoders. . . . . . . . . . . . . . . . . . . . 98 Table 4.3 Details of PPMI data used in this study. Elapsed time encountered in the data is measured in months and it varies between 1 month to nearly 24 months. Here, the elapsed time interval is not the time interval of PPMI data recording, but elapsed times seen in records of individual patients. . . . . . . . . . . . . . . . . . 101 Table 4.4 Average mean square error (MSE) for 10 different train-test splits for T- LSTM, LSTM, MF1-LSTM, and MF2-LSTM. T-LSTM yielded a slightly better result than the standard LSTM in the presence of the unstructured time gaps. . . . . 101 Table 4.5 Some common target features from PPMI dataset on which T-LSTM per- formed better than LSTM and MF1-LSTM during 10 trials. These target features are mainly related to the effects of Parkinson’s disease on muscle control. . . . . . 102 Table 4.6 Results of the statistical analysis for T-LSTM, MF1-LSTM and MF2- LSTM. DaTScan1 corresponds to DaTScan SBR-CAUDATE RIGHT, DaTScan2 is DaTScan SBR-CAUDATE LEFT, and DaTScan4 is DaTScan SBR-PUTAMEN LEFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 . . . . . . . . . . Table 5.1 Mathematical models of gated RNN architectures covered in this chapter. . . 115 Table 5.2 Diabetes classification experiment using synthetic EHR dataset. Average AUC and ACC are reported for 5 different train and test data splits. . . . . . . . . . 128 Table 5.3 Mean absolute error (MAE) and Root Mean Square (RMSE) are reported for the traffic prediction task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 Table 5.4 Diabetes dataset statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Table 5.5 Readmission prediction performance of diabetes dataset. Average accuracy (ACC) and area under ROC curve (AUC) are reported for 5 different train and test splits. Standard deviation is given inside the parentheses. . . . . . . . . . . . . . . 132 xii LIST OF FIGURES Figure 1.1 Evolution of machine learning. Since 1997, machine learning domain has been witnessing groundbreaking developments and breakthroughs. Large datasets and competitions, such as MNIST, ImageNet, Netflix, and Kaggle, have become the catalyst for many state-of-the-art algorithms. . . . . . . . . . . . . . . . . . . . Figure 1.2 Percentage of office-based physicians using EHR. Adoption of EHRs dou- bled between 2008 and 2015 [90]. Basic EHR requires EHR systems to have at least minimal information, such as patient demographics, medication list, discharge summary, and lab reports. 
Whereas certified EHR systems comprise detailed clin- ical information, different functionalities, and security requirements. . . . . . . . . Figure 1.3 Healthcare data is growing rapidly. A 48% annual growth rate is expected leading to 2, 314 Exabytes of data in 2020 [45]. . . . . . . . . . . . . . . . . . . . 3 5 5 Figure 1.4 Several rows of an admission table in MIMIC-III [66] database. Admission, discharge dates, diagnoses, and subject ID are some of the major attributes for biomedical informatics applications. . . . . . . . . . . . . . . . . . . . . . . . . . 7 Figure 1.5 Several rows of diagnoses, lab events and patient tables in MIMIC-III [66] database. Subject ID of a patient is generally used to retrieve information from different tables. Diagnoses are represented by ICD-9 codes. Quantitative values of results, time, unit, and type of the lab tests are often stored in lab events table. . . . 8 Figure 1.6 Medical record of a patient contains different medical events at each time step. The gap between two consecutive time steps is often irregular in EHR datasets. For this reason, EHRs are harder to analyze compared to univariate regu- larly sampled time series data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Figure 1.7 An example of hierarchical patient phenotyping visualization. Each node in this tree gives a structured clinical phenotype and a stable subcohort characterized by the phenotype. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 . . . . Figure 1.8 Expected growth of the world market for artificial intelligence (AI) systems between 2017-2025 [110]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 1.9 Overall procedure of supervised learning problem. N is the total number of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . training data points. . . . . 14 Figure 2.1 ICD-9 codes are used to represent different diagnostic groups. These uni- versal codes have a hierarchical structure. . . . . . . . . . . . . . . . . . . . . . . 26 xiii Figure 2.2 SPCA learns a sparse loading vector. Principal components are represented . . . . . . . . . . . . as a linear combination of a subset of the input dimensions. . 28 Figure 2.3 Computational phenotyping procedure with SPCA. SPCA is iteratively ap- plied until the desired number of levels is reached. . . . . . . . . . . . . . . . . . . 39 Figure 2.4 An example of hierarchical phenotyping with stochastic Cvx-SPCA. (a) the first, (b) the second and (c) the third level features of the patient population. This procedure can be applied repeatedly until the desired number of levels is reached. . 49 Figure 2.5 Convergence for synthetic data with n samples and d features. Convergence of the proposed stochastic Cvx-SPCA with (prox-SVRG) and without variance reduction (prox-SGD). Proximal stochastic gradient with variance reduction has a faster convergence rate, since the variance caused by random sampling is bounded in prox-SVRG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . . . . . Figure 2.6 Convergence of sparse pattern in the log scale. Cvx-SPCA with Prox-SGD takes 275 epochs, whereas Cvx-SPCA with Prox-SVRG takes 45 epochs to con- verge a similar sparsity pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Figure 2.7 Running times (in seconds) are compared to obtain similar cardinality of loading vector. A machine with 2.8GHz Intel(R) Xeon(R) CPU and 142 GB mem- ory was used in the experiments. 
The methods denoted by Reg-SPCA, InvPow- SPCA, and InvPow-SPCA are from [55, 61], and from [86], respectively. . . . . . . 51 Figure 2.8 Regularization path for Cvx-SPCA, where different colored curves repre- sent the values of 10 features. When the regularization term was around −0.11 (dashed vertical line), the non-zero loading values of the known principal compo- nent, which was used to generate the data, are recovered. . . . . . . . . . . . . . . 51 Figure 2.9 An example branch of the PHENOTREE. The first level diagnosis with ICD-9 code 239 denotes Neoplasm Of Unspecified Nature. Children nodes of the patients who have ICD-9 239 are ICD-9 176 Karposi’s Sarcoma, 196 Secondary and Unspecified Malignant Neoplasm of Lymph Nodes, 693 Dermatitis Due To Drugs, 702 Other Dermatoses, and ICD-9 957, Injury to Other and Unspecified Nerves. Karposi’s Sarcoma is a type of cancer. Patients who have diagnosis of neoplasm of unspecified nature may have other types of neoplasms as well. In addition, we can also see dermatological issues in the second level. . . . . . . . . Figure 2.10 Hierarchical stratification via Cvx-SPCA. Cvx-SPCA is applied on the en- tire patient population and the features with the largest absolute loading values on the leading principal component are selected. Each feature dimension corresponds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . to a specific disease. . . . . . 52 . 52 xiv Figure 2.11 Hierarchical stratification via Cvx-SPCA for female and male patients. One of the first level features in Figure 2.11a is ICD-9 636, Illegal Abortion. In the sec- ond level, we see diagnoses like ICD-9 596 Disorders of Bladder, and 37 Tetanus, which could be some of the side effects of illegal abortion. . . . . . . . . . . . . . 53 Figure 2.12 Hierarchical stratification via Cvx-SPCA for children and old patients. Ac- cording to our observations, different patient groups tend to have common diseases which was illustrated for general population before. On the other hand, they also yielded specific diagnoses as well. . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Figure 2.13 Hierarchical representation of the whole diabetes data. . . . . . . . . . . . . 55 Figure 2.14 Hierarchical Stratification via Cvx-SPCA for patients who were readmit- ted and not readmitted in diabetes database. Injury and poisoning are commonly encountered for not readmitted patients. However, wider range of diagnoses are observed for readmitted patients. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 2.15 Hierarchical Stratification via Cvx-SPCA for female and male patients in diabetes data. We can observe female/male diagnoses along with common dis- eases. For instance, Figure 2.15b has ICD-9 602, disorder of prostate, and ICD-9 401, hypertension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 . . Figure 2.16 Hierarchical Stratification via Cvx-SPCA for old and teen/adult patients in diabetes data. Neoplasm is very common in patient populations with a wide range of age or different genders for both the EHR datasets examined in this study. . . . . 58 Figure 3.1 Overview of the proposed asynchronous multi-task learning (AMTL) framework. After a task node computes the gradient, it sends the gradient or the updated model to the central server. Central server couples all the individual mod- els to perform proximal projection and send the updated individual models back to the corresponding task nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
61 Figure 3.2 Illustration of asynchronous updates in AMTL. The asynchronous update scheme has an inconsistency when it comes to reading model vectors from the central server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 . . . . . . . Figure 3.3 Computation times of AMTL and SMTL for (a) varying number of tasks when the number of dimensions and the number of data points per node were fixed at 50 and 100, respectively, for (b) varying number of task sizes with 50 dimen- sions, and 5 tasks, and for (c) varying dimensionality of 5 tasks with 100 samples in each task. As expected, computational time of SMTL is higher than AMTL for a fixed number of iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 xv Figure 3.4 Convergence of AMTL and STML under the same network configurations. Experiment was conducted for randomly generated synthetic datasets with 5 and 10 tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 . . . . . . . . . . Figure 4.1 An example segment of longitudinal patient records where the patient had 6 office visits between Sept 5, 2015 and Sept 27, 2016. In each visit, medical information such as diagnosis of the patient is recorded. Diagnoses are usually encoded as ICD-9 codes. Time gap between two successive visits varies. . . . . . . 84 Figure 4.2 Illustration of time-aware long-short term memory (T-LSTM) unit, and its application on medical records. Green boxes indicate networks and yellow circles denote point-wise operators. T-LSTM takes input record and the elapsed time at the current time step. T-LSTM decomposes the previous memory into long and short term components and utilizes the elapsed time (∆t) to discount the short term effects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 . . . . . . . . . Figure 4.3 Finding patient groups with a single-layer T-LSTM Auto-Encoder. Blue arrows denote the cell memory and the black arrows denote the hidden states. After the representations (Ri, i = 1, 2,··· , 8) are learned for the population, we can cluster the patients and obtain subtypes for each group. . . . . . . . . . . . . . . . 94 Figure 4.4 Illustration of the clustering results. Different colors denote ground-truth assignments of different clusters. T-LSTM auto-encoder learns a mapping for the sequences such that 4 separate groups of points can be represented in the 2-D space. 99 Figure 4.5 Change in the objective values of T-LSTM, MF1-LSTM and LSTM with It is observed that the modifications related to the time . respect to 500 epochs. irregularity does not deteriorate the convergence of the original LSTM network. . 100 Figure 4.6 Heat map illustration of the patient subtyping results of T-LSTM for two clusters. Light red color represents the cluster mean which is higher than the to- tal mean of the patients and the shades of blue show lower mean values for the corresponding feature with p-value< 0.05. . . . . . . . . . . . . . . . . . . . . . . 104 Figure 5.1 Comparison between some of the auto-reggresive models and deep learning techniques for time series analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Figure 5.2 Regular and irregular time series data example. . . . . . . . . . . . . . . . . 110 Figure 5.3 Standard LSTM unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Figure 5.4 Neural Turing Machine [52] . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Figure 5.5 Memory Networks [118] . . . . . . . . . . . . . . . . . . . . . . . . . . . 
121 xvi Figure 5.6 Proposed decoupled memory gated recurrent network (DM-GRN) unit. . . . 125 Figure 5.7 Synthetically generated input signal and its corresponding target. . . . . . . 126 Figure 5.8 Prediction performance of the synthetic data. When only short term mem- ory is used to predict the target, prediction error is high, but the predicted values can follow the temporal changes. On the other hand, long term memory component demonstrates the overall trend and complements the short term memory. Thus, the . . . . . . . . . . . final prediction perfectly superimposes on the original target. . 134 Figure 5.9 Predictions obtained using the hidden state and the cell memory of the stan- dard LSTM unit. Cell memory is assumed to behave as a long term memory. According to our observation, information contained in the cell memory does not directly reflect the predictive power. . . . . . . . . . . . . . . . . . . . . . . . . . 135 Figure 5.10 Long and short term memory contents of DM-GRN, and hidden state and cell memory contents of LSTM for the synthetic data. LSTM short term and long term memories correspond to hidden states and the cell memory, respectively. Heatmaps are used to visualize the memory activations. The periodic nature of the signal and the short term changes are more observable in DM-GRN memory components than LSTM hidden states and the cell memory. . . . . . . . . . . . . . 136 Figure 5.11 Road network of a subregion in the city of Chengdu, China. Intersections are represented with nodes, and the roads are represented by the edges. Due to noise and errors during map matching procedure, connected roads on the graph . . . . . . . . . . . . . . . . . . . . . . . . . might not be connected in real life. . 137 Figure 5.12 Speed time series of one road over 4 days of the same week in November, 2016. Traffic data suffers from missing values. Roads with relatively few missing values are kept and the missing values are interpolated. . . . . . . . . . . . . . . . 138 Figure 5.13 Speed time series for one road for 4 days in November, 2016. Each time step, in the 24 steps long sequence, represents the aggregated speed value of one hour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 . . . . . . . . . . . Figure 5.14 Traffic prediction performance. Long term memory of DM-GRN underesti- mates the actual speed values, however the long term prediction follows the global trend of the time series while the short term component is more susceptible to the short term changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 xvii LIST OF ALGORITHMS Algorithm 1 Prox-SVRG algorithm for solving Cvx-SPCA. . . . . . . . . . . . . . . . 31 Algorithm 2 Construction of a PHENOTREE . . . . . . . . . . . . . . . . . . . . . . . 40 Algorithm 3 The proposed asynchronous multi-task learning framework . . . . . . . . . 69 xviii Chapter 1 Introduction Digital data has been an important driving force of technology and the economy since the 1990s. Development of devices for the capture and storage of digital data has rapidly increased the amount of information collected in various media, including text, audio, image, and video. The digital age is witnessing an ever-growing information explosion, where the size of the digital universe is es- timated to double every two years; a 50-fold growth from 2010 to 2020 [2]. Digital data is very valuable for scientific research and industry as long as insightful information can be extracted. 
On the other hand, comprehensive analysis of today’s data, often named big data, is an ongo- ing challenge. IBM data scientists defined big data with four properties: volume (scale of data), variety (different forms of data), velocity (temporal nature of data) and veracity (uncertainty of data) 1. As a result, big data is challenging not only because of its large scale, but also its high complexity. Primary data types and their major challenges are summarized for different domains in Table 1.1. As presented in the table, big data is characterized by a high dimensional, non-linear, heterogeneous, distributed, and temporal nature. However, big data offers numerous opportunities for machine learning researchers in different domains, such as healthcare. The sheer volume of digital health information is projected to grow even faster than other sectors over the next seven years, according to the report by the International Data Corporation (IDC) [70]. Therefore, with- 1http://www.ibmbigdatahub.com/infographic/four-vs-big-data 1 out efficient and comprehensive data analysis techniques, it is not feasible for physicians to draw clinical conclusions from the growing amount of digitized patient data. Biomedical informatics is an interdisciplinary domain whose goal is to improve the patient care using efficient data-driven techniques. In particular, biomedical informatics leverages ma- chine learning and data mining techniques to enable preventive and personalized medicine, reduce healthcare costs, and facilitate clinical research. Digital health records, as big data in general, are heterogeneous, non-linear 2 and temporal. Thus, biomedical informatics has become an evolving domain for machine learning research. Machine learning has become an inevitable tool to solve challenging real world problems due to its capability to infer underlying patterns in data. As the data complexity increases, machine learning techniques need to become more sophisticated. The chronological developments in ma- chine learning are summarized in Figure 1.1. After Hinton’s introduction of deep learning in 2006, powerful learning models have emerged to solve complicated real world problems, such as disease detection, temporal clinical event prediction, and concept embedding. In this thesis, we developed machine learning and deep learning models to address some of the challenging biomedical infor- matics tasks, such as computational phenotyping and patient subtyping. The proposed methods, in particular, focus on high dimensional, non-linear, distributed, and temporal aspects of the digital patient data. In the subsequent sections, a brief introduction to biomedical informatics, challenges, machine learning solutions, and the main contributions of the thesis are summarized. 1.1 Biomedical Informatics Biomedical informatics is an evolving domain for machine learning researchers ever since hospi- tals started storing digital patient records. The informatics component focuses on the acquisition, storage and the retrieval of the health information, whereas the biomedical component refers to the medical tasks using the patient data [16]. The essential purpose of the biomedical informatics is to 2Here, non-linear refers to non-linear interactions between the attributes (e.g., diagnoses, medication, demograph- ics) provided in patient data. 2 Figure 1.1 Evolution of machine learning. Since 1997, machine learning domain has been wit- nessing groundbreaking developments and breakthroughs. 
Large datasets and competitions, such as MNIST, ImageNet, Netflix, and Kaggle, have become the catalyst for many state-of-the-art algorithms.

convert the vast amount of digital health information into relevant knowledge using computational techniques [6]. The digital patient data is provided by Electronic Health Records (EHRs), which represent longitudinal patient information, including diagnoses, medication, lab results, and medical images. EHRs follow a growth trend similar to big data in general; the EHR market is expected to reach 33.41 billion US dollars by 2025 [100]. The adoption rate of EHR systems is also increasing every year. The change in the percentage of physicians using EHR systems between 2004-2015 [90] is given in Figure 1.2. Consequently, an exponential growth in EHR data volume is observed, and this trend is predicted to continue through 2020 as shown in Figure 1.3 [45].

EHRs provide valuable and diverse information about patients and physicians. Biomedical informatics studies systematic techniques to extract the salient information which EHR data can offer. Adoption of EHR systems and data analytics techniques improves different aspects of healthcare. One important aspect is personalized healthcare, which requires analyzing trends in different patient populations and identification of patterns in patient outcomes. A vast number of EHRs facilitate learning the common trends in a patient cohort, and consequently enable prediction of health outcomes. One of the important benefits of using data analytics tools in healthcare is the reduction in healthcare costs. Currently, healthcare expenses are approximately 18% of GDP (Gross Domestic Product), which corresponds to nearly 600 billion dollars in the United States [40]. In particular, predictive analysis using EHRs plays an important role in increasing the performance and efficiency of healthcare services, and consequently assists in optimizing

Table 1.1 Different domains offer different challenges pertaining to big data. Scale is only one of the factors that increase the complexity of the traditional machine learning approaches. In some applications, the amount of data used for algorithm development may not be large, but other additional challenges are introduced.

Domain: Healthcare. Data: Electronic Health Records. Major challenges: large scale (e.g., data on 53K patient admissions in MIMIC-III [66]); high dimensional (e.g., thousands of ICD-9 codes [113]); heterogeneous (universal codes, medical images, hand-written notes, etc.); distributed; temporal; non-linear interactions; class imbalance.
Domain: Computer Vision. Data: Images, videos. Major challenges: large scale (e.g., IMAGENET with 14M images [102]); temporal (videos); class imbalance.
Domain: E-Commerce. Data: Behavioral data (e.g., clicks, transactions). Major challenges: large scale (e.g., 2M events [3]); heterogeneous; distributed; temporal.
Domain: Finance. Data: Financial and stock market data (e.g., transactions, portfolios). Major challenges: large scale (e.g., data on 542K transactions [25]); temporal.
Domain: Traffic Management. Data: GPS, surveillance data. Major challenges: large scale (e.g., 1.3M taxi trips in NYC between January 2009 and June 2015 [4]); distributed; temporal.

Figure 1.2 Percentage of office-based physicians using EHR. Adoption of EHRs doubled between 2008 and 2015 [90]. Basic EHR requires EHR systems to have at least minimal information, such as patient demographics, medication list, discharge summary, and lab reports.
Whereas cer- tified EHR systems comprise detailed clinical information, different functionalities, and security requirements. Figure 1.3 Healthcare data is growing rapidly. A 48% annual growth rate is expected leading to 2, 314 Exabytes of data in 2020 [45]. 5 healthcare costs. In summary, biomedical informatics research aims to develop computational techniques to improve diagnosis and prognosis of diseases. In addition, clinical treatment and medical research are also enhanced by the biomedical informatics techniques [6]. In the following section, EHR data characteristics and major challenges encountered in biomedical informatics are discussed. 1.1.1 Data Before the adoption of EHRs, clinical information used to be recorded in paper format and trans- ferred between clinics physically [112]. Paper based patient records were not effective for clinical research since extracting patient statistics and the patterns in healthcare outputs of a large patient cohort was not attainable for physicians. As a result, the clinical treatments used to focus more on the cure rather than predicting and preventing possible risks [99]. After EHR systems are launched in hospitals, quality of the patient records is improved and the scope of the patient’s medical history is expanded. Various types of patient information, such as diagnoses, lab results, medical imaging, medication, doctor’s handwritten notes, and most importantly time stamp of every clinical outcome can be found in EHRs. As a result of the availability of comprehensive patient information, clinical decision making has been enriched and become more proactive. In a standard EHR database, multiple tables are constructed to store different types of informa- tion. As an example, in Figures 1.4 and 1.5, several rows of admission, diagnosis, lab result, and patient demographics tables of a publicly available critical care database, named MIMIC-III [66], are shown. In a typical patient table, true identity of the individuals is never shared, but assigned unique IDs for each patient, demographics, dates of admission and discharge are stored. Dates are also encrypted to avoid releasing the actual time that a patient visited the hospital. Since it provides the time stamps, the patient table is necessary for the temporal analysis of EHRs. Another very important piece of information found in EHRs is the diagnoses. At the end of the patient visit, final diagnoses, usually represented by a universal code (e.g., ICD-9 [113]), are recorded for insurance purposes. Patients may have more than one code at the end of each admission, depending on their 6 Figure 1.4 Several rows of an admission table in MIMIC-III [66] database. Admission, discharge dates, diagnoses, and subject ID are some of the major attributes for biomedical informatics appli- cations. conditions. Diagnosis information is very useful when designing predictive biomedical informat- ics tasks, such as disease prediction. Other clinical details, such as medication, lab results, and procedures, are often encoded in different tables. Patient data is challenging to process and analyze since the electronic records have various data types (e.g., numerical, symbolic, ratio, interval, ordinal, nominal) as seen in Figures 1.4 and 1.5. Furthermore, medical images and handwritten notes of physicians can also be available in some EHR databases. In summary, EHRs provide a comprehensive patient information with various modalities. 
Analysis, fusion and inference of heterogeneous patient data are some of the key factors leading healthcare improvements [60]. On the other hand, due to its diversity in format, type, and context, physicians and clinical researchers cannot directly harness the raw EHR data. For this reason, biomedical informatics tries to leverage data-driven techniques to retrieve the useful knowledge from digital records. However, designing data-driven models to infer the underlying information from EHRs is a complicated task due to the fact that EHR is an instance of big data. In the following sections, major challenges biomedical informatics needs to tackle and the solutions machine learning can offer are discussed. 1.1.2 Major Challenges The biomedical informatics domain tackles particular challenges stemming from EHR data charac- teristics. One of the most challenging data characteristic is the scale. Each hospital stores hundreds 7 Figure 1.5 Several rows of diagnoses, lab events and patient tables in MIMIC-III [66] database. Subject ID of a patient is generally used to retrieve information from different tables. Diagnoses are represented by ICD-9 codes. Quantitative values of results, time, unit, and type of the lab tests are often stored in lab events table. 8 of thousands patient records in their databases. Sorting and prioritizing this amount of informa- tion cannot be easily handled by conventional data-driven algorithms (e.g., models requiring full gradient descent and data localization). Moreover, to utilize EHRs for algorithm design, rigorous data cleaning and encryption are imperative since patient privacy is an inevitable concern. For these reasons, public healthcare datasets to evaluate the data-driven models are not abundant. Dis- tributed nature of EHR datasets is another issue that often needs to be addressed, especially while designing predictive models. A large proportion of biomedical informatics tasks require predictive analysis, where data diversity plays a crucial role. Each hospital has a similar type of patient in- formation that can serve the same predictive task. In this case, combining EHR datasets collected in different regions into one dataset would enhance the generalization property of the predictive models. However, in practice, patient distribution varies for different locations (e.g., ethnicity, in- come groups, cultural and dietary practices). This violates the training data points drawn from the same stationary distribution premise of machine learning. As a result, predictive models cannot in general be learned for a combination of all the available EHRs from different hospitals. Missing values in EHRs and interpretability of the learning model are some of the other promi- nent challenges that complicate the algorithm development. Missing values are often encountered in electronic records. For instance, the patient table in Figure 1.5 has missing date of birth for many patients. If the missing values cannot be imputed, patients with missing values or the associated type of information are discarded. Elimination of subjects and features because of missing values may significantly reduce the size of the dataset. In addition, the interpretability of machine learning models for healthcare is another important challenge. It is crucial to incorporate the domain exper- tise into the learning scheme in biomedical informatics. 
However, it is often not possible to infer biological meanings from the output and the learned parameters of a computational model, such as deep learning models. Learning interpretable models for biomedical informatics applications is another ongoing research area [105]. One of the most important characteristics of the EHRs is its longitudinal nature. While the medical history of patients is a significant element for disease progression and risk prediction stud- 9 Figure 1.6 Medical record of a patient contains different medical events at each time step. The gap between two consecutive time steps is often irregular in EHR datasets. For this reason, EHRs are harder to analyze compared to univariate regularly sampled time series data. ies, it is also challenging to leverage the temporal aspect of EHRs. Different than standard time series data such as audio signals, EHRs have high dimensional heterogeneous data points chang- ing over time. Furthermore, the sampling rate is usually unknown and the elapsed time between consecutive records of a patient is not uniform throughout the medical history. An illustration of longitudinal records of a patient is given in Figure 1.6. The challenging characteristics of the EHRs also complicates the visualization of the health information. Visualization is an important component of interactive data analytics tools, which enables domain experts to interpret and some- times tune the output of computational techniques. Therefore, it is important to design models that facilitate a visualization approach for the results. 1.1.3 Role of Machine Learning in Biomedical Informatics Biomedical informatics aims to discover underlying characteristics of patient data (e.g., EHRs) and subsequently use the extracted information to predict outcome of healthcare tasks. Machine learning and data mining offer the necessary data-driven techniques for the aforementioned pur- pose. In particular, machine learning provides solutions for challenging biomedical informatics tasks such as personalized medicine [97], patient phenotyping [56], and adverse drug reaction (ADR) prediction [46], where computational techniques are expected to provide clinically sensible 10 Figure 1.7 An example of hierarchical patient phenotyping visualization. Each node in this tree gives a structured clinical phenotype and a stable subcohort characterized by the phenotype. performances. For instance, dimensionality reduction eliminates redundancy in EHRs and enables interpretation for tasks such as patient phenotyping. Consequently, dimensionality reduction meth- ods can also facilitate visualization of patient phenotyping procedure as given in Figure 1.7. In the figure, a tree structure demonstrating the hierarchical clinical phenotypes is shown [9]. Deep learning is commonly employed to offer a solution to complicated healthcare tasks, such as temporal clinical event prediction [28] and concept embedding [29]. Deep auto-encoders can learn distinctive patient representations from heterogeneous and longitudinal EHRs with non-linear interactions. In particular, RNNs are utilized to forecast future clinical conditions of patients, e.g., predicting the future diagnoses and date of the future hospital visit of patients. Machine learning 11 and deep learning algorithms in biomedical informatics also aim to provide efficient ways to inte- grate domain expertise into the learning model, which improves the reliability of the computational techniques. 
In short, machine learning facilitates the extraction of crucial information present in the EHRs that assists physicians in their efforts to treat patients and plan their long-term care. In the following sections, fundamentals of machine learning, challenges, and machine learning solutions to tackle these challenges are summarized.

1.2 Machine Learning

One important factor that drives advances in machine learning research is the growth of digital data and the consequent demand for data analysis and interpretation tools. As more domains adopt machine learning techniques for their data, new machine learning models, which are computationally efficient and capable of analyzing complex data, have been developed. Figure 1.8 shows the expected growth of the world markets for artificial intelligence (AI) systems between 2017 and 2025. As can be seen in the figure, world markets invest in AI technology at an increasing rate. AI, machine learning, and deep learning are all related to machine perception [91]; therefore, the term AI usually refers to machine learning and deep learning applications.

Figure 1.8 Expected growth of the world market for artificial intelligence (AI) systems between 2017-2025 [110].

1.2.1 Problem Setting

In a typical machine learning problem setting, a model is learned from a training set containing input and target pairs, (x, y). An independent test set is necessary to evaluate the learning algorithm. The input x ∈ R^d is a d-dimensional feature vector. Depending on the type of task, i.e., classification or regression, the target y can be an integer, Boolean, or real-valued scalar or vector. In classification problems, the target y is also called the class label. Learning problems where the target set is available, partially available, or unavailable are named supervised, semi-supervised, or unsupervised, respectively. Regardless of the target availability, machine learning algorithms are designed to learn a function that can map the input data x to the target y, i.e., f(x) = y, for supervised problems, or a function that projects the input x into another feature space, where the information contained in the data is amplified, for unsupervised tasks, e.g., clustering. Thus, patterns and trends in the training data are captured in the learned function, a.k.a. the model, to be able to make predictions about the unseen data that needs to be analyzed.

Machine learning describes a family of data-driven approaches where the learning-from-data concept fundamentally consists of an optimization problem. Optimization in machine learning is usually different from standard mathematical optimization. Machine learning aims to optimize the learning model with respect to a specific performance metric (e.g., classification accuracy, area under curve (AUC), F1 score) rather than focusing on optimizing the loss function itself. Hence, the learning procedure iteratively improves the model by optimizing the parameters to attain high performance for the specified metric of the task. Machine learning often employs a first- or second-order iterative gradient descent-based optimization scheme since a closed-form solution is often not available. Figure 1.9 summarizes the overall procedure of the supervised learning problem.

Figure 1.9 Overall procedure of the supervised learning problem. N is the total number of training data points.
However, random sampling intro- duces variance to the framework which results in slow convergence. Variance reduction approaches have been developed to improve the convergence rate of the stochastic methods [101, 121]. High dimensionality is alleviated by dimensionality reduction methods. The purpose of dimen- sionality reduction is not only to decrease the number of the input features, but also to eliminate redundancy in the original representation. Dimensionality reduction methods usually learn a linear or non-linear projection where data is represented in a lower dimensional space and the relevant information in the data is preserved. For instance, principal component analysis (PCA), first de- veloped by Karl Pearson [92], preserves most of the variance present in the data while projecting the original features into a lower dimensional space via a linear transformation. However, one drawback of the traditional PCA is that the principal components are computed as linear combina- tion of all the input dimensions. Therefore, traditional PCA does not enable interpretation of the output dimensions with respect to the input. Sparse PCA [33,86,104], which takes the linear com- bination of only some of the input dimensions, has been introduced to mitigate the interpretability limitation. In machine learning, it is often assumed that the training data is accessible in a local machine. However, real world data can be distributed over different geographical regions. Furthermore, it may not be always possible to send datasets over a network to a local machine due to limited bandwidth and privacy concerns. Even if data could be centralized, datasets cannot be combined together to learn a predictive model since it would violate the assumption that data points are generated from the same probability distribution model. In such cases, machine learning offers multi-task learning (MTL) [21] approach that treats each distributed dataset as a separate task and combines multiple single task models to obtain a composite model with better generalization 16 property. Furthermore, distributed MTL approaches supported by distributed optimization tech- niques [85, 115] have been developed to alleviate the need for data centralization. Another common aspect of the big data is the time dependency. Real world data collected through sensors usually changes with time such as physiological measurements of a patient, or audio and video signals. The temporal structure of the data can provide valuable information and improve predictive performance. For instance, common biomedical informatics tasks include forecasting future diagnoses of patients during their next visit. Therefore, learning temporal de- pendencies between consecutive elements of a sequence is an important step. However, it is not straightforward to extract temporal patterns and incorporate them into learning. This is especially true when multivariate data is changing over time such as complete medical records of a patient. One popular solution providing impressive performance for temporal or sequential data is deep learning. In particular, recurrent neural networks (RNNs) have been commonly used to learn long term dependencies in sequences with complicated structures. 1.3 Dissertation Focus and Contributions In this dissertation, our main focus is to develop machine learning approaches to mitigate some of the challenges in biomedical informatics. 
In particular, machine learning and deep learning approaches are developed to assist computational patient phenotyping, to mitigate learning pre- dictive models for distributed EHRs, to learn a single representation for temporal patient records with irregular elapsed times, and to capture short and long term dependencies in time series with a decoupled memory recurrent neural network. • Computational patient phenotyping, which requires the analysis of large patient populations and interpretation of input features, is addressed by a convex sparse PCA approach [9, 10]. A proximal variance reduced stochastic gradient descent method is used to solve the convex sparse PCA problem. The proposed framework offers an interpretable patient phenotyping approach due to the sparsity, and a time-efficient optimization approach due to stochastic op- 17 timization with variance reduction. Furthermore, fast convergence properties can be attained by considering the convex formulation of sparse PCA, which normally leads to a non-convex optimization problem. • A distributed multi-task learning approach with asynchronous updates [12] is introduced to efficiently utilize distributed EHR datasets. EHRs are stored in different hospitals and they cannot be transferred over a network because of privacy concerns and limited bandwidth. The proposed distributed framework enables learning individual predictive models sepa- rately and transferring a single vector over the network instead of the whole EHR dataset. Knowledge transfer between the single models is performed asynchronously in a central server. The asynchronous nature ensures a more robust learning framework in the presence of network delays and failures compared to the traditional approaches such as centralized and synchronous multi-task frameworks. • A new Long-Short Term Memory (LSTM) network, named time-aware LSTM [11], is pro- posed for longitudinal EHR datasets with irregular elapsed times. Elapsed time between two consecutive patient records, which is a significant element of clinical decision making, usually varies from months to years. This time gap is used to modify the effect of the pre- vious cell memory to the current output in time-aware LSTM. The proposed architecture is deployed in an auto-encoder setting to solve the patient-subtyping problem, which aims to group patients based on similar progression pathways. • Due to the interest and the positive feedback from the research community about time-aware LSTM, we extend the proposed idea to a decoupled memory gated unit architecture, named decoupled memory recurrent network (DM-GRN). The main purpose of the study is to pro- vide a different memory decomposition approach to be able to capture long and short-term dynamics more explicitly. The proposed model is evaluated for healthcare and traffic speed datasets. Memory properties are discussed and visualized using synthetic examples. 18 In the subsequent four chapters, the proposed approaches summarized above will be presented in detail. In each chapter, literature is reviewed, methodology is explained, and experimental results are discussed. A summary and conclusions of this dissertation research are presented in Chapter 6. 19 Chapter 2 PHENOTREE: Hierarchical Phenotyping via Sparse Principal Component Analysis Computational patient phenotyping is a data-driven task of discovering clinical phenotypes in a large patient cohort. The output of the patient phenotyping task is patient groups with similar diagnostic pathways. 
In this study, diagnosis information of the patients is provided with ICD- 9 codes, which are universal codes to depict diagnostic groups. ICD-9 codes (e.g. more than 10, 000 in this study) can be represented with high dimensional sparse vectors. As a result, the computational phenotyping problem becomes obtaining patient groups of similar diagnoses in a large patient cohort using ICD-9 codes. Since the ground-truth patient groups are not available, this task is an unsupervised learning problem. Time efficiency of the phenotyping method and the interpretability of the results are two important factors that facilitate the role of domain experts in the phenotyping procedure. In this chapter, we propose a hierarchical phenotyping approach based on sparse principal component analysis (SPCA). Dimensionality reduction methods are commonly used to analyze high dimensional data. In particular, PCA aims to map high dimensional data into a lower dimensional space where the most salient information in the data is preserved. On the other hand, the output dimensions obtained by PCA cannot be easily interpreted regarding the input di- mensions. SPCA improves the interpretability of the output dimensions by learning a sparse linear 20 transformation. Time efficiency of the patient phenotyping approach is addressed by solving the SPCA problem using a stochastic proximal gradient descent technique with variance reduction. An effective visualization technique is provided to assist physicians in analyzing the patient phenotyp- ing results. In summary, an interpretable and time efficient interactive computational phenotyping tool is introduced in this chapter. 2.1 Introduction Electronic health records (EHRs) provide a digital platform to store comprehensive patient in- formation. The availability of digital patient records facilitates many challenging biomedical in- formatics tasks. One of the challenging tasks is to identify clinical phenotypes that characterize patient cohorts. The term phenotype is originally defined as a composition of observable properties of an organism, as a result of the interactions between its genotype and environmental surround- ings [131]. Clinical phenotypes, on the other hand, represent patient categories with different com- binations of diseases [1]. In the rest of the chapter, the term phenotype will be used for the clinical phenotypes. Obtaining clinical phenotypes facilitates developing the most suitable treatments for patients with specific requirements. For this reason, patient phenotyping can be considered an im- portant step towards personalized medicine. Patient phenotyping requires analysis of large number of patient records to extract the key clinical features that characterize certain patient groups. Given the scale and the complexity of the EHR data, manual extraction of phenotypes by physicians is not feasible. On the other hand, availability of the vast amount of patient data offers many op- portunities for machine learning research to improve patient phenotyping task and patient care in general. For this reason, biomedical informatics resorts to machine learning techniques to assist physicians in extraction of clinical phenotypes from large scale patient cohorts [23, 57, 120, 131]. Inferring clinical phenotypes using machine learning techniques is named computational phe- notyping. Computational phenotyping techniques usually aim to learn the coarse characteristics of the population. 
However, it is more informative to explore finer granularities and hierarchies 21 in phenotypes [82, 111]. In such cases, domain knowledge is necessary to refine computational phenotyping results. For this reason, a visually interpretable and time efficient computational phe- notyping tool is essential to assist physicians in the interpretation and evaluation of the phenotype hierarchy. One of the challenges of developing such tools is the scale of the data, and consequently the long processing time. To overcome the time efficiency and interpretability challenges, we in- troduce PHENOTREE, a visual analytics tool utilizing SPCA to obtain hierarchical phenotypes of large patient cohorts. In particular, PHENOTREE explores hierarchical phenotypes by iteratively applying SPCA, and visualizing the phenotypes at different levels of granularity as a tree struc- ture. Given cohort/sub-cohorts, PHENOTREE employs SPCA to identify key clinical features at each level of phenotype hierarchy. At the end of the PHENOTREE procedure, phenotype of a cor- responding patient group is generated as a set of key clinical features at different granularities. To address the time complexity of standard SPCA approaches, a convex formulation of SPCA is adopted so that a variance reduced stochastic gradient descent solver with fast convergence can be used. Experiments on two real-world EHR patient cohorts are conducted to demonstrate the pheno- typing application of the proposed PHENOTREE. Qualitative assessments show that PHENOTREE can provide clinically plausible results. In addition, time efficiency and convergence properties of the proposed SPCA algorithm are investigated. 2.2 Literature Review In this study, we propose a phenotyping method using SPCA and investigate a first order stochastic gradient descent approach to solve the SPCA problem. For this reason, in this section, we review clinical phenotype discovery methods and stochastic optimization techniques in literature. 2.2.1 Data-Driven Phenotyping Due to its sparsity and noise, raw EHR data is not informative enough to directly utilize in clinical research. Sparsity of the EHRs is often due to the fact that patients have various combinations 22 of different diseases (some diseases can be rare in the cohort). As a result, when the patient in- formation is represented with a fixed-length vector (e.g., size is based on the dictionary of unique diagnoses), patient vectors will be sparse. EHR data is also noisy due to errors and anomalies in EHR softwares. Computational phenotyping provides a more stable and robust representation for patients than their raw EHR. As discussed by Zhou et al. [131], extracting phenotypic patterns of patients is an important task which can contribute to the personalized medicine. Authors pro- posed that the clinical features in EHR data can be mapped to a much lower dimensional latent space and utilized a matrix completion approach to extract patient phenotypes [131]. Ho et al. [57] proposed a phenotyping method, named Marble, by introducing a sparse non-negative tensor fac- torization approach to obtain phenotype candidates. Ho et al. also defined the properties of an ideal phenotype, such as representativeness of the complex interactions between several sources and interpretability. Deep learning has also been successfully utilized to solve biomedical informatics tasks. For instance, a deep model is used for the discovery and detection of characteristic patterns in clinical data [23]. Che et al. 
showed that deep neural networks can learn relevant features for medical applications. The authors state that the proposed framework improves multi-label classification performance for tasks such as predicting ICD-9 codes. On the other hand, Marlin et al. proposed an unsupervised approach to computational phenotyping [84]. The temporal sparsity of EHR data is addressed using a probabilistic clustering model with an empirical prior distribution, which was used to deal with the sparsity of the data. The authors also state that the proposed model can capture physiological patterns, and that the clusters can distinguish different physiological variables [84].

2.2.2 Sparse Principal Component Analysis and Stochastic Proximal Optimization

SPCA was proposed for the first time by Hastie et al. [61] to address the interpretability issue of PCA. Traditional PCA learns dense loading vectors, so that the principal components are linear combinations of all the input dimensions. In this case, it is hard to interpret the principal components with respect to the contributions of the input dimensions. On the other hand, Hastie et al. proposed to learn sparse loading vectors using the lasso (elastic net) constraint, so that the principal components become linear combinations of only some of the input dimensions. In addition, d'Aspremont et al. proposed an SPCA approach based on semi-definite programming [33]. On the other hand, Journee et al. introduced two types of SPCA approaches [68]. The proposed formulations are based on maximizing a convex function on a compact set using ℓ1 or ℓ0 norms. However, large scale and high dimensional patient records cannot be handled by these approaches. A stochastic SPCA algorithm which can deal with large scale data with an exponential convergence rate has been proposed [104]. Furthermore, an approach with stochastic iterations and variance reduction, which had been previously proposed [67], is utilized. The variance reduction mechanism requires strong convexity; however, SPCA is a non-convex problem. Johnson and Zhang [67] provided a different convergence analysis that does not rely on strong convexity.

In this study, we propose to use a stochastic approach within a variance reduction framework. To be able to leverage the fast convergence of convex problems, a convex formulation of PCA [48] is adopted. The SPCA is posed as an ℓ1 norm regularized optimization problem; therefore, a proximal gradient descent approach is required. In the literature, several proximal gradient based methods have been developed, such as [13, 119]. FISTA by Beck and Teboulle [13] offers the fastest convergence rate among the first order methods. However, FISTA ensures the fast convergence rate for full gradient descent, which is not scalable. Therefore, stochastic gradient approaches are more suitable for large scale problems. On the other hand, stochastic gradient descent suffers from high variance, leading to a slow convergence rate. Nitanda [88] introduced a variance reduction framework with Nesterov's acceleration method to alleviate the aforementioned drawback of the stochastic gradient approach. Johnson and Zhang [67] also introduced a variance reduction approach which progressively reduces the variance in proximal gradient descent. Strong convexity of the objective function is required to achieve a geometric convergence rate under expectation [67].
Another variance reduction approach for proximal algorithms was proposed by Xiao and Zhang [121], where a multi-stage scheme is presented under strong convexity and Lipschitz continuity assumptions.

2.2.3 Visual Analytics for EHR

Visualization is essential in biomedical informatics to ensure that the domain experts can interpret the output of data-driven models. For this purpose, Perer et al. proposed an interactive mining and visualization framework, named Care Pathway Explorer, to capture frequent events in the EHR data [95]. Another interactive visualization approach based on mining the EHRs and analyzing clinical sequences was developed by Gotz et al. [50]. Wang et al. focused on a visual analysis method for chronic kidney disease [114]. Huang et al. similarly utilized an interactive approach to classify patients into different subsets [59]. The authors developed a visually rich web-based application that can help physicians and researchers comprehend and study patient cohorts over time.

2.3 PHENOTREE

In this section, the proposed PHENOTREE is introduced. In the subsequent sections, details of the EHR data used in this study, the proposed convex SPCA approach, and the optimization scheme are presented.

2.3.1 Electronic Health Records

EHRs comprise digital patient information of different modalities (e.g., diagnosis, medication, test results, and demographics) collected over a time period [54]. Diagnostic information is often recorded as international codes, such as ICD-9. Each ICD-9 code corresponds to a specific diagnosis, and there is a hierarchical relationship between the codes. Some of the ICD-9 codes and their corresponding diagnostic groups are given in Figure 2.1. As shown in the figure, ICD-9 codes from 001 to 139 represent infectious and parasitic diseases, and one of the subgroups of this range, ICD-9 010 to 018, corresponds to tuberculosis. Patient demographics might also be explicitly provided in EHR datasets; however, a patient's gender and age group can sometimes be inferred from the ICD-9 codes. For example, ICD-9 630 to 679 encode complications of pregnancy and childbirth. Hence, even though the patient demographics are not available, the gender of the patient can be inferred from ICD-9 630 to 679. Due to its rich information content, ICD-9 is an important source for the computational patient phenotyping task. To be able to utilize ICD-9 codes for the exploration of phenotypes from a large patient cohort, biomedical informatics resorts to machine learning and data mining techniques. However, computational phenotyping results cannot have direct clinical implications until they are validated by medical experts. For this reason, it is essential to visualize the extracted phenotypes such that the medical experts can approve and refine the results. In this study, a visualization approach is proposed to be able to interpret the extracted phenotypes. Details of the proposed patient phenotyping approach based on SPCA are presented in the next sections.

Figure 2.1 ICD-9 codes are used to represent different diagnostic groups. These universal codes have a hierarchical structure.
Machine learning and data mining techniques facilitate the extraction and visualization of hierarchical phenotypes due to their capability of inferring underlying patterns in large datasets. The computational phenotyping task discussed in this study is an unsupervised learning problem, where the ground-truth patient sub-groups are not available. For this reason, we need to design a machine learning model that can extract key clinical features from an EHR dataset to represent a group of patients without any supervision. Such key clinical features can be obtained based on the frequencies of diagnoses (e.g., ICD-9 codes) encountered in the EHRs. However, a frequency based approach ignores the dependencies and the hierarchical relationships between the features. To address the aforementioned challenges, we propose a patient phenotyping approach based on SPCA, an unsupervised dimensionality reduction technique. PCA, expressed in Eq. 2.3.1, is one of the most popular unsupervised dimensionality reduction approaches:

max_{Z ∈ R^{d×p}}  ||SZ||_F²   s.t.  Z^T Z = I,    (2.3.1)

where d is the input dimensionality, p is the number of principal components or the output dimensionality, S is the d × d covariance matrix, Z is a d × p orthogonal projection matrix, and ||·||_F denotes the Frobenius norm. Eq. 2.3.1 represents a constrained optimization problem, where an orthogonal transformation matrix is learned from the data. PCA ensures that the top principal components retain most of the variance existing in the data. Thus, the original data can be projected into a lower dimensional space without losing the significant information. However, PCA is not suitable for interpreting the output dimensions. Since the columns of the projection matrix Z (loading vectors) are dense, the principal components are computed as a linear combination of all the input dimensions. SPCA addresses this drawback by learning sparse loading vectors [61], where only a subset of the input dimensions is combined to obtain principal components, as shown in Figure 2.2. Therefore, SPCA is more interpretable in terms of analyzing the contributions of the input dimensions to the principal components. For this reason, SPCA is used to design the proposed patient phenotyping approach. The key clinical features can be defined as the input dimensions that correspond to the non-zero elements of the sparse loading vector. Thus, a set of sub-cohorts, each of which comprises the patients associated with one of the key features, can be obtained. Therefore, SPCA is applicable to the phenotyping task, and the sparsity enables a hierarchical representation that facilitates the visualization. On the other hand, traditional SPCA methods have a high time complexity that prevents designing interactive tools. To alleviate the time complexity of SPCA, a stochastic convex SPCA approach is introduced. The details of the proposed model are presented in the following section.

Figure 2.2 SPCA learns a sparse loading vector. Principal components are represented as a linear combination of a subset of the input dimensions.

2.3.3 Stochastic Convex SPCA (Cvx-SPCA)

Existing SPCA methods usually suffer from scalability, which is an obstacle for developing efficient patient phenotyping tools. For this reason, we propose a convex formulation to be able to leverage the fast convergence of a stochastic approach with variance reduction. In this study, we focus on the first principal component.
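For contrast with the sparse formulation developed next, the following small sketch (on synthetic placeholder data) computes the standard leading principal component of Eq. 2.3.1 for p = 1 from the covariance matrix. Its loading vector is dense, so every input dimension contributes, which is exactly the interpretability limitation that SPCA removes.

```python
# Illustrative sketch: the standard (dense) leading principal component of
# Eq. 2.3.1 for p = 1, computed from the sample covariance matrix. The loading
# vector is dense, so every input dimension contributes to the component.
import numpy as np

rng = np.random.RandomState(0)
n, d = 500, 20
X = rng.randn(n, d)
X -= X.mean(axis=0)                      # center the features

S = (X.T @ X) / n                        # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigendecomposition of S
z = eigvecs[:, -1]                       # loading vector of the leading component

print("non-zero loadings:", np.sum(np.abs(z) > 1e-8), "out of", d)  # typically d
```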
Finding the sparse loading vector of the first principal component can be posed as an ℓ1 norm (||·||_1) regularized optimization problem as given in Eq. 2.3.2:

min_{z ∈ R^d}  −z^T S z + γ ||z||_1,    (2.3.2)

where the d-dimensional vector z is the loading vector of the first principal component, S = (1/n) Σ_{i=1}^{n} x_i x_i^T is the covariance matrix, and γ is the regularization parameter that controls the sparsity of z. This formulation contains a smooth first term and a non-smooth second term. Proximal gradient descent provides a typical solution for composite formulations such as Eq. 2.3.2. The proximal gradient descent method divides the optimization problem into two simple steps. In the first step, the next search point is updated using the gradient of the smooth part. In the second step, the optimal point is obtained by evaluating the proximal operator of the non-smooth part at the updated point obtained from the first step. As discussed earlier, proximal methods [13, 119] provide fast convergence rates; however, the full gradient is required in each iteration. Hence, the computational time increases drastically for large scale datasets such as EHRs. Stochastic proximal gradient descent (Prox-SGD), where the gradient is computed for a single data point at each iteration, is more scalable. However, the stochastic methods suffer from slow convergence due to the high variance. The variance is introduced by the random sampling at each iteration. To deal with the variance, stochastic algorithms adopt a diminishing step size, which increases the number of iterations needed to converge to the optimal solution.

Stochastic proximal gradient methods with variance reduction have been proposed to progressively decrease the variance and avoid the slow convergence. For instance, a proximal stochastic gradient approach with variance reduction (Prox-SVRG) [121] is available to alleviate the effects of the high variance. Prox-SVRG provides a geometric convergence rate, which is much faster than the traditional Prox-SGD. On the other hand, the fast convergence of Prox-SVRG depends on the convexity of the objective and the Lipschitz continuity of the gradient. Since the formulation in Eq. 2.3.2 has a non-convex objective function, we cannot leverage the convergence properties of Prox-SVRG. For this reason, we propose to formulate the SPCA with the following convex optimization problem [10]:

min_{z ∈ R^d}  ( (1/2) z^T (λI − S) z − w^T z ) + γ ||z||_1,    (2.3.3)

where λ > λ_1(S) is the convexity parameter, λ_1(S) represents the largest eigenvalue of the covariance matrix S, and w ∈ R^d is a random vector that makes sure the first order derivative of the smooth term is not zero. The first term of this formulation, which was proposed as an approximation of finding the first principal component [48], is strongly convex. If λ is greater than the largest eigenvalue of S, then λI − S is positive definite, which is a necessary condition for strong convexity. On the other hand, the regularization term, ||·||_1, is not strongly convex, but convex. Therefore, the composite function contains a strongly convex term and a convex term, which make the overall function strongly convex.

2.3.3.1 Optimization Scheme

In this thesis, the Prox-SVRG approach [121] is used to solve the problem in Eq. 2.3.3, which is the combination of a smooth part F(z) = (1/2) z^T (λI − S) z − w^T z and a non-smooth part R(z) = ||z||_1.
In this case, F(z) can also be written as the sum of n smooth functions as given below:

F(z) = (1/n) Σ_{i=1}^{n} [ (1/2) z^T (λI − x_i x_i^T) z − w^T z ]    (2.3.4)

When n in Eq. 2.3.4 is large, computing the full gradient of the smooth part at each iteration will be very time consuming. In contrast, the Prox-SVRG [121] approach computes the gradient at a randomly sampled data point in each iteration, and the variance of the gradient is upper bounded by a multi-stage progressive variance reduction scheme. Furthermore, the variance is ensured to converge to zero as the optimal solution is approached. A detailed proof of bounding the variance can be found in Section 3.1 of [121]. In this study, the Prox-SVRG optimization scheme given in Algorithm 1 is adopted to solve the SPCA formulation in Eq. 2.3.3.

Algorithm 1 Prox-SVRG algorithm for solving Cvx-SPCA.
Require: λ, [x_1, x_2, ..., x_n], S, w, z_0, η, γ, m, T
Ensure: z
1: for s = 1, 2, ..., T do
2:   z̃ = z̃_{s−1}
3:   ṽ = (λI − S) z̃ − w
4:   z_0 = z̃
5:   for k = 1, 2, ..., m do
6:     Pick x_{i_k} ∈ {x_1, ..., x_n} randomly
7:     v_k = (λI − x_{i_k} x_{i_k}^T)(z_{k−1} − z̃) + ṽ
8:     z_k = prox_{ηγ}(z_{k−1} − η v_k)
9:   end for
10:  z̃_s = (1/m) Σ_{k=1}^{m} z_k
11: end for
12: return z̃_T

In Algorithm 1, z_0 is the initial value of the loading vector z, which is usually randomly sampled from a normal distribution, η is a constant step size, m is the number of iterations for each epoch s, and T is the maximum number of epochs. The most time consuming component of the algorithm is the computation of the full gradient ṽ, which requires the multiplication of a d × d matrix with a d-dimensional vector in each epoch. However, this multiplication needs to be performed only once before the inner iterations, and the same copy of the full gradient is used to iteratively reduce the variance of the stochastic gradient. In Algorithm 1, z̃ denotes an estimate of the optimal point, which is updated as the average solution obtained throughout an epoch. At each iteration, one data point is randomly sampled and the gradient is computed at that point as given below:

v_k = ∇f_{i_k}(z_{k−1}) − ∇f_{i_k}(z̃) + ∇F(z̃) = (λI − x_{i_k} x_{i_k}^T)(z_{k−1} − z̃) + (λI − S) z̃ − w,    (2.3.5)

where ∇F(z̃) is the average gradient of the functions f_i(z), i = 1, ..., n, i.e., the full gradient at the point z̃, ∇f_{i_k}(z_{k−1}) is the gradient of the function associated with the data point x_{i_k} sampled at the k-th iteration, and z̃ is the average of z_k, k = 1, ..., m, at the end of an epoch. Although the gradient in Eq. 2.3.5 is different from the actual gradient, it is still an estimate of the full gradient, which can be shown by taking the expectation of v_k. Thus, the variance reduced gradient has the same direction as the full gradient under expectation. After the gradient update of the smooth part, [z_{k−1} − η v_k], the proximal mapping of the ℓ1 norm is applied to obtain the final solution as follows:

z_k = prox_{ηγ}(z_{k−1} − η v_k) = sign(z_{k−1} − η v_k) max(0, |z_{k−1} − η v_k| − ηγ)    (2.3.6)

In the Prox-SVRG algorithm, the variance of the stochastic gradient v_k is reduced progressively, while both z̃ and z_{k−1} are converging to the optimal point z* = arg min_z P(z) [121]. Since the full gradient is utilized to modify the stochastic gradients and the function F is an average of the smooth component functions, the variance can be bounded. In the next section, the convergence analysis of Prox-SVRG is summarized.
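Before turning to the convergence analysis, the following is a minimal sketch of Algorithm 1 in code, using the soft-thresholding operator of Eq. 2.3.6 as the proximal mapping. The synthetic data, the choice of λ slightly above λ_1(S), the step-size heuristic based on a rough bound of L_Q, and the sparsity read-out are illustrative assumptions, not the exact settings of the experiments in Section 2.4.

```python
# Minimal sketch of Algorithm 1: Prox-SVRG applied to the Cvx-SPCA problem of
# Eq. 2.3.3, with the l1 proximal mapping (soft-thresholding) of Eq. 2.3.6.
import numpy as np

def soft_threshold(u, thresh):
    """Proximal mapping of the l1 norm (Eq. 2.3.6)."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def cvx_spca_prox_svrg(X, lam, eta, gamma, m, T, seed=0):
    """min_z 0.5 * z'(lam*I - S)z - w'z + gamma*||z||_1 via Prox-SVRG."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    S = (X.T @ X) / n                        # covariance matrix
    w = rng.randn(d)                         # random vector keeping the gradient non-zero
    z_tilde = rng.randn(d)                   # initial loading vector
    for _ in range(T):                       # epochs
        v_tilde = lam * z_tilde - S @ z_tilde - w   # full gradient at z_tilde
        z, z_sum = z_tilde.copy(), np.zeros(d)
        for _ in range(m):                   # stochastic inner iterations
            x_i = X[rng.randint(n)]          # one randomly sampled data point
            diff = z - z_tilde
            v = lam * diff - x_i * (x_i @ diff) + v_tilde   # Eq. 2.3.5
            z = soft_threshold(z - eta * v, eta * gamma)    # Eq. 2.3.6
            z_sum += z
        z_tilde = z_sum / m                  # epoch average (line 10 of Algorithm 1)
    return z_tilde

# Usage on synthetic data; lam is set just above the largest eigenvalue of S.
rng = np.random.RandomState(1)
X = rng.randn(2000, 100)
S = (X.T @ X) / X.shape[0]
lam = np.linalg.eigvalsh(S).max() + 1.0
L_Q = max(lam, np.max(np.sum(X ** 2, axis=1)))   # rough bound on max_i L_i
z = cvx_spca_prox_svrg(X, lam=lam, eta=0.1 / L_Q, gamma=1.0, m=500, T=20)
print("loadings with |z_j| > 1e-3:", int(np.sum(np.abs(z) > 1e-3)), "of", X.shape[1])
```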
2.3.3.2 Convergence Analysis

The objective function proposed in this study is suitable to follow the convergence analysis of Prox-SVRG. Therefore, our analysis is mostly adapted from [121]. However, we use weaker conditions, which allow a broader family of objective functions to fit in this scheme and leverage the geometric convergence. The first assumption is retained as in [121].

Assumption 2.3.1. The function R(z) is lower semi-continuous and convex, and its effective domain, dom(R) := {z ∈ R^d | R(z) < +∞}, is closed. Each f_i(z), for i = 1, ..., n, is differentiable on an open set that contains dom(R), and their gradients are Lipschitz continuous. That is, there exist L_i > 0 such that for all z, y ∈ dom(R),

||∇f_i(z) − ∇f_i(y)|| ≤ L_i ||z − y||,

which also implies that the gradient of the average function F(z) is Lipschitz continuous, i.e., there is an L > 0 such that for all z, y ∈ dom(R),

||∇F(z) − ∇F(y)|| ≤ L ||z − y||,

where L ≤ (1/n) Σ_{i=1}^{n} L_i.

In [121], the convergence analysis assumed that the objective function is strongly convex. On the other hand, we only assume that the functions F(z) and R(z) are convex, but not necessarily strongly convex. Thus, one strong assumption is relaxed. Strong convexity provides important properties for faster convergence rates. However, this strong assumption does not always hold in practice, and for this reason, a simplified version of the analysis is preferable. We drop the strong convexity assumption at two points in the original analysis [121] and obtain the convergence rate given in the following theorem.

Theorem 2.3.2. Under the assumption that Assumption 2.3.1 holds and 0 < η < 1/(4 L_Q), where L_Q = max_i L_i, the convergence rate is obtained as follows:

ρ = 1 / (ℓ (1 − 4 L_Q η) m η) + 4 L_Q η (m + 1) / ((1 − 4 L_Q η) m) < 1,
E{P(z̃_s)} − P(z*) ≤ ρ^s [P(z̃_0) − P(z*)],    (2.3.7)

where z* = arg min_z P(z).

Proof. The proof of Theorem 2.3.2 starts with investigating the distance between z_k and z*, i.e., ||z_k − z*||². According to the stochastic gradient mapping definition in [121], z_k can be written as z_{k−1} − η g_k:

||z_k − z*||² = ||z_{k−1} − η g_k − z*||² = ||z_{k−1} − z*||² − 2η g_k^T (z_{k−1} − z*) + η² ||g_k||²    (2.3.8)–(2.3.9)

The term ( −g_k^T (z_{k−1} − z*) + (η/2) ||g_k||² ) can be bounded by using the definition of the proximal update as shown below:

z_k = prox_{ηR}(z_{k−1} − η v_k) = arg min_y { (1/2) ||y − (z_{k−1} − η v_k)||² + η R(y) }    (2.3.10)

According to the optimality condition:

z_k − (z_{k−1} − η v_k) + η ξ = 0,    (2.3.11)

where ξ ∈ ∂R(z_k) is a subgradient of R(z) at z_k. If we combine the stochastic gradient mapping definition with the optimality condition, we obtain the following expression:

z_k − (z_k + η g_k − η v_k) + η ξ = 0  ⇒  ξ = g_k − v_k    (2.3.12)

By using the convexity of F(z) and R(z), we can write the following inequality:

P(y) = F(y) + R(y) ≥ F(z_{k−1}) + ∇F(z_{k−1})^T (y − z_{k−1}) + R(z_k) + ξ^T (y − z_k)    (2.3.13)–(2.3.14)

The convergence analysis in [121] utilized the strong convexity of F and R in Eq. 2.3.14. However, we show that strong convexity is not required at this point. Since the gradient of F(z) is assumed to be Lipschitz continuous with Lipschitz constant L, F(z_{k−1}) can also be bounded by using Theorem 2.1.5 in [87]:

F(z_{k−1}) ≥ F(z_k) − ∇F(z_{k−1})^T (z_k − z_{k−1}) − (L/2) ||z_k − z_{k−1}||²    (2.3.15)

If we combine Eqs. 2.3.14 and 2.3.15, we obtain the following inequality.
P(y) ≥ F(z_k) − ∇F(z_{k−1})^T (z_k − z_{k−1}) − (L/2) ||z_k − z_{k−1}||² + ∇F(z_{k−1})^T (y − z_{k−1}) + R(z_k) + ξ^T (y − z_k)
     ≥ P(z_k) − ∇F(z_{k−1})^T (z_k − z_{k−1}) − (L/2) ||z_k − z_{k−1}||² + ∇F(z_{k−1})^T (y − z_{k−1}) + ξ^T (y − z_k)    (2.3.16)

Here, we again use the stochastic gradient mapping, z_k − z_{k−1} = −η g_k, to obtain the following inequality:

P(y) ≥ P(z_k) + ∇F(z_{k−1})^T (y − z_k) + ξ^T (y − z_k) − (L/2) η² ||g_k||²    (2.3.17)

If we substitute ξ with g_k − v_k, and then add and subtract z_{k−1} in the term (y − z_k):

P(y) ≥ P(z_k) + g_k^T (y + z_{k−1} − z_{k−1} − z_k) − (L/2) η² ||g_k||² + (v_k − ∇F(z_{k−1}))^T (z_k − y)
P(y) ≥ P(z_k) + g_k^T (y − z_{k−1}) + ( η − (L/2) η² ) ||g_k||² + (v_k − ∇F(z_{k−1}))^T (z_k − y)    (2.3.18)

Under the assumption of 0 < η < 1/(4 L_Q) < 1/L, the term [ (η − (L/2) η²) = (η/2)(2 − Lη) ] can be taken as η/2. Since (2 − Lη) is between (1, 2) according to the assumption, eliminating (2 − Lη) does not change the inequality. Now we will use the result derived above for the term ( −g_k^T (z_{k−1} − z*) + (η/2) ||g_k||² ) in Eq. 2.3.8:

||z_k − z*||² ≤ ||z_{k−1} − z*||² + 2η (P(z*) − P(z_k)) − 2η Δ^T (z_k − z*),    (2.3.19)

where Δ = v_k − ∇F(z_{k−1}) and z* corresponds to y. The term −2η Δ^T (z_k − z*) can further be bounded by using the proximal full gradient update z̄_k = prox_{ηR}(z_{k−1} − η ∇F(z_{k−1})). If the Cauchy–Schwarz inequality and the non-expansiveness of the proximal mapping ( ||prox_{ηR}(x) − prox_{ηR}(y)|| ≤ ||x − y|| ) are utilized, the following expression can be derived:

−2η Δ^T (z_k − z*) = −2η Δ^T (z_k − z* + z̄_k − z̄_k) ≤ 2η ||Δ|| ||z_k − z̄_k|| − 2η Δ^T (z̄_k − z*)    (2.3.20)

If we insert the arguments of the proximal mappings, z_{k−1} − η v_k for z_k and z_{k−1} − η ∇F(z_{k−1}) for z̄_k, into the non-expansiveness bound, we will have:

−2η Δ^T (z_k − z*) ≤ 2η² ||Δ||² − 2η Δ^T (z̄_k − z*)    (2.3.21)

If we combine the result shown above with Eq. 2.3.19:

||z_k − z*||² ≤ ||z_{k−1} − z*||² − 2η (P(z_k) − P(z*)) + 2η² ||Δ||² − 2η Δ^T (z̄_k − z*)    (2.3.22)

Now, expectations of both sides are taken with respect to the random sample i_k (conditioned on z_{k−1}):

E{||z_k − z*||²} ≤ ||z_{k−1} − z*||² − 2η (E{P(z_k)} − P(z*)) + 2η² E{||Δ||²} − 2η E{Δ^T (z̄_k − z*)}    (2.3.23)

Since z̄_k and z* do not depend on the random sample i_k, E{Δ^T (z̄_k − z*)} = E{Δ^T} (z̄_k − z*) = 0, because E{Δ} = E{v_k − ∇F(z_{k−1})} = E{v_k} − ∇F(z_{k−1}) = 0. The variance of the gradient, E{||Δ||²}, is upper bounded in the Prox-SVRG algorithm, and we will use the result of Corollary 3 in [121], which is E{||Δ||²} ≤ 4 L_Q [P(z_{k−1}) − P(z*) + P(z̃) − P(z*)], where L_Q = max_i L_i, z̃_s = (1/m) Σ_{k=1}^{m} z_k, and z̃ = z̃_{s−1} = z_0 for a fixed epoch. After incorporating the bound of the variance of the gradient into the analysis, the following expression is obtained:

E{||z_k − z*||²} ≤ ||z_{k−1} − z*||² − 2η (E{P(z_k)} − P(z*)) + 8η² L_Q [P(z_{k−1}) − P(z*)] + 8η² L_Q [P(z̃) − P(z*)]    (2.3.24)

Now, if we apply the inequality above repeatedly for k = 1, ..., m and take the expectation with respect to the previous random variables z_1, ..., z_m, then we can obtain the following inequality.
E{||z_m − z*||²} + 2η [E{P(z_m)} − P(z*)] + 2η (1 − 4η L_Q) Σ_{k=1}^{m−1} [E{P(z_k)} − P(z*)]
  ≤ ||z_0 − z*||² + 8η² L_Q [P(z_0) − P(z*) + m (P(z̃) − P(z*))]    (2.3.25)

Since 2η (1 − 4η L_Q) < 2η, z_0 = z̃, and P is convex, so that P(z̃_s) ≤ (1/m) Σ_{k=1}^{m} P(z_k), we can write the following inequality:

2η (1 − 4η L_Q) m [E{P(z̃_s)} − P(z*)] ≤ ||z̃_{s−1} − z*||² + 8η² L_Q (m + 1) (P(z̃_{s−1}) − P(z*))    (2.3.26)

By using Lemma 2.3.3, which relies on a weaker condition than strong convexity, and by applying the above inequality recursively, we derive the convergence rate as follows:

[E{P(z̃_s)} − P(z*)] ≤ ( ( 2/ℓ + 8η² L_Q (m + 1) ) / ( 2η (1 − 4η L_Q) m ) )^s [P(z̃_0) − P(z*)]    (2.3.27)

Lemma 2.3.3. Consider the problem of minimizing the sum of two convex functions:

min_{z ∈ R^d} { P(z) = F(z) + R(z) }

A standard method for solving the above problem is the proximal gradient method. Given an initial point z_0, iterating the proximal mapping shown below generates a sequence that converges to the optimal solution:

prox_R(y) = arg min_{z ∈ R^d} { (1/2) ||z − y||² + R(z) }

Since R(z) is a convex function, the optimal solution of the above problem is also an optimal solution of the following problem for a suitable tuning parameter µ [71, Theorem 1]:

min (1/2) ||z − y||_2²   s.t.  R(z) ≤ µ

By utilizing the optimal strong convexity condition, which is a weaker condition than strong convexity [81], for a convex function R, we have the following inequality for all z ∈ Ω:

P(z) − P(prox_E(z)) ≥ (ℓ/2) ||z − prox_E(z)||²,

where prox_E is the Euclidean projection onto the set E and ℓ is a positive parameter.

In summary, the strong convexity condition is discarded from the convergence analysis, so that the algorithm in [121] is applicable to more generic convex objectives. In the following section, the proposed computational phenotyping scheme using Cvx-SPCA is presented.

2.3.4 Interactive Hierarchical Phenotyping via PHENOTREE

In this problem, each patient is represented by a sparse vector whose elements correspond to ICD-9 diagnosis codes. The size of the vector equals the number of unique ICD-9 codes (dictionary size) present in the patient cohort. If a specific diagnostic group is targeted, a sub-sample of ICD-9 codes can also be used as the vocabulary. Thus, an n × d matrix represents the whole patient cohort, where n is the total number of patients in the cohort and d is the vocabulary size. The procedure for obtaining two-level hierarchical patient phenotypes using PHENOTREE is explained below.

Step 1: Cvx-SPCA is applied to the whole patient population to obtain the non-zero loading values, as illustrated in Figure 2.2. The clinical features corresponding to the non-zero loading values are the input dimensions which contribute to the leading principal component. Therefore, these clinical features are selected as the key features, and a set of phenotypes within the population is defined as the first level of the hierarchy.

Figure 2.3 Computational phenotyping procedure with SPCA. SPCA is iteratively applied until the desired number of levels is reached.

Step 2: First level features obtained in the previous step are used to define sub-populations (patients who have the corresponding diagnosis). The number of sub-populations in the second level is equal to the number of first level features.
Next, the procedure in Step 1 is applied on the sub-populations associated with each first level feature to obtain the second level key clinical features. The proposed procedure is summarized in Figure 2.3.

We iteratively apply the steps above to expand phenotypes and obtain a hierarchical tree structure. This structure can assist medical experts to (i) explore the diagnostic patterns in the patient cohort, (ii) automatically grow the leaves of the tree by determining the number of times SPCA is applied to sub-populations, and (iii) manually tune the phenotype hierarchy. Interpretation and analysis of hierarchical phenotypes by a domain expert would not be possible with a text based representation. To alleviate this challenge, we utilized the radial Reingold–Tilford tree [17], based on the work by J. Heer and Davies, to visualize the PHENOTREE. A three-level structure is given in Figure 2.4, where levels 1, 2, and 3 are shown.

Each node of the tree, such as in Figure 2.4, gives a structured phenotype and a sub-cohort characterized by this phenotype. In this structure, children nodes depend on their parent nodes, since the sub-population used in children nodes is conditioned on the parent phenotypes. For example, if the phenotype characterized by the diagnosis ICD-9 92 Syphilis has a parent phenotype 808 Pelvis, we denote the phenotype as ICD-9 92→808. Note that a patient may have a feature from only the first level, the first two levels, or all three levels. For instance, in Figure 2.4, there are 3 patients who are diagnosed with ICD-9 373 and ICD-9 185, 32 patients who are diagnosed with ICD-9 373 and ICD-9 626, and one patient with diagnoses ICD-9 373, ICD-9 185, and ICD-9 761. The same patients may have different hierarchical phenotypes as well. For example, one patient could simultaneously possess two phenotypes: ICD-9 373→185→761 and ICD-9 373→185. If we need to assign patients exclusively to one of the phenotypes, the deepest hierarchy is considered. Thus, PHENOTREE provides an informative and visually interactive way of phenotyping the patients by their diagnosis information. These phenotypes can be used to cluster patients or can be used as side information for classification tasks. The proposed approach to construct a PHENOTREE is given in Algorithm 2, where the subroutine PatientSampling selects a patient sub-population with a specific phenotype, and ExpandPhenotype identifies a set of phenotypes of a finer level of granularity by solving the Cvx-SPCA; a sketch of this recursive construction in code is given at the end of this subsection.

Algorithm 2 Construction of a PHENOTREE
Require: Data D, solver parameters O for Cvx-SPCA, number of levels N
Ensure: N-level PHENOTREE T and a set of phenotypes P
1: Initialize tree T = ∅
2: Add a pseudo phenotype to the phenotype stack S: p_0 → S
3: while S ≠ ∅ do
4:   Pop one phenotype p from the stack S
5:   if depth of p is less than N then
6:     S = PatientSampling(D, p)
7:     Compute phenotypes of a finer level of granularity P(p,S) = ExpandPhenotype(p, S; O)
8:     Update T with phenotypes P(p,S)
9:     Push phenotypes in P(p,S) to S
10:  end if
11: end while

Cohort studies require interactive visualization approaches to provide insights into EHR datasets in a comprehensible way [59, 95, 114]. Otherwise, there is a risk of ignoring significant information present in patient cohorts. In PHENOTREE, phenotypes do not have to be expanded uniformly. In the hierarchy in Figure 2.4, for example, not every key feature is expanded to the third level.
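As referenced above, the following is a compact sketch of the recursive construction in Algorithm 2. Here expand_phenotype stands in for one Cvx-SPCA call on a sub-cohort (it simply returns the highest-variance columns as placeholder key features), X is a patient-by-ICD-9 count matrix, and all names and sizes are illustrative assumptions rather than the actual PHENOTREE implementation.

```python
# Compact sketch of Algorithm 2: phenotypes are expanded level by level with a
# stack; expand_phenotype is a placeholder for one Cvx-SPCA call on a sub-cohort.
import numpy as np

def expand_phenotype(X_sub, num_features=3):
    """Placeholder for Cvx-SPCA: return indices of candidate key features."""
    scores = X_sub.var(axis=0)
    return list(np.argsort(scores)[::-1][:num_features])

def build_phenotree(X, num_levels):
    tree = {}                                  # phenotype path -> key features
    stack = [((), np.arange(X.shape[0]))]      # (path of features, patient indices)
    while stack:
        path, patients = stack.pop()
        if len(path) >= num_levels or len(patients) == 0:
            continue
        key_features = expand_phenotype(X[patients])
        tree[path] = key_features
        for f in key_features:                 # one sub-cohort per key feature
            sub = patients[X[patients, f] > 0] # patients having that diagnosis
            stack.append((path + (f,), sub))
    return tree

# Usage on a small synthetic cohort.
rng = np.random.RandomState(0)
X = rng.poisson(0.2, size=(500, 40))           # toy patient-by-code counts
tree = build_phenotree(X, num_levels=2)
for path, feats in sorted(tree.items()):
    print("phenotype", "->".join(map(str, path)) or "root", "key features:", feats)
```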
When some branches are left unexpanded in this way, the expertise of medical researchers should be incorporated into the analysis process to expand the phenotypes further. Therefore, a visual representation is necessary to be able to involve the medical expert in the process. Another important factor is the efficiency of the approach used to obtain phenotypes, since the proposed approach requires iteratively applying Cvx-SPCA on the entire patient cohort and on sub-cohorts. An SPCA formulation which is not sensitive to the scale of the data is needed to avoid high computation time.

2.4 Experiments

Synthetic and real world data experiments were conducted to investigate the time complexity of the proposed Cvx-SPCA algorithm and to demonstrate a phenotyping application of the proposed PHENOTREE. In our experiments, the step size η was chosen following the heuristic 0 < η < 1/(4 L_Q), and L_Q was taken as the largest eigenvalue of the covariance matrix. The iteration number m was chosen as Θ(L_Q / (λ − λ_1(S))), as suggested in [121].

2.4.1 Synthetic Dataset

The synthetic datasets used in this section were randomly generated from a zero-mean, unit-variance normal distribution. First, the convergence of proximal stochastic gradient with variance reduction and traditional proximal stochastic gradient is compared for the convex SPCA scenario. In Figure 2.5, the objective value versus the number of epochs is plotted for different numbers of samples (n) and features (d). The traditional proximal stochastic gradient (prox-SGD) and proximal stochastic variance reduced gradient (prox-SVRG) methods are compared. In Figure 2.5, convergence is observed when the maximum number of epochs is fixed to 50. We also investigated the number of epochs necessary for both algorithms to converge. Therefore, another experiment was conducted to see how fast Cvx-SPCA with prox-SVRG converges to a similar sparsity as Cvx-SPCA with prox-SGD. We again generated a synthetic dataset with 100,000 instances and 10,000 independent dimensions. As can be seen from the result in Figure 2.6, Cvx-SPCA with traditional SGD requires more epochs than Cvx-SPCA with SVRG to converge to similar sparsity patterns.

Secondly, the running times of other SPCA methods and the proposed method are compared for 1,000 dimensional features in Figure 2.7. We ran the algorithms until they reached similar sparsity patterns. The proposed Cvx-SPCA algorithm is more scalable, since a stochastic solver is used and there are no eigenvalue decomposition or SVD steps during the optimization. For instance, [61] requires a singular value decomposition at each iteration, which is a bottleneck in terms of running time; [55] uses an inverse power method based approach; and [86] uses semi-definite programming. According to the experiments on randomly generated data with sample sizes of 100,000, 500,000, and 1,000,000, Cvx-SPCA is observed to handle large datasets better than the baseline methods.

We also investigated the regularization path for the proposed algorithm. The regularization path illustrates how the solution changes with a varying regularization parameter γ, which specifies the level of sparsity. In order to have a suitable level of sparsity, γ should be tuned. One common approach to find an appropriate γ is the regularization path. For this purpose, we first generated a random sample with 10 features and applied the proposed Cvx-SPCA algorithm to obtain the first principal component.
Then, the covariance matrix was reconstructed by using the first principal component corresponding to the largest eigenvalue, with random noise added. Loading values of the principal component were computed with varying values of the regularization parameter γ by using the reconstructed covariance matrix. We started with small γ values, and the loading vector learned from the previous step is used as the initialization for each new Cvx-SPCA step. Results are given in Figure 2.8, where the values of the 10 features are represented by different colored curves. The known principal component can be recovered through the path, which confirms that this is a valid regularization path. When the regularization parameter was around −0.11 (dashed vertical line), the non-zero loading values of the known principal component, which was used to generate the data, are recovered.

Table 2.1 We sample patients with female, male, child, and old-age specific diagnoses. These samples may overlap with each other. For instance, a patient may have dementia and a prostate diagnosis together. We did not include medical conditions which can be encountered for patients of any age and of both genders in these groups of patients. Old patients are assumed to be above 60 years old. The age range of child patients is determined to span infants (birth to 1 year) to adolescence (12-18).

Demographic   Number of features   Number of patients
Female        1,268                130,035
Male          106                  24,184
Old           66                   2,060
Child         596                  38,434

2.4.2 Private Electronic Health Records Dataset

We used a private, large scale EHR dataset comprising 223,076 patients and 11,982 diagnoses over a time span of 4 years. Each diagnosis was represented by an ICD-9 code. Explicit demographic information of the patients and their admission/readmission times were not available. However, some of the ICD-9 codes have particular terms which indicate the gender and age of the patients. For instance, diagnosis codes implying problems in pregnancy or female/male genital organs, or conditions having the term senile in their explanations, can be used to group patients as female/male, young, and old. On the other hand, there were some observed anomalies, such as patients having records for both female and male specific diagnoses, or for both newborn and senile conditions. Since we did not take part in data collection, the reason for these anomalies could not be resolved. Therefore, these kinds of patients were eliminated in our experiments. There were also patients who have very few records in the dataset. Patients who have fewer than five records were also discarded. After data cleaning, the total number of patients retained was 168,431 from the initial pool of 223,076 patients. In Table 2.1, statistics about the female, male, old, and child patients sampled considering the definitions of the ICD-9 codes are summarized. The age range of child patients was determined as ranging from infants (birth to 1 year) to adolescence (12-18), and old patients were defined to be above 60 years old. We should note that there may be female, male, old, and child patients who were not included in these demographic groups, such as patients with diagnoses which are not gender or age specific. The numbers given in Table 2.1 do not give a clear idea about the percentages of age and gender groups in the EHR dataset. For instance, diseases such as hypertension and Alzheimer's are commonly encountered among people above a certain age. However, these problems have lately been observed in younger patients as well.
Each patient has a sparse feature vector where the i-th value gives the frequency of the i-th diagnosis code for the corresponding patient. As discussed before, each ICD-9 code corresponds to one diagnosis, and each diagnosis belongs to a hierarchy. For instance, code 278 is Obesity, 278.01 is Morbid Obesity, and 278.02 is Overweight. Thus, there are sub-groups under the main diagnosis. In our experiments, all the sub-groups related to a particular diagnosis were aggregated (a small sketch of this aggregation step is given at the end of this subsection). As a result, the feature dimensionality was reduced from 11,982 to 927.

We conducted several experiments to visualize the structure of the patient population using the procedure explained in the previous section. A three-level PHENOTREE was generated for the general patient cohort as shown in Figure 2.10, where the ICD-9 description of each disease is also included. The following relationship between layers can be inferred from the figure. If we look at the output features of the patients who have the diagnosis ICD-9 239, Neoplasm Of Unspecified Nature, in the first level, we can see ICD-9 176, Kaposi's Sarcoma, ICD-9 196, Secondary and Unspecified Malignant Neoplasm of Lymph Nodes, ICD-9 693, Dermatitis Due To Drugs, ICD-9 702, Other Dermatoses, and ICD-9 957, Injury to Other and Unspecified Nerves. The corresponding branches of the PHENOTREE can be seen in Figure 2.9. Kaposi's Sarcoma is known as a type of cancer. Unfortunately, neoplasms, in other words, abnormal growths of tissue, can spread to different parts of the body. Therefore, patients who have a diagnosis of neoplasm of unspecified nature may have other types of neoplasms as well. In addition, we can also see dermatological problems in the second level. Cancer treatments such as chemotherapy and radiation therapy can have dermatological side effects such as radiation dermatitis. Another observation was ICD-9 344 Paralytic Syndromes, whose second level diagnoses were obtained as ICD-9 669 Complications of Labor and Delivery, 744 Congenital Anomalies of Eye, Face and Neck, 820 Fracture of Neck of Femur, and ICD-9 891 Open Wound of Knee, Leg and Ankle. Paralytic conditions are not commonly known to occur during birth delivery. However, methods like epidural may have the risk of paralysis. On the other hand, there are features related to the neck or femur, whose serious injuries can be a reason for paralysis. In the general cohort, fractures and injuries were commonly encountered diagnoses. In addition, neoplasms, infectious diseases, and problems of newborns caused by complications of mothers were also observed frequently. In Table 2.2, ICD-9 codes and corresponding definitions of commonly observed conditions are given. Feature descriptions were provided by the private dataset and they can also be found in [44].

Table 2.2 EHR data features which contribute to the output dimensions after the Cvx-SPCA algorithm was applied to the whole patient population (168,431 patients). Feature descriptions were provided by the private dataset and they can also be found in [44].

ICD-9 Code   Description
7            Balantidiasis/Infectious
72           Mumps Orchitis/Infectious
115          Infection by Histoplasma Capsulatum
266          Ariboflavinosis/Metabolic disorder
507          Pneumonitis/Bacterial
695          Toxic Erythema/Dermatological
697          Lichen Planus/Dermatological
761          Incompetent cervix affecting fetus or newborn
795          Abnormal glandular papanicolaou smear of cervix
924          Contusion of thigh/Injury
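As referenced earlier in this subsection, the following is a small sketch of the aggregation step that maps each ICD-9 sub-code (e.g., 278.01) to its parent diagnosis (e.g., 278) and sums the corresponding frequency columns. The code list and counts are toy placeholders, not the private dataset.

```python
# Illustrative sketch of aggregating ICD-9 sub-codes (e.g., 278.01, 278.02) into
# their parent diagnosis (e.g., 278) by summing the corresponding frequency columns.
import numpy as np

raw_codes = ["278", "278.01", "278.02", "401.9", "401.1", "250.00"]
X_raw = np.array([[1, 2, 0, 0, 1, 3],
                  [0, 0, 1, 2, 0, 0]])          # patients x raw ICD-9 codes

parents = [code.split(".")[0] for code in raw_codes]
parent_codes = sorted(set(parents))              # aggregated feature dictionary
col_of = {p: j for j, p in enumerate(parent_codes)}

X_agg = np.zeros((X_raw.shape[0], len(parent_codes)))
for i, p in enumerate(parents):
    X_agg[:, col_of[p]] += X_raw[:, i]           # sum sub-code frequencies

print(parent_codes)    # ['250', '278', '401']
print(X_agg)           # aggregated patient-by-diagnosis frequency matrix
```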
In addition to the general cohort, hierarchical structure of different patient groups in terms of age and gender, as given in Table 2.1, are also investigated. Hierarchical representations of female, male, child, and old patient groups are given in Figures 2.11, and 2.12, respectively. Aforemen- tioned sub-groups yielded their specific diagnoses as well as frequently encountered conditions in the general cohort. For instance, one of the first level features in Figure 2.11a is ICD-9 636, Illegal Abortion and its second level features contained ICD-9 596, Disorders of Bladder, and 37, Tetanus. Abortion under unhygienic conditions and in underground clinics is known to have health risks such as infections and urinary tract disorders that align well with the SPCA second level features. If we look at the sub-group of old patient population in Figure 2.12b, we observe diagnoses such as dislocation and fracture of bones. People older than a certain age such as 80 commonly suffer from fractures especially in femur and pelvis. For example, ICD-9 821 fracture 45 in femur is one of the first level features representing the old patient group. In the second level of ICD-9 821, diagnoses such as ICD-9 268 Vitamin D deficiency, 332 Parkinson’s Disease, some infectious diseases and ICD-9 701 skin disorder were obtained. These diagnoses are known to be commonly encountered among old patients. As a summary, our results show that the proposed method can be used to visualize, interpret the relationships between sub-groups, and consequently enable constructing a phenotype tree of patients via the obtained hierarchical structure. 2.4.3 Diabetes Dataset We conducted patient phenotyping experiments on a publicly available EHR dataset which repre- sents clinical data about diabetes collected between (1999-2008) at 130 US hospitals [107]. There are 101, 767 records in total in this dataset and each patient has 50 features representing race, gen- der, age, number of medications, test results, diagnoses and so on. There is an age range for each patient, and age represented in the figure corresponds to the upper limit of the associated range, so 10 indicates the range [0,10). Each patient has 3 diagnoses with corresponding ICD-9 codes. In the experiments, patients were represented by 729-dimensional feature vectors, where the first 697 dimensions corresponded to the multi-hot representation (sparse binary vector) associated with 697 unique ICD-9 codes present in the dataset. Rest of the features represent number of times patient was admitted in hospital, lab procedures, number of medications, and some test results about dia- betes which have binary values. In addition, the readmission status, which include not readmission, readmission before 30 days (< 30) and after 30 days (> 30) [107], was also given in the dataset. This information was utilized to group patients as readmitted and not readmitted to see how the hierarchical structures change. The same procedure as on private EHR dataset was followed such that the features belonging to the same ICD-9 hierarchy were aggregated (resulting in 697 features in total). Hierarchical structure of the whole patient population is given in Figure 2.13. First level of features were obtained as insulin and ICD-9 diagnoses such as neoplasm, heart disease, hormonal problem and so on. Existence of the insulin among the output features indicates diabetes. 
If we further examine the diagnoses obtained from the patient groups who were prescribed insulin, we can see a wide range of diagnoses such as kidney problems, disorders of the stomach, bacterial infections, and disorders of the adrenal glands. Stomach problems are also commonly encountered among diabetes patients because of medications. Similarly, readmitted (patients readmitted before and after 30 days of discharge) and not readmitted patients were examined as shown in Figure 2.14. Readmission is not only relevant for medical purposes but also for insurance companies [107]. Our interest was to investigate how the types of diagnoses and the hierarchical structure of readmitted patients differ from those of patients who were not readmitted. We should emphasize that the same regularization parameter was used for both patient populations. According to our experiments, it was not possible to distinguish these two populations by looking at the types of output diagnoses. For instance, diseases which may require the patient to get medical attention regularly, such as cancer, are encountered in both groups of diabetes patients. However, it was observed from Figures 2.14a and 2.14b that readmitted patients produced more nodes in the second level. This observation was interpreted as readmitted patients having several records for different diagnoses. Therefore, we could sample enough patients with specific diseases compared to the not readmitted patient population while we were constructing the levels. Graphs of female, male, old, teen and adult patients can also be seen in Figures 2.15 and 2.16, respectively. In summary, exploring insights and interpreting findings about the EHR data visually is possible using the proposed PHENOTREE approach. The proposed system can be helpful for clinical decision support systems since it aids physicians in understanding diagnoses and sub-cohort relationships in a visually interactive way.

2.5 Summary

In this study, a hierarchical phenotyping approach based on SPCA to analyze and visualize diagnostic patterns in EHRs is introduced. We propose to use a convex version of the SPCA problem, which allows us to employ proximal stochastic variance reduced gradient methods alleviating the slow convergence due to the variance of random sampling in traditional stochastic algorithms. Experiments on both synthetic and real world datasets were conducted to evaluate the convergence properties of the proposed formulation. Patient phenotyping results showed that the proposed framework might actually assist medical experts in understanding and analyzing patient populations and the relationships between them.

(a) First level (b) Second level (c) Third level
Figure 2.4 An example of hierarchical phenotyping with stochastic Cvx-SPCA. (a) the first, (b) the second and (c) the third level features of the patient population. This procedure can be applied repeatedly until the desired number of levels is reached.

Figure 2.5 Convergence for synthetic data with n samples and d features. Convergence of the proposed stochastic Cvx-SPCA with variance reduction (prox-SVRG) and without variance reduction (prox-SGD). Proximal stochastic gradient with variance reduction has a faster convergence rate, since the variance caused by random sampling is bounded in prox-SVRG.

Figure 2.6 Convergence of the sparsity pattern in the log scale. Cvx-SPCA with Prox-SGD takes 275 epochs, whereas Cvx-SPCA with Prox-SVRG takes 45 epochs to converge to a similar sparsity pattern.
Figure 2.7 Running times (in seconds) are compared to obtain a similar cardinality of the loading vector. A machine with a 2.8GHz Intel(R) Xeon(R) CPU and 142 GB memory was used in the experiments. The methods denoted by Reg-SPCA and InvPow-SPCA are from [55, 61] and [86], respectively.

Figure 2.8 Regularization path for Cvx-SPCA, where different colored curves represent the values of 10 features. When the regularization term was around −0.11 (dashed vertical line), the non-zero loading values of the known principal component, which was used to generate the data, are recovered.

Figure 2.9 An example branch of the PHENOTREE. The first level diagnosis with ICD-9 code 239 denotes Neoplasm Of Unspecified Nature. Child nodes of the patients who have ICD-9 239 are ICD-9 176 Kaposi's Sarcoma, 196 Secondary and Unspecified Malignant Neoplasm of Lymph Nodes, 693 Dermatitis Due To Drugs, 702 Other Dermatoses, and ICD-9 957, Injury to Other and Unspecified Nerves. Kaposi's Sarcoma is a type of cancer. Patients who have a diagnosis of neoplasm of unspecified nature may have other types of neoplasms as well. In addition, we can also see dermatological issues in the second level.

Figure 2.10 Hierarchical stratification via Cvx-SPCA. Cvx-SPCA is applied on the entire patient population and the features with the largest absolute loading values on the leading principal component are selected. Each feature dimension corresponds to a specific disease.

(a) Female (b) Male
Figure 2.11 Hierarchical stratification via Cvx-SPCA for female and male patients. One of the first level features in Figure 2.11a is ICD-9 636, Illegal Abortion. In the second level, we see diagnoses like ICD-9 596 Disorders of Bladder, and 37 Tetanus, which could be some of the side effects of illegal abortion.

(a) Child (b) Old
Figure 2.12 Hierarchical stratification via Cvx-SPCA for children and old patients. According to our observations, different patient groups tend to have the common diseases illustrated for the general population before, while also yielding their own specific diagnoses.

Figure 2.13 Hierarchical representation of the whole diabetes data.

(a) Readmitted (b) Not Readmitted
Figure 2.14 Hierarchical stratification via Cvx-SPCA for patients who were readmitted and not readmitted in the diabetes database. Injury and poisoning are commonly encountered for not readmitted patients. However, a wider range of diagnoses is observed for readmitted patients.

(a) Female (b) Male
Figure 2.15 Hierarchical stratification via Cvx-SPCA for female and male patients in the diabetes data. We can observe female/male diagnoses along with common diseases. For instance, Figure 2.15b has ICD-9 602, disorder of prostate, and ICD-9 401, hypertension.

(a) Old (b) Teen-adult
Figure 2.16 Hierarchical stratification via Cvx-SPCA for old and teen/adult patients in the diabetes data. Neoplasm is very common across patient populations with a wide range of ages and different genders for both of the EHR datasets examined in this study.

Chapter 3
Asynchronous Distributed Multi-Task Learning

In the previous chapter, we tackled computational patient phenotyping, which was an unsupervised learning problem.
Our goal was to extract key clinical features from a large scale patient cohort, and consequently obtain patient sub-groups with similar diagnostic patterns. Time efficiency and interpretability were two important challenges of the computational phenotyping task. For this reason, a convex sparse principal component analysis technique with fast convergence properties was proposed. In this chapter, we address the distributed EHR challenge in biomedical informatics. In this study, we investigate learning supervised predictive models from EHR datasets located in different geographical regions. Predictive models are commonly designed for healthcare related tasks, such as mortality prediction and disease prediction. Based on the application, the predictive task can be either classification or regression. Regardless of the type of task, predictive modeling assumes that each data point in the training set is drawn from the same probability distribution. On the other hand, EHR datasets collected in different locations have different patient distributions. For this reason, distributed EHR datasets cannot be combined into one large dataset to learn the predictive models, although the predictive tasks are similar. In such cases, multi-task learning (MTL) is used to learn a separate model for each EHR dataset and leverage their shared knowledge to improve the generalization performance. However, a standard MTL formulation requires data centralization, which is not possible for EHR datasets due to privacy concerns and limited bandwidth. Distributed MTL addresses the data centralization issue by only transferring the models over a network. However, the standard distributed MTL approaches are synchronized, which is required by their optimization schemes. Synchronized frameworks are not robust against network delays and failures. In this chapter, an asynchronous distributed MTL framework is introduced to address the aforementioned challenges. The asynchronous nature of the framework prevents the optimization scheme from stalling due to network delays and failures.

3.1 Introduction

Many application domains require learning predictive classification or regression models for multiple tasks. These tasks are not always independent of each other. In fact, multiple tasks are usually related to each other, such as predicting diagnostic outcomes for different types of diseases. In this problem, a single model cannot be learned by utilizing a combination of heterogeneous individual datasets where the data distribution is not the same. However, the overall goal of the individual tasks is usually similar. This indicates that there is shared knowledge between the individual tasks. Multi-task learning (MTL) simultaneously learns the related tasks and performs an inductive knowledge transfer between them to improve the generalization performance. The idea behind MTL is to learn high performance models when there is not enough data for a single task by sharing the predictive information among the related tasks. There are several types of MTL approaches depending on the knowledge transfer technique. The most commonly studied MTL approach is regularized MTL, where the task relatedness is modeled by adding a regularization term to the general loss function. The purpose of the regularization term is usually to couple the individual models via a matrix and enforce a requirement that fulfills the knowledge transfer.
The advantage of regularized MTL is the ability to work with various loss functions, such as least squares, logistic regression, and hinge loss. The regularization term is usually a non-smooth function that makes the task a composite optimization problem. Such composite optimization problems are solved with an iterative proximal gradient descent approach, where each iteration is divided into gradient and proximal update steps. The proximal step usually couples the related tasks based on the type of the regularization.

Figure 3.1 Overview of the proposed asynchronous multi-task learning (AMTL) framework. After a task node computes the gradient, it sends the gradient or the updated model to the central server. The central server couples all the individual models to perform the proximal projection and sends the updated individual models back to the corresponding task nodes.

Traditional MTL assumes that data is centralized during optimization. On the other hand, data centralization is not always feasible since single task data can be located in different geographic regions and considered private. As in the healthcare domain, each hospital stores its own EHR database and the patient information is very sensitive. For this reason, EHR data is not transferred over a network. In addition, the data transfer is time consuming due to the EHR data volume and the limited bandwidth among the nodes. In such cases, distributed optimization techniques facilitate a solution for MTL problems, where each task node is located in a local server. In the distributed optimization scheme, the local server performs the gradient descent and sends the updated intermediate solution to a central server. The proximal step is performed by the central server, where the tasks are coupled and the final solution is obtained. The proximal projection depends on synchronized gradient information from all the task nodes. In other words, the central server, where the proximal step is performed, needs to wait for all the task nodes to finish their computations. This framework can be quite slow due to network communication delays, load imbalance across the task nodes, and different hardware specifications of the local machines. In this study, we propose an asynchronous distributed MTL framework with a linear convergence rate for convex MTL formulations under mild assumptions [12]. An overview of the proposed framework is given in Figure 3.1, which presents a more robust approach against network infrastructures with high communication delays between the central server and the task nodes compared with synchronous distributed MTL. The framework is capable of solving most of the existing regularized MTL formulations. As a case study, the low-rank MTL formulation is elaborated, which transfers knowledge via learning a low-dimensional subspace of task models. Experiments on both synthetic and real-world datasets were conducted to evaluate the proposed framework.

3.2 Literature Review

Distributed MTL utilizes distributed optimization techniques. In particular, regularized distributed MTL requires a distributed proximal gradient descent approach. In this section, related work covering distributed optimization and distributed MTL concepts is presented.

3.2.1 Distributed Optimization

Distributed optimization techniques facilitate the solution of massive optimization problems using hardware advancements.
One popular and commonly used distributed optimization approach is the alternating direction method of multipliers (ADMM), which was first proposed in the 1970s [18]. Boyd et al. described ADMM as a well suited distributed convex optimization method. In this approach, local copies of the solution are introduced for local sub-problems, and the worker nodes and the center node communicate to reach a consensus [18]. Therefore, ADMM is a synchronized distributed optimization instance. Although ADMM was proposed for the large-scale distributed optimization setting, the number of iterations increases due to the requirement of local copies to achieve the same accuracy. On the other hand, Iutzeler et al. proposed an asynchronous approach using randomized ADMM based on randomized Gauss-Seidel iterations of Douglas-Rachford splitting (DRS) [63]. In the proposed asynchronous distributed setting, the overall cost function comprises the individual cost functions of a set of network agents and the objective imposes a consensus for the minimizer of the overall cost function [63]. Aybat et al. introduced an asynchronous distributed optimization approach for the proximal gradient method [8]. The authors used randomized block coordinate descent to minimize composite functions with smooth and non-smooth components. For this purpose, an asynchronous extension of the synchronous distributed first-order augmented Lagrangian (DFAL) algorithm is introduced [8]. Liu et al. similarly proposed an asynchronous stochastic proximal coordinate descent approach [81], but they allowed an inconsistent read mechanism. In this mechanism, elements of the optimization variable are updated by one core while being read by another core in a multicore processor setting. The authors [81] reported a linear convergence rate with a suitable step size under an optimal strong convexity assumption. Peng et al. proposed a more general asynchronous parallel framework, named ARock, for coordinate updates based on fixed-point problems with non-expansive operators [38, 94]. Many commonly known optimization algorithms such as gradient descent, proximal gradient descent, ADMM, and the primal-dual method can actually be formulated as non-expansive operators. For this reason, ARock [94] can be applied to a general spectrum of optimization problems. The approach [94] converts the problem into a fixed-point problem with a non-expansive operator and applies the ARock framework.

3.2.2 Distributed Multi-Task Learning

In many real-world MTL problems, such as predictive modeling using EHRs, data is distributed across different geographical regions. In this scenario, there are two main challenges, namely limited network bandwidth and data privacy. Data transfer over a network is very costly due to network limitations. More importantly, distributed data might contain sensitive information, such as patient records. In such cases, data transfer will not be allowed even with encryption. For this reason, it is necessary to achieve the knowledge transfer among distributed datasets without sharing the raw data. Distributed MTL techniques, where data does not need to be transferred to a central node, have been proposed to address the aforementioned challenge. Since only the learned model, which is usually a single vector (of model parameters), is transferred instead of the whole training data, the cost of network communication is much lower. Dinuzzo et al.
proposed a client-server regularized MTL, where multiple learning tasks arising from distributed datasets are simultaneously learned [36]. In this setting, each client has access to the information content of the data on the server without seeing the raw data. Mateos-Núñez and Cortés also introduced distributed optimization techniques for MTL [85]. In their approach, the objective function contains a separable convex function and a joint regularization to impose low rank solutions. Thus, their problem setting is separable in the local decision variables. In fact, local gradient computations are distributed over a network of agents. The authors also proposed a second solution where a separable saddle-point reformulation is introduced through Fenchel conjugation of quadratic forms. Jin et al., on the other hand, introduced a collaboration between local and global learning for distributed online multiple tasks [65]. Their method learns individual models for streaming data, thus distributed MTL and online learning are combined. Local and global learning are performed alternately in their framework such that the first step is the online learning on local clients and the second step is the global learning performed by the server. However, a subset of raw data is still transferred between the clients and the global server [65]. In 2016, Wang et al. proposed a shared-subspace MTL in a distributed multi-task setting [115]. Their framework comprises several separate machines, where each machine is responsible for one task and has access to only the data of the corresponding task. Similar to the studies discussed thus far, a central node transfers the updated models to their associated machines. However, the optimization requires synchronous updates, such that the central node has to wait for all the machines to finalize their computations. In summary, the reviewed studies usually follow synchronized approaches, which can make the optimization very slow when there are data imbalance, network communication issues, and different hardware specifications of the local machines. For this reason, in our study, we focus on an optimization framework which can perform asynchronous updates, thereby avoiding network delays.

3.3 Distributed Multi-Task Learning

In this section, details of distributed MTL techniques are presented. First, the motivation behind regularized MTL and its mathematical model are revisited. Then, the synchronized MTL mechanism is discussed in detail. Finally, the proposed asynchronous MTL approach is presented.

3.3.1 Regularized Multi-Task Learning

MTL provides a principled way of simultaneously learning multiple models for related tasks to improve the generalization performance. Let us assume there are K supervised learning tasks, and note that vectors are denoted by bold lower case letters and matrices are denoted by bold capital letters. Each task has its own training data denoted by D_k = {x_k, y_k}, k ∈ {1, 2, . . . , K}. Here, x_k ∈ R^{n_k×d} is the feature matrix of task k and y_k is the target. If the task is classification, y_k ∈ R^{n_k} represents the class labels. If the problem is regression, then y_k ∈ R^{n_k×p} comprises real valued target vectors. A linear model parametrized by the vector w_k ∈ R^d is learned by optimizing the loss function of each task k, ℓ_k(x_k, y_k; w_k), which can be either the least squares or the logistic loss depending on the problem. In addition, tasks can be heterogeneous [123] such that while some tasks are regression, others can be classification.
Each task minimizes its corresponding loss function separately. However, these tasks are assumed to share common knowledge, and thus they are not independent. Regularized MTL addresses the task relatedness by imposing a constraint on the single task models to perform knowledge transfer. As a result, learning one task consequently benefits learning the other tasks [21]. One of the most popular constraints to impose task relatedness is the assumption of a low rank model matrix. Let W = [w_1, . . . , w_K] ∈ R^{d×K} be the model matrix of all K tasks. One intuitive idea to learn W is optimizing the joint objective function given as f(W) = Σ_{k=1}^{K} ℓ_k(w_k). However, this approach learns the single tasks simultaneously rather than facilitating a knowledge transfer between them, since f(W) decouples each task w_k. To achieve the knowledge transfer, MTL often utilizes a regularization term as given below [41, 128, 129]:

min_W { Σ_{k=1}^{K} ℓ_k(w_k) + λ g(W) } ≡ f(W) + λ g(W),   (3.3.1)

where g(W) is a penalty term that couples the K tasks and λ is the regularization parameter to control the amount of knowledge to be shared among tasks. For instance, when g(W) is chosen as the nuclear norm ||W||_*, high λ values result in a lower rank solution of W. In other words, high λ values indicate that the individual tasks share a large amount of information. In this study, we focus on one of the commonly used regularized MTL formulations, named shared subspace learning [7]. Shared subspace MTL utilizes the nuclear norm regularizer g(W) = ||W||_* = Σ_{i=1}^{min(d,K)} σ_i(W), where σ_i(W) is the i-th singular value of the matrix W. In other words, the nuclear norm is the tightest convex relaxation of the rank function [42]. As discussed earlier, a low-rank W means that the columns of the W matrix, representing the individual models, are linearly dependent, and thus they share a low-dimensional subspace [123]. The nuclear norm regularizer is a non-smooth function. For this reason, the proximal gradient descent method discussed in Chapter 2 can be used to solve the regularized MTL problem [64]. In the following section, the proximal gradient descent solution in the synchronized distributed MTL setting is discussed.

3.3.2 Synchronized Multi-Task Learning (SMTL)

The MTL formulation adopted in this study has a smooth and a non-smooth term (nuclear norm regularization). The solution of such problems is provided by proximal gradient based optimization techniques. There are several first and second order proximal algorithms in the literature, such as FISTA [13], SpaRSA [119], and PNOPT [76]. The common procedure shared by most of the proximal solvers comprises a gradient update and a proximal projection. In the gradient update step, the gradient of the smooth term is computed. Since the smooth term is the summation of the loss functions of each task, the gradient of each task model is computed and aggregated to obtain the gradient of the smooth term as given in Eq. 3.3.2:

∇f(W) = Σ_{k=1}^{K} ∇ℓ_k(w_k).   (3.3.2)

After the gradient is obtained, the model matrix is updated to obtain an intermediate solution:

Ŵ = W − η∇f(W).   (3.3.3)

The intermediate solution is not the optimal solution since this point is not in our solution domain. Due to the nuclear norm, our solution space is the set of low rank matrices. A proximal projection needs to be performed to map the intermediate solution Ŵ into the solution space. The proximal projection is achieved by solving the following optimization problem:
Prox_{ηλg}(Ŵ) = arg min_W (1/(2η)) ||W − Ŵ||_F^2 + λ g(W),   (3.3.4)

where η is the step size, ||·||_F is the Frobenius norm, and Ŵ is the intermediate point obtained in Eq. 3.3.3. In the standard MTL problem setting, the natural assumption is to centralize the training data of each task. As discussed before, in practice single task datasets are not necessarily located in the same server and data transfer is not always feasible. For this reason, a distributed MTL approach is necessary when the datasets for each task, D_1, . . . , D_k, . . . , D_K, are stored in separate local machines. In this study, each local system with its corresponding single task dataset D_k is named a task node. Task nodes are responsible for the decoupled operations, such as the gradient update of the individual task models. On the other hand, the proximal projection is a coupled operation, which means that the individual updated models need to be transferred to a central server to execute the proximal mapping on W. After the proximal mapping is performed in the central server, the updated models need to be transferred back to the task nodes. The aforementioned distributed setting is known as synchronous, since the central server waits for all the task nodes to finish their executions before the proximal mapping can be performed. In other words, the proximal mapping is executed on the intermediate results obtained at the same iteration. In this setting, the central server is idle until all the intermediate results are received. Similarly, task nodes do not proceed until they receive the updated solution from the central server. For this reason, the synchronized approach is generally slow when (i) the computational power of some of the task nodes is limited, (ii) there are communication delays, and (iii) the data size at the task nodes is unbalanced. In the worst case, one of the nodes might stop working and consequently the optimization procedure may have to be terminated. For these reasons, solving a first-order optimization problem, which requires many iterations to converge to a reasonable precision, is not practical within a synchronous framework.

3.3.3 Asynchronous Multi-Task Learning (AMTL)

To address the disadvantages of the synchronized MTL discussed in the previous section, we propose an asynchronous MTL framework. The proposed model, named AMTL, comprises task nodes and a central server similar to the standard distributed MTL setting. However, in AMTL, the central server does not wait for all the task nodes to finalize their gradient updates to perform the proximal step. Similarly, task nodes do not wait for all the other task nodes to receive their updated models to start the gradient update. In this framework, task nodes need to maintain a copy of W, which may not necessarily be the most current. This inconsistency might hurt the convergence; however, the proposed AMTL framework is implemented within ARock [38, 94], which provides a linear convergence rate. In particular, the task models are learned via asynchronous parallel coordinate updates using Krasnosel'skii-Mann (KM) iterations. In this framework, the forward-backward splitting algorithm is used to perform proximal gradient descent. Each iteration is divided into forward and backward steps, where the forward step computes the intermediate solution using the gradient update and the backward step performs the proximal mapping.
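To make these forward and backward steps concrete before presenting the algorithm, the following sketch performs one full forward-backward iteration of Eqs. 3.3.2-3.3.4 in the centralized (synchronous) form, assuming least squares task losses and the standard singular value soft-thresholding form of the nuclear norm proximal mapping (given later in Eq. 3.4.3). It is only an illustration under these assumptions, not the implementation used in this work.

```python
import numpy as np

def prox_nuclear(W_hat, thresh):
    """Proximal mapping of thresh * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt

def forward_backward_iteration(W, tasks, eta, lam):
    """One synchronous step: decoupled per-task gradients, then one coupled prox.

    tasks: list of (X_k, y_k) pairs, one per task; W holds one column per task.
    """
    W_hat = W.copy()
    for k, (X, y) in enumerate(tasks):              # forward step (decoupled)
        grad_k = 2.0 * X.T @ (X @ W[:, k] - y)      # gradient of ||X_k w_k - y_k||^2
        W_hat[:, k] = W[:, k] - eta * grad_k
    return prox_nuclear(W_hat, eta * lam)           # backward step (coupled)

# Toy usage with 3 tasks of dimension d = 5.
rng = np.random.default_rng(0)
tasks = [(rng.standard_normal((20, 5)), rng.standard_normal(20)) for _ in range(3)]
W = np.zeros((5, 3))
for _ in range(100):
    W = forward_backward_iteration(W, tasks, eta=1e-3, lam=0.1)
```

In the synchronous setting this coupled prox can only run once every task node has reported its gradient update; AMTL relaxes exactly this requirement, as formalized in Algorithm 3 below.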
Algorithm 3 The proposed asynchronous multi-task learning framework
Require: Multiple related learning tasks reside at task nodes, including the training data and the loss function for each task, {x_1, y_1, ℓ_1}, . . . , {x_K, y_K, ℓ_K}; maximum delay τ; step size η; multi-task regularization parameter λ.
Ensure: Predictive models of each task, v_1, . . . , v_K.
Initialize the task nodes and the central server. Choose η_i ∈ [η_min, c/(2τ/√K + 1)] for any constant η_min > 0 and 0 < c < 1.
forall task nodes, asynchronously and continuously, do
    Task node k requests the computation of Prox_{ηλg}(v̂^i) from the central server.
    Retrieves (Prox_{ηλg}(v̂^i))_k from the central server and computes the coordinate update on v_k:
        v_k^{i+1} = v_k^i + η_i [ (Prox_{ηλg}(v̂^i))_k − η∇ℓ_k((Prox_{ηλg}(v̂^i))_k) − v_k^i ]   (3.3.5)
    Sends the updated v_k to the central server.
end forall

In our model, we propose to use backward-forward splitting [32, 93] to solve Eq. 3.3.1, where we reverse the order of the gradient and proximal steps. The order of the steps does not change the convergence but affects the number of network communications between the task nodes and the central server within one iteration. In backward-forward splitting, one iteration starts with the proximal mapping at the central server and ends with the gradient update at the task node. Thus, one-way network communication is needed to finalize one iteration. In the following, we present more details about the solution of the optimization problem in Eq. 3.3.1. According to the optimality condition, the optimal solution W* of a composite function such as {f(W) + λg(W)} should satisfy the following:

0 ∈ ∇f(W*) + λ∂g(W*),   (3.3.6)

where ∂g(W*) denotes the sub-gradient set of the non-smooth part. This condition states that the set comprising the gradient of the smooth term and the sub-gradients of the non-smooth term at W* should include 0 if W* is the optimal solution. Rearranging Eq. 3.3.6 leads to the forward-backward iteration:

−∇f(W*) ∈ λ∂g(W*)
−η∇f(W*) ∈ ηλ∂g(W*)
W* − η∇f(W*) ∈ W* + ηλ∂g(W*).   (3.3.7)

Thus, the forward-backward iteration is expressed as:

W^+ = (I + ηλ∂g)^{-1}(I − η∇f)(W),   (3.3.8)

where (I − η∇f) represents the forward operator on W and (I + ηλ∂g)^{-1} is the backward operator. Eq. 3.3.8 converges to the solution with a value of η ∈ (0, 2/L) under the assumption that the loss function f(W) is convex and L-Lipschitz differentiable with L > 0, and g(W) is closed proper convex. The forward operator is separable since ∇f(W) is decoupled, i.e., ∇f(W) = [∇ℓ_1(w_1), ∇ℓ_2(w_2), · · · , ∇ℓ_K(w_K)], whereas the backward operator is non-separable due to the coupling of the individual models in the proximal mapping. Since we are in a distributed setting, reversing the order of the forward and backward operators is more efficient in terms of network communications. As a result, the backward-forward iteration can be expressed as:

V^+ = (I − η∇f)(I + ηλ∂g)^{-1}(V),   (3.3.9)

where we need to use an auxiliary matrix V ∈ R^{d×K}, since the update variables in the two cases are not the same. However, the final optimal solution W* can be obtained from V* by one additional backward step at the end of the iterations. In the current setting, we follow the coordinate update framework [94], but with a block coordinate update modification, where each task model is a block of variables for the corresponding task.
The update procedure for each task is defined as follows:

v_k^+ = (I − η∇ℓ_k)((I + ηλ∂g)^{-1}(V))_k,   (3.3.10)

where v_k ∈ R^d is the auxiliary variable for the model w_k of task k. As can be seen in Eq. 3.3.10, updating one task block requires one backward step applied on the whole auxiliary model matrix V and one forward step performed on the corresponding task block v_k. The overall framework is given in Algorithm 3, where Eq. 3.3.5 represents the KM iteration. The KM iteration provides a general framework for problems such as finding the fixed point of a non-expansive operator. In fact, the backward-forward operator is a non-expansive operator, and finding the optimal solution of the problem defined in Eq. 3.3.1 is an instance of the fixed point problem [94]. We refer to Section 2.4 of [94] for the derivation of Eq. 3.3.5. One of the main purposes of proximal gradient methods is to address situations where the sub-problems have different difficulties. In our distributed regularized MTL setting, the backward operation, or the proximal mapping, has a closed form solution, and therefore it is easier to compute than the forward step, especially when we have large scale data. Since the gradient computation usually takes more time due to the large number of data points, it is preferable to perform the block coordinate update with the backward-forward iteration. In the AMTL setting, network communication is limited to exchanges between the central server and the task nodes. Task nodes do not communicate with each other. Moreover, the communication between task node k and the central server contains only the model vector v_k, which is much smaller than the dataset D_k stored in task node k. For example, if a 1024-dimensional vector is saved in standard binary format, only 8 KB will be needed to transfer the model vector, compared with sending the entire data matrix (e.g., if we consider 500,000 data points, 4 GB will be required for a 500,000 × 1024 data matrix). Hence, distributed AMTL reduces the network communication cost. Since we only need to send the model vector and not the data matrix itself, privacy concerns are also alleviated. However, the current AMTL model is not a privacy preserving approach. An extension of the proposed model to a privacy-preserving proximal gradient algorithm with asynchronous updates in the distributed MTL setting is introduced in another study [122]. Therefore, the current setting introduced in this chapter can be considered as a first step toward a privacy preserving distributed MTL framework.

3.3.3.1 Asynchronous Updates of AMTL

The asynchronous update mechanism of AMTL is illustrated in Figure 3.2. The multi-task model matrix is stored and updated in the central server. For instance, in Figure 3.2, task node 2 receives its updated model from the central server at time t1. As soon as a new model arrives at a task node, the gradient update, or the forward step, starts. When the task node finalizes the forward step, the updated model is sent back to the central server for the backward step. This leads to an inconsistency at the task node side. If we look at Figure 3.2, we can see that while task node 2 is still performing the forward step, task node 4 has already finished its forward step and sent its updated model to the central server. Since this is an asynchronous scheme, the central node does not need to wait for task node 2 and performs the proximal mapping on the model matrix, which contains the updated model coming from task node 4 and the existing models from the other nodes.
Therefore, when task node 2 is ready to send its model to the central server, the model matrix has already been updated. In other words, an inconsistency occurs at the task node side between the model previously received from the central server and the model stored in the central server after the forward step. This means that the model received by task node 2 at time t1 is not the same copy as the model at time t3 in the central server. Under the aforementioned inconsistency, the linear convergence of AMTL is shown via the following theorem, which follows the convergence analysis presented in [94].

Theorem 3.3.1. Let (V^i)_{i≥0} be the sequence generated by the proposed AMTL with η_i ∈ [η_min, c/(2τ/√K + 1)] for any η_min > 0 and 0 < c < 1, where τ is the maximum delay. Then (V^i)_{i≥0} converges to a V*-valued random variable almost surely. If the MTL problem in Eq. 3.3.1 has a unique solution, then the sequence converges to the unique solution.

This theorem states that if the MTL problem has a unique solution, the proposed asynchronous optimization framework will converge to the unique solution with a constant step size η_i.

Figure 3.2 Illustration of asynchronous updates in AMTL. The asynchronous update scheme has an inconsistency when it comes to reading model vectors from the central server.

3.3.3.2 Dynamic Step Size in AMTL

In this study, all the task nodes are assumed to follow independent Poisson processes and to have the same activation rate [94]. A task node is activated when it performs a gradient update and communicates with the central server for updates. The activation rates of different nodes are assumed to be the same. For instance, the probability of each task node being activated before the other task nodes is 1/K [73], where K is the total number of tasks. On the other hand, different task nodes often have different activation rates in real world network settings due to the topology of the network. We propose a dynamic step size approach to address this fact. In asynchronous updates, the step size is usually much smaller than in synchronous settings to guarantee convergence. The dynamic step size idea was previously used in a specific setting with asynchronous optimization to boost the overall performance [26]. The dynamic step size proposed in this study integrates a time-dependent multiplier into the update of AMTL defined in Eq. 3.3.5:

v_k^{i+1} = v_k^i + c_(k,i) η_i [ (Prox_{ηλg}(v̂^i))_k − η∇ℓ_k((Prox_{ηλg}(v̂^i))_k) − v_k^i ],   (3.3.11)

where the multiplier is given by:

c_(k,i) = log(max(ν̄_{k,i}, 10)),   (3.3.12)
AMTL framework was implemented in C++. Distributed environment was simulated by a shared memory architecture [94] where the network delays were artificially added to the task node to mimic the real world network scenario. Empirical convergence behaviors of AMTL and tradi- tional SMTL were compared on synthetic datasets. Effect of proposed dynamic step size was also investigated on synthetic datasets with various numbers of tasks. Hardware used in the experi- ments was an Intel Core i5-5200U CPU 2.20GHz x 4 dual-core laptop whose the performance was limited by the number of cores. Subsequently, we also developed a standalone Java socket programming implementation of AMTL 1 available in the public domain. 3.4.1 Experimental setting In our experiments, threads represent the task nodes and the number of threads was equal to the number of tasks; the shared memory played the central server role. The proposed framework can work with generic regularized MTL formulations, however our experiments focused on the low- rank MTL formulation for shared subspace learning. Each task was assumed to solve a regression problem with least squares loss(cid:80)K 1Available at https://github.com/illidanlab/AMTL 74 k=1 (cid:107)Xkwk − yk(cid:107)2 2, where Xk, nk, and yk denote data matrix, sample size, and the targets of the task k, respectively. In the shared subspace learning formu- lation given in Eq. 3.4.1, nuclear norm is used as the regularization, which provides low-rank solutions. Nuclear norm couples the model vectors and learns a low-dimensional subspace that actually achieves the knowledge transfer between tasks. (cid:110)(cid:88)K min W k=1 (cid:107)xkwk − yk(cid:107)2 2 + λ(cid:107)W(cid:107)∗ (cid:111) (3.4.1) Nuclear norm, a.k.a trace norm is defined as follows: (cid:107)W(cid:107)∗ = trace(cid:0)WT W(cid:1) = min{d,K}(cid:88) j=1 σj (W) (3.4.2) where σj (W) denotes jth singular value of matrix W. As we can see from its formulation, nuclear norm is not separable and smooth. Therefore, we need to apply the proximal mapping of the nuclear norm [20, 64] in the backward step as given below: (cid:16) ˆVi(cid:17) min{d,K}(cid:88) Proxηλg = max (0, σj − ηλ) uiv(cid:62) i (3.4.3) j=1 = U (Σ − ηλI)+ V(cid:62) where {uj} and {vj} are the columns of U and V, respectively, ˆVi = UΣV(cid:62) is the singular value decomposition (SVD) of ˆVi and (x)+ = max (0, x). When the central node performs the proximal mapping during backward step, the current ver- sion of the models in the shared memory is used. As it was discussed earlier, during the execution of the backward step, some of the models in the shared memory might be changed because of the asynchronous nature of the framework. In nuclear norm regularization, every backward step re- quires singular value decomposition (SVD) of the model matrix W. Complexity of SVD is O (d3), and therefore the backward step is computationally expensive when the data dimension, d, is high. To avoid computing the full SVD in each iteration, online SVD [19] may be preferred. The com- 75 plexity of the online SVD for a p× q rank−r matrix is O (pqr) [19]. SVD is performed once at the beginning and then the left U, right V, and singular value matrices Σ are used to update the SVD of the matrices in subsequent iterations. Whenever a task node changes the corresponding column of the model matrix, central server performs proximal mapping. For this reason, online SVD could be a more efficient approach when we deal with huge number of tasks and high dimensionality. 
3.4.2 Comparison between AMTL and SMTL

Synthetic and real-world data experiments were conducted to compare the computation times of AMTL and SMTL. To simulate a realistic network setting, random network delays were introduced to the task nodes for both AMTL and SMTL.

3.4.2.1 Synthetic Dataset Experiments

In this set of experiments, the computation times of AMTL and SMTL are compared based on public and randomly generated synthetic datasets with varying numbers of tasks, dimensionality, and sample sizes. In Figure 3.3, the computation times for varying numbers of tasks (a), sample sizes (b), and dimensionality (c) are shown, where the blue curves represent SMTL and the red curves represent AMTL. All three plots were obtained for the same fixed number of iterations. While comparing the computation time for different numbers of tasks, the dimensionality of the datasets and the sample sizes were fixed to 50 and 100, respectively. If we look at Figure 3.3 (a), we can observe that the computation time of SMTL increases faster than that of AMTL with an increasing number of tasks. Note that after 150 tasks, the time consumption of SMTL does not increase significantly, which can be explained by the limited number of cores (dual-core machine) used in the experiments. In Figure 3.3 (b), the data dimensionality was fixed to 50, and the number of tasks was 5. The computation time of both the SMTL and AMTL approaches did not increase drastically with the increase in sample size. This can be explained by the fact that the proximal mapping is performed on the model matrix, which is independent of the sample size. Only the forward step, which computes the gradient, is affected by the sample size. For this reason, the computational time of the backward step remains unchanged with increasing sample sizes for both SMTL and AMTL.

Figure 3.3 Computation times of AMTL and SMTL for (a) varying number of tasks when the number of dimensions and the number of data points per node were fixed at 50 and 100, respectively, (b) varying sample sizes with 50 dimensions and 5 tasks, and (c) varying dimensionality of 5 tasks with 100 samples in each task. As expected, the computational time of SMTL is higher than that of AMTL for a fixed number of iterations.

In Figure 3.3 (c), the number of tasks was fixed at 5 and the sample size per task was fixed at 100. The computation times of both approaches increase with dimensionality, which is expected due to the proximal mapping of the nuclear norm. Furthermore, the gap between the AMTL and SMTL computational requirements also gets wider with dimensionality, because SMTL needs to wait longer for the updates at both the task nodes and the central server. In summary, the computation time increases with the number of tasks, the sample sizes, and the dimensionality for both SMTL and AMTL. However, the rate of increase is higher for SMTL than for AMTL, as expected. In the next experiment, different network characteristics were investigated for synthetic datasets with varying numbers of tasks. Fifty dimensional datasets with 100 samples per task were used to compare the computational times of AMTL and SMTL under different amounts of network delay. A summary of the results for 5, 10, and 15 tasks is given in Table 3.1. Similar to the previous problem setting, a regression problem with the squared loss and nuclear norm regularization was taken into account.
The "Network" column of the table represents synthetically generated network delays, where the numbers next to AMTL and SMTL denote the offset values for the amount of delay. For instance, AMTL-5 means that each task node is idle for 5 seconds plus a random number of seconds after it completes the forward step. As we can see from Table 3.1, the increase in the computation time with network delays is significantly larger for SMTL than for AMTL.

Table 3.1 Computation times (sec.) of AMTL and SMTL with different network delays. The offset value of the delay for AMTL-5, 10, 30 was chosen as 5, 10, 30 seconds. The same network settings were used to compare the performance of AMTL and SMTL. AMTL performed better than SMTL for all the network settings and numbers of tasks considered here.

Network    5 Tasks    10 Tasks   15 Tasks
AMTL-5     156.21     172.59     173.38
SMTL-5     239.34     248.23     256.94
AMTL-10    297.34     308.55     313.54
SMTL-10    452.84     470.79     494.13
AMTL-30    902.22     910.39     880.63
SMTL-30    1238.16    1367.38    1454.57

3.4.2.2 Real World Dataset Experiments

We also conducted experiments on three public datasets for multi-task learning. Details about the datasets are summarized in Table 3.2. The School dataset is a popular multi-task learning dataset containing exam records of 139 schools in 1985, 1986, and 1987, provided by the Inner London Education Authority (ILEA) [89]. MNIST is a well known handwritten digits dataset with 60,000 training samples and 10,000 test samples [75]. MNIST was used for 5 binary classification tasks: 0 vs. 9, 1 vs. 8, 2 vs. 7, 3 vs. 6, and 4 vs. 5. The third public dataset is the Multi-Task Facial Landmark (MTFL) dataset [126], containing 12,995 face images with different genders and head poses. MTFL contains facial features, including five facial landmarks and attributes for gender, smiling/not smiling, with/without glasses, and head pose (e.g., frontal, right profile, and left profile). Four binary classification tasks, namely male vs. female, smiling vs. not smiling, with vs. without glasses, and right vs. left head pose, were designed for the multi-task learning setting. For the binary classification tasks, the logistic loss was used. A training time comparison is given in Table 3.3 with different amounts of network delay. For the datasets with a larger number of tasks, the gap between the training times of AMTL and SMTL was bigger.

Table 3.2 Benchmark public datasets used in this study. Sample sizes vary per task. The third column of the table summarizes the minimum and maximum numbers of data points tasks might have in each dataset.

Dataset   Number of tasks   Sample sizes     Dimensionality
School    139               22-251           28
MNIST     5                 13,137-14,702    100
MTFL      4                 2,224-10,000     10

Table 3.3 Training time (sec.) comparison of AMTL and SMTL for the three public datasets. Similar to the synthetic data experiments, AMTL generally requires smaller training time compared to SMTL under different network settings.

Dataset   AMTL-1   AMTL-2   AMTL-3   SMTL-1   SMTL-2   SMTL-3
School    50.40    194.22   231.58   77.44    103.45   460.15
MNIST     50.59    299.79   298.42   92.84    146.87   593.36
MTFL      54.96    83.17    115.46   57.94    114.85   161.67

The experimental results suggest that an asynchronous framework is preferred in a distributed setting with communication delays and different characteristics of the task nodes. Moreover, AMTL provides a more robust optimization scheme than SMTL in terms of network failures. Asynchronous optimization can still continue even if one task node suddenly goes offline. One potential drawback of AMTL is convergence due to the inconsistent updates.
To investigate the convergence property, we conducted another experiment. In Figure 3.4, the change in the objective values of AMTL and SMTL is given for a fixed number of iterations and synthetic data with 5 and 10 tasks. According to our experiments, AMTL did not suffer from slower convergence compared to SMTL.

Figure 3.4 Convergence of AMTL and SMTL under the same network configurations. The experiment was conducted for randomly generated synthetic datasets with 5 and 10 tasks.

3.4.3 Dynamic step size

In this section, the effect of the proposed dynamic step size approach is investigated. The average delay of the last 5 iterations was used to modify the step size for randomly generated 50-dimensional synthetic datasets with 100 samples in each task and a fixed number of iterations. At the end of the iterations, the objective values of each dataset with various numbers of tasks and different delay patterns were observed. Some task nodes have to wait longer than other nodes due to the network delays, and this situation slows the convergence. To compensate for the effect of delays on the convergence, we increased the step size of the nodes which had to wait for a long time in previous iterations by following the formulation in Eq. 3.3.12. The experimental results in Table 3.4 show that the dynamic step size can indeed boost the convergence. We can observe a decrease in the objective value of AMTL with the dynamic step size compared to AMTL with a constant step size for an equal number of iterations, indicating that the dynamic step size contributes to a faster convergence. In addition, the objective values were observed to decrease with an increasing amount of delay when the dynamic step size was used. There was no theoretical analysis of the dynamic step size while designing the proposed approach; however, the empirical results indicate that the dynamic step size could be promising for boosting convergence in distributed settings with network delays.

Table 3.4 Objective values of the synthetic dataset with 5, 10, and 15 tasks. Objective values for different network settings are shown. The dynamic step size yields lower objective values at the end of the last iteration than the fixed step size. This result indicates that the dynamic step size in the asynchronous distributed setting is recommended to boost the convergence.

           Without dynamic step size       With dynamic step size
Network    5 tasks   10 tasks  15 tasks    5 tasks   10 tasks  15 tasks
AMTL-5     163.62    366.27    559.07      144.83    334.24    508.65
AMTL-10    163.59    367.63    561.68      144.77    333.71    505.64
AMTL-15    163.56    366.26    561.87      143.82    333.13    500.05
AMTL-20    168.63    366.35    561.21      143.50    331.13    499.97

3.5 Summary

In this chapter, an asynchronous framework, AMTL, is proposed for distributed multi-task learning. Distributed datasets are commonly encountered in many domains, such as EHRs in hospitals located in different geographical regions. AMTL performs asynchronous distributed block coordinate descent with a backward-forward splitting scheme on regularized MTL formulations. Experimental results presented the efficiency of AMTL compared to its synchronous version, SMTL, in terms of computational time. A dynamic step size strategy is introduced to boost the convergence performance in distributed settings with network delays.

Chapter 4
Patient Subtyping via Time-Aware LSTM Networks

In the previous chapters, we addressed the challenges due to the scale, interpretability, and distribution of the EHR datasets. In this chapter, the temporal aspect of the EHRs is taken into account.
EHR data essentially comprises the medical history of patients. This means that we have the opportunity to investigate the changes and the dependencies between consecutive patient records. In recent years, state-of-the-art performance for sequential data analysis has been achieved by recurrent neural networks (RNNs) and their variants (e.g., LSTM and the Gated Recurrent Unit (GRU)). In this chapter, we introduce a new LSTM architecture, named time-aware LSTM (T-LSTM), to address the analysis of longitudinal EHRs with unevenly sampled time steps. The medical history of patients is one of the most important components of clinical decision making. The temporal structure of patient records is also crucial in disease progression studies. One prominent task in disease progression studies is patient subtyping. Subtyping aims to group patients based on similar disease progression patterns. In particular, patient subtyping corresponds to clustering temporal patient records. An important component of this task is to learn a single representation for the sequential patient data such that the temporal structure can be captured and embedded into the representation. Standard RNN architectures, such as LSTM, assume that the time gap between consecutive elements is uniform throughout the sequence. However, EHR data naturally has irregular elapsed times, where the time between consecutive records can vary from weeks to years. In this study [11], T-LSTM is designed based on the hypothesis that if there is a long time gap between two consecutive records, the effect of the previous time step on the current output should be reduced. Thus, T-LSTM learns the temporal patterns while considering the irregular elapsed times, which corresponds to a more realistic problem setting. For patient subtyping purposes, a single distinctive representation is learned for each patient sequence using a T-LSTM auto-encoder, and the patient representations are clustered into subtypes.

4.1 Introduction

The medical history of patients is an important component of clinical decision making. EHRs store the digital medical history of patients, including test results, procedures, medications, and diagnoses. Clinical research also benefits from the medical history to discover unknown patterns in patient cohorts, to investigate the prognosis of different types of diseases, and to study the effects of drugs. In short, large-scale, systematic and longitudinal patient records play a key role in the healthcare domain. EHRs provide a rich source of such longitudinal data; however, as discussed in Chapter 1, analyzing EHRs is challenging. Biomedical informatics resorts to machine learning approaches to alleviate these challenges [23, 57, 84, 120, 131] and tries to address difficult tasks, such as disease progression modeling and risk prediction [28, 30, 39, 80, 130, 132]. Patient subtyping is one of the disease progression modeling tasks, investigating patient groups with similar disease progression pathways. The subtyping task is a crucial step of personalized medicine, which prescribes treatments that best fit the health conditions of a group of patients. Patient subtyping usually takes a particular type of disease into consideration [23], such as Parkinson's disease. Since it is essentially a grouping task, patient subtyping is considered an unsupervised learning problem. In particular, an approach that can cluster time series data is required.
One important note is that time series analysis approaches based on aligning the time sequences and computing the similarity between them are not suitable for the patient subtyping task. Here, every element of the time sequence contains heterogeneous information, mostly in the form of high dimensional vectors. For this reason, we need to use a more extensive approach that can address the complexity of the temporal EHR data.

Figure 4.1 An example segment of longitudinal patient records where the patient had 6 office visits between Sept 5, 2015 and Sept 27, 2016. In each visit, medical information such as the diagnoses of the patient is recorded. Diagnoses are usually encoded as ICD-9 codes. The time gap between two successive visits varies.

One efficient method that can capture temporal patterns is the recurrent neural network (RNN), which has been successfully applied to different problems such as speech recognition [51], text classification [72], video processing [37, 106], and natural language processing [117]. In theory, long term dependencies can be successfully captured by RNNs. In practice, however, RNN training is not robust against vanishing and exploding gradients due to back-propagation through time. To avoid this problem, several variants of RNN have been proposed, such as long-short term memory (LSTM) [58] networks. LSTM is a gated RNN structure where the hidden state of the next time step depends on a summation rather than a matrix multiplication. The additive nature of LSTM overcomes the vanishing and exploding gradient problem. LSTM has also been applied to biomedical informatics [22, 24] with promising results. Even though it is not explicitly stated, LSTM assumes that the time gap between the elements of the time sequence is uniform. The sampling frequency of an audio file, the time gap between measurements
T-LSTM takes the elapsed time into account to adjust the memory content of the LSTM unit by decomposing the memory into long- and short-term components and applying a time decay to the short-term component. The amount of decay is determined by the elapsed time, such that the longer the elapsed time, the smaller the effect of the previous memory on the current output. The proposed architecture is then used for the patient subtyping task, which is essentially a clustering problem. To be able to cluster patients, a single representation of each patient's records is learned by using a T-LSTM auto-encoder. The proposed T-LSTM auto-encoder maps temporal sequences of patients to a representation by capturing the dependencies between consecutive records in the presence of time irregularities. Experiments were conducted for supervised and unsupervised tasks on synthetic and real world datasets to examine the performance of T-LSTM. Before introducing the details of the proposed approach, a literature review is presented in the next section.

4.2 Literature Review

4.2.1 RNNs for Biomedical Informatics

There have been several studies applying deep learning methods to biomedical informatics data. Patient subtyping and phenotyping are two prominent research topics in healthcare where learning a powerful representation of a patient is crucial. Patient data is very complicated and heterogeneous; therefore, it is challenging to cluster patients and to do predictive analysis based on their medical records. Supervised and unsupervised machine learning approaches usually rely on a single discriminative representation of data points to achieve a good performance. In that sense, deep learning offers a representation learning strategy that usually leverages large amounts of data. Another advantage of deep models is that they start from coarse representations of the raw input and obtain finer representations that can summarize the data very well by capturing the spatial and temporal patterns. RNN and its variants intuitively learn a representation of the sequence at each time step by considering temporal dependencies. As mentioned earlier, learning a representation from temporal sequences is an important step in biomedical informatics. For this purpose, Pham et al. proposed an end-to-end deep network to read EHRs, save patient history, infer the current state, and predict the future [96]. Their approach, called "DeepCare", used LSTM for multiple admissions of a patient, and also addressed the time irregularities between consecutive admissions by modifying the forget gate of the standard LSTM unit. A vector representation was learned for each admission and fed into the LSTM network. The proposed T-LSTM approach in this study, however, adjusts the memory cell by using the elapsed time. The study [96] focused on supervised problem settings. There are other studies in the literature using RNNs for supervised tasks. For instance, Esteban et al. [39] proposed an approach to predict whether a patient suffering from kidney failure would survive. An RNN was used to predict several conditions related to kidney failure within predetermined time windows. LSTM was used to recognize patterns in multivariate time series of clinical measurements [80]. In this case, learning a representation for clinical time series was posed as a multi-label classification problem, and an LSTM with a fully connected output layer was used for this multi-label classification problem. Choi et al. [30] aimed to mimic how physicians make decisions.
An RNN was used for this purpose with the patient's past visits in reverse order. The authors also proposed a different way of using RNNs, with two RNNs providing visit-level and variable-level attention mechanisms. Their goal was to predict diagnoses by first considering the recent visits of the patient and determining which visits and which events are worth paying attention to. Another study focusing on predicting the diagnosis of the patient along with the time duration until the next visit was presented in [28]. In their study, the elapsed time was incorporated as an attribute concatenated to the input rather than as a decay factor, and another variant of RNN, called GRU, was utilized. On the other hand, Che et al. [22] aimed to learn patient similarities directly from temporal EHR data for personalized predictions of Parkinson's disease. A GRU unit was used to encode the similarities between the sequences of two patients, and dynamic time warping was used to measure the similarities between temporal sequences. A different approach for representation learning from EHR data was introduced in [29]. Their method, called Med2Vec, was proposed to learn a representation for both medical codes and patient visits from large scale EHRs. The learned representations were designed to be interpretable. The authors utilized a multi-layer perceptron to generate a visit representation for each visit vector.

4.2.2 Auto-Encoder Networks

We also would like to briefly mention related studies on auto-encoder networks. In the patient subtyping task, we do not have any labels that can be used in a standard supervised setting. Besides, LSTMs are often used for supervised tasks in the literature, as summarized above. Therefore, we need an approach where a powerful representation of the temporal sequence can be learned without any supervision. The most popular unsupervised way of utilizing deep networks is the auto-encoder, which obtains a single representation of the raw input by minimizing a reconstruction error. For instance, LSTM auto-encoders were used to learn representations for video sequences [106]. The authors investigated the performance of the learned representations on supervised tasks and reported an increase in classification accuracy. Auto-encoders were also used to generate a different sequence by using the representation learned in the encoder part. For instance, one RNN encoded a sequence of symbols into a vector, and the decoder RNN mapped this single representation into another sequence [27]. Cho et al. [27] showed that their proposed approach can interpret the input sequence semantically and can learn a meaningful representation of it syntactically.

Figure 4.2 Illustration of the time-aware long short-term memory (T-LSTM) unit and its application to medical records. Green boxes indicate networks and yellow circles denote point-wise operators. T-LSTM takes the input record and the elapsed time at the current time step. T-LSTM decomposes the previous memory into long- and short-term components and utilizes the elapsed time (∆t) to discount the short-term effects.

4.3 Time-Aware Long Short Term Memory

In this section, background information about LSTM networks, the proposed T-LSTM architecture, and the T-LSTM auto-encoder for patient subtyping are introduced.

4.3.1 Long Short-Term Memory (LSTM)

Commonly used standard feed-forward networks, such as the multi-layer perceptron and convolutional neural networks, take an input, extract levels of abstraction, and predict the output.
From a graph-theoretic viewpoint, these networks are directed graphs without cycles and thus without a memory. Therefore, feed-forward networks are not suitable for sequential data, since the dependencies between consecutive elements of a sequence need to be modeled. On the other hand, recurrent neural networks (RNNs) are deep network architectures where the connections between hidden units form a directed cycle, in other words a feedback mechanism. Thus, RNNs naturally construct an internal memory containing information from previous hidden states. Consequently, RNNs are applicable to problems where the system needs to store and update information [15]. Before the popularity of RNNs, Hidden Markov Models (HMMs) were proposed for analyzing temporal sequences. The fundamental difference between RNNs and HMMs is the Markov assumption, which RNNs do not make. Another advantage of RNNs is being able to process variable-length sequences. In principle, RNNs can keep long-term information about past inputs in memory; however, optimizing for long-term dependencies degrades the learning process because of the vanishing and exploding gradient problems. RNNs suffer from this problem when the gradient becomes nearly zero or grows too large because of the chained multiplications in the gradient computation. Vanishing and exploding gradients can also be encountered in feed-forward networks; however, this problem becomes more prominent in RNNs because of back-propagation through time (BPTT). For instance, non-linear activation functions such as the sigmoid σ and tanh have almost flat regions that make the gradient nearly zero, and the gradients in BPTT involve multiplying gradients from previous time steps. As a result, even one nearly zero gradient in the chain does not let the model be updated properly. To be able to incorporate the long-term dependencies without hampering the optimization process, variants of RNNs have been proposed, such as Long Short-Term Memory (LSTM) [58]. A standard LSTM unit comprises forget, input, and output gates, and a cell memory. The forget gate controls the information flow from the previous cell memory. While the forget gate discards some part of the history, the input gate writes new information about the current time step. The output gate works as a filter to control what to output, and the cell memory keeps the history collected throughout the sequence, updated at each time step. One missing piece in this architecture is a mechanism to incorporate irregular elapsed times into the system. Besides, time irregularity is frequently encountered in real world temporal data, as in EHR datasets. Therefore, we propose a novel LSTM architecture, called Time-Aware LSTM (T-LSTM), where the time lapse between successive records is included in the network architecture.

4.3.2 Proposed Time-Aware LSTM (T-LSTM)

Some temporal sequences do not follow regular time gaps by nature, such as patient records, where the frequency and the number of patient reports are unstructured. Another reason for non-uniform elapsed times in longitudinal data can be missing information. In such cases, missing elements of a sequence introduce irregularity, and this might alter the pattern in the temporal changes that we want to leverage for predictive tasks. T-LSTM is proposed to incorporate the elapsed time information into the standard LSTM architecture to be able to capture the temporal dynamics of sequential data with time irregularities.
The proposed T-LSTM architecture is shown in Figure 4.2, where the input sequence is represented by the temporal patient records. The most important component of the T-LSTM architecture is the subspace decomposition applied on the cell memory. At each time step, the cell memory of the previous hidden state is decomposed into a short-term memory (C^S_{t-1}) and a long-term memory (C^L_{t-1} = C_{t-1} - C^S_{t-1}), which represent fast and slow changes in the patient records, respectively. Then a non-increasing function of the elapsed time, which transforms the time lapse into an appropriate weight (g(∆t)), is used to discount the short-term component (Ĉ^S_{t-1} = C^S_{t-1} * g(∆t)) before adding the long- and short-term components back together (C*_{t-1} = C^L_{t-1} + Ĉ^S_{t-1}). Note that this decomposition is data-driven and the parameters are learned simultaneously with the rest of the network parameters by back-propagation. There is no requirement on the activation function to be used in the decomposition network. In fact, several options were tried, but we did not observe a drastic difference in the prediction performance of the T-LSTM unit. The basic idea behind T-LSTM is to adjust the cell memory with respect to the elapsed time such that the amount of discount is larger if there is a big gap between consecutive elements of the sequence. For instance, if there are months or years between two consecutive reports of a patient, it means that no new information was recorded for a long time for that patient. In this case, if the current hidden state, and eventually the output, rely on the information from years ago, this might be misleading, because the patient might have developed new conditions during that time gap. However, the overall temporal pattern should also be preserved. For this reason, we do not apply the discount to the cell memory itself, but to a portion of it. The overall T-LSTM formulation is provided below.

C^S_{t-1} = tanh(W_d C_{t-1} + b_d)   (Short-term memory)
Ĉ^S_{t-1} = C^S_{t-1} * g(∆t)   (Discounted short-term memory)
C^L_{t-1} = C_{t-1} - C^S_{t-1}   (Long-term memory)
C*_{t-1} = C^L_{t-1} + Ĉ^S_{t-1}   (Adjusted previous memory)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (Forget gate)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (Input gate)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (Output gate)
C̃ = tanh(W_c x_t + U_c h_{t-1} + b_c)   (Candidate memory)
C_t = f_t * C*_{t-1} + i_t * C̃   (Current memory)
h_t = o_t * tanh(C_t)   (Current hidden state)

where x_t represents the current input, h_{t-1} and h_t are the previous and current hidden states, and C_{t-1} and C_t are the previous and current cell memories. {W_f, U_f, b_f}, {W_i, U_i, b_i}, {W_o, U_o, b_o}, and {W_c, U_c, b_c} are the network parameters of the forget, input, and output gates and the candidate memory, respectively. {W_d, b_d} are the network parameters of the subspace decomposition. The dimensionality of the parameters is determined by the input, output, and the chosen hidden state dimensionality. ∆t is the elapsed time between x_{t-1} and x_t, and g(·) is a heuristic decaying function such that the larger the value of ∆t, the smaller the effect of the short-term memory. Different types of monotonically non-increasing functions can be chosen for g(·) depending on the measurement unit of the time durations. For instance, some datasets might have a fixed time measure, such as seconds, throughout the sequence, whereas in other cases the elapsed times present in the sequence might span seconds, minutes, and hours. In the latter case, we need to decide on one type of unit, say seconds. In this case, when there are hours between two consecutive elements, the elapsed time will be huge in seconds.
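To make the formulation above concrete, a minimal forward-step sketch of a T-LSTM unit is given below. This is illustrative NumPy pseudocode only, under assumed parameter names, shapes, and a dictionary-based interface; it is not the reference implementation (the TensorFlow code released with this work is cited in the experiments section).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def g(delta_t):
        # Heuristic, monotonically non-increasing time decay (the log form suits large
        # elapsed times; 1/delta_t can be used when elapsed times are small).
        return 1.0 / np.log(np.e + delta_t)

    def tlstm_step(x_t, h_prev, C_prev, delta_t, p):
        # p is a dict of parameters, e.g. p['Wf'], p['Uf'], p['bf'], p['Wd'], p['bd'], ...
        # Illustrative shapes: x_t (d,), h_prev and C_prev (r,), W* (r, d), U* and Wd (r, r), b* (r,).
        CS_prev = np.tanh(p['Wd'] @ C_prev + p['bd'])                    # short-term memory
        CS_hat = CS_prev * g(delta_t)                                    # discounted short-term memory
        CL_prev = C_prev - CS_prev                                       # long-term memory
        C_star = CL_prev + CS_hat                                        # adjusted previous memory
        f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['bf'])          # forget gate
        i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['bi'])          # input gate
        o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['bo'])          # output gate
        C_tilde = np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])    # candidate memory
        C_t = f * C_star + i * C_tilde                                   # current cell memory
        h_t = o * np.tanh(C_t)                                           # current hidden state
        return h_t, C_t

The only difference from a standard LSTM step is that the gates act on the adjusted memory C_star rather than directly on the previous cell memory.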
Empirically, we determined that g(∆t) = 1/∆t is appropriate for datasets with small elapsed times, and g(∆t) = 1/log(e + ∆t) [96] is preferred for datasets with large elapsed times. In T-LSTM, one of the reasons behind adjusting the memory cell instead of the forget gate is to avoid altering the current input's effect on the current output. The current input runs through the forget gate, and the information coming from the input plays a role in determining how much memory content we should keep from the previous cell. Therefore, we chose not to modify the forget and input gates. Another idea to handle the time irregularity could be to impute the data by sampling new records between two consecutive time steps and then applying LSTM on the augmented data. However, when there is a huge gap between two consecutive elements, we would need to sample many points. In the case of an EHR dataset, imputing patient records in such a fashion would not be plausible. Patient records comprise detailed information, and it is not possible to ensure that the imputed records reflect reality. In the next section, we present how T-LSTM is used in an auto-encoder setting for patient subtyping.

4.3.3 Patient Subtyping with T-LSTM Auto-Encoder

Patient subtyping is posed as a clustering problem since we do not have any prior information about the patient groups in the cohort. Clustering the raw patient sequences directly is not feasible. We propose to learn a single representation for each sequence and apply a standard clustering algorithm, such as k-means, on the representations to obtain patient groups. Auto-encoders provide an unsupervised way to directly learn a mapping from the original data [14]. In the literature, LSTM auto-encoders, where the encoder and decoder parts are comprised of LSTM networks, have been used to encode sequences such as sentences [133]. In this study, a T-LSTM auto-encoder is introduced to learn an effective single representation of the sequential records of a patient. The proposed auto-encoder has T-LSTM encoder and T-LSTM decoder units with different parameters, learned jointly to minimize the reconstruction error. The proposed auto-encoder can capture the long- and short-term dependencies by incorporating the elapsed time into the system and learns a single representation that can be used to reconstruct the input sequence. In Figure 4.3, a single-layer T-LSTM auto-encoder mechanism is shown for a sequence with three elements [X_1, X_2, X_3]. In this architecture, the T-LSTM decoder takes the hidden state and the cell memory of the T-LSTM encoder at the end of the input sequence as its initial state and memory. Using a recurrent neural network in an auto-encoder setting is also known as sequence-to-sequence learning. The input and the elapsed time at the first time step of the decoder are set to zero, and the first reconstruction output (X̂_3) is assumed to be the last element of the input sequence. Thus, the output sequence is reversed, as suggested in [106]. This is a common practice in sequence-to-sequence learning, since reversing the order introduces short-term dependencies, which is known to facilitate the optimization. When the parameters of the T-LSTM auto-encoder are learned based on the reconstruction error E_r given in Equation 4.3.1, one forward pass of the T-LSTM encoder yields the learned representation as the hidden state at the end of the sequence.
E_r = \sum_{i=1}^{L} ||X_i - X̂_i||_2^2,   (4.3.1)

where L is the length of the sequence, X_i is the i-th element of the input sequence, and X̂_i is the i-th element of the reconstructed sequence. The hidden state at the end of the sequence carries concise information about the input, such that the original sequence or a target sequence can be reconstructed from it. In other words, the representation learned by the encoder is a summary of the input sequence [27]. The number of layers and the dimensionality of the representation can be determined based on the complexity of the problem. According to our observation, learning a representation of lower dimensionality than the input requires a higher model capacity; therefore, we preferred to use a two-layer T-LSTM auto-encoder in our experiments.

Figure 4.3 Finding patient groups with a single-layer T-LSTM Auto-Encoder. Blue arrows denote the cell memory and black arrows denote the hidden states. After the representations (R_i, i = 1, 2, ..., 8) are learned for the population, we can cluster the patients and obtain subtypes for each group.

After a single representation for each patient is obtained, patients can be grouped by a clustering algorithm such as k-means. Since we do not have any information or assumptions about the structure of the clusters, we propose to use k-means because of its simplicity. In Figure 4.3, a small illustration of clustering a patient cohort of 8 patients is given. In the figure, the learned representations are denoted by R. If R has the capability to capture the patient characteristics from the temporal medical records, then the clustering algorithm is expected to group patients with similar properties together. This consequently provides subtypes in a patient cohort. When there is a new patient, the learned T-LSTM encoder is used to retrieve the representation of the patient, and the subtype of the patient can be found by obtaining the cluster whose centroid gives the minimum distance to the new patient's representation. As such, the learned representations could be used for supervised tasks as well.

4.4 Experiments

To investigate the performance of the proposed T-LSTM and T-LSTM auto-encoder, experiments on synthetic and real world datasets were conducted. Performance comparisons between T-LSTM, MF1-LSTM, MF2-LSTM [96], LSTM, and logistic regression were made in supervised and unsupervised settings on synthetic and real world datasets. MF1-LSTM and MF2-LSTM denote the Modified Forget Gate LSTM approaches adopted from [96]. MF1-LSTM multiplies the output of the forget gate by g(∆t), i.e., f_t = g(∆t) * f_t, whereas MF2-LSTM utilizes a parametric time weight such that f_t = σ(W_f x_t + U_f h_{t-1} + Q_f q_{∆t} + b_f), where q_{∆t} = (∆t/60, (∆t/180)^2, (∆t/360)^3) when ∆t is measured in days, similar to [96]. The application of the T-LSTM auto-encoder to patient subtyping is presented on a real world dataset (PPMI), and the subtyping results are discussed. T-LSTM was implemented in TensorFlow (the implementation is available at https://github.com/illidanlab/T-LSTM), and the mini-batch stochastic Adam optimizer was used during the experiments. All the weights were learned simultaneously, and the same network settings and parameters were used for all the deep methods for a fair comparison. Therefore, a fixed number of epochs was chosen during the experiments instead of using a stopping criterion.
Since the longitudinal patient data used in the experiments has variable-length sequences, batches with the same sequence length were generated instead of padding the original sequences with zeros to make every sequence the same length. The main reason for avoiding zero padding, which is a commonly used practice, was that the patient data generally contains sparse vectors representing diagnoses and other medical features, where zeros carry a valid meaning rather than indicating absence. Therefore, we did not add extra zeros to the patient sequences. Note that in this study, we did not use the publicly available large-scale ICU database called MIMIC [66]. Since it is challenging to find public EHR datasets due to privacy concerns, MIMIC is an important source of patient data for biomedical informatics research. However, the MIMIC database contains clinical information recorded at the end of a patient's hospital stay. As a result, the majority of the patients in the MIMIC database have only one or two records. Since the patients do not have enough temporal variation, we could not use the MIMIC database to test the performance of the proposed T-LSTM network.

4.4.1 Synthetic Dataset

4.4.1.1 Supervised Experiment

In this experiment, synthetically generated EHR data (available at http://www.emrbots.org/) was used for the classification task. The aforementioned synthetic data has records of up to 100,000 patients with lab results, diagnoses, and start and end dates of the admissions. Similar to real world EHR datasets, a unique patient ID is assigned to each patient. We refer to [69] for further details of the data generation process. Although the dataset is synthetically generated, it has characteristics similar to real EHR data. The proposed T-LSTM was used to classify healthy patients and patients with Diabetes Mellitus. For this binary classification task, 6,730 patients were sampled with an average of 4 admissions. Input features were multi-hot vectors containing the diagnoses given in the corresponding admission, with a vocabulary size of 529. Since this task predicts a binary class label for patient sequences, T-LSTM was used in the standard supervised way, where the hidden state at the end of the patient sequence is mapped to a binary label, rather than in the auto-encoder setting. For this task, single-layer T-LSTM, MF1-LSTM, and MF2-LSTM networks and traditional logistic regression were tested to compare performance based on the area under the ROC curve (AUC) metric for 50 epochs. In this experiment, the numbers of hidden and softmax layer neurons were chosen as 1,028 and 512, respectively. In the logistic regression experiments, admissions were aggregated for each patient without incorporating the elapsed time. We also tried to incorporate the elapsed time as a weight by using the same non-increasing function used in T-LSTM during the aggregation of admissions. However, this approach did not improve the performance in our case. The results are summarized in Table 4.1. In summary, T-LSTM was observed to provide better AUC performance compared to the baseline approaches. Logistic regression, which is a commonly used method in biomedical informatics applications, performed very poorly in our case. The way the sequential data is represented could be improved further for logistic regression, but aggregation of the admissions for each patient did not perform well for this task.
Table 4.1 Supervised synthetic EHR experimental results showing the average AUC of testing on 10 different splits. The training and testing proportion was chosen as 70% to 30%.

Method      Avg. Test AUC   Stdev.
T-LSTM      0.91            0.01
MF1-LSTM    0.87            0.02
MF2-LSTM    0.82            0.09
LSTM        0.85            0.02
LR          0.56            0.01

Supervised experiments showed that LSTM networks can be more helpful than logistic regression for learning the temporal patterns of EHR data. In addition, modifying the cell memory gave a better classification performance. According to our observation, MF1-LSTM and MF2-LSTM had better and sometimes comparable results relative to the traditional LSTM for the tasks in our experiments. We did not observe any bias regarding the sequence lengths during the experiments.

4.4.1.2 Unsupervised Experiment

In this experiment, we investigate the expressive power of the representation learned by the T-LSTM auto-encoder. For this purpose, a synthetic dataset was randomly generated and the clustering results were evaluated. Since we know the ground truth of the synthetic data, we computed the Rand index (RI) of the clustering, given in Equation 4.4.1 [83], to observe the discriminative power of the learned representations. A large Rand index indicates that the learned representations allow the clustering to be close to the ground truth.

RI = (TP + TN) / (TP + FP + FN + TN),   (4.4.1)

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. Note that 0 ≤ RI ≤ 1. The results on a synthetic dataset containing 4 clusters generated from a mixture of normal distributions with four different means and the same covariance are reported. A data point in the synthetic dataset was a sequence of vectors, and the values of the sequences were increasing with time. Some of the elements in the sequences were discarded randomly to introduce unstructured elapsed times and obtain variable sequence lengths of sizes 4, 6, 18, 22, and 30. The dimensionality of the vectors was 5, and the dimension was reduced to 2 by the T-LSTM auto-encoder to be able to plot the representations in a 2-D space. The hidden state dimension of the second-layer T-LSTM encoder was chosen as 2; therefore, the learned representations were 2-dimensional vectors. The learned representations were clustered by k-means, where k was set to 4. Representation learning was repeated 10 times with different initializations of k-means, and the average Rand index of the clustering is reported for the T-LSTM, LSTM, and MF1-LSTM auto-encoders in Table 4.2. The non-increasing heuristic function was chosen as g(∆t) = 1/log(e + ∆t). For this experiment, we compared the performance of T-LSTM, MF1-LSTM, and LSTM, excluding MF2-LSTM. Since the time gaps of the data used in this experiment do not relate to an actual time measurement such as days, MF2-LSTM was excluded.

Table 4.2 Average Rand index of k-means over 10 runs. The T-LSTM auto-encoder outperforms the LSTM and MF1-LSTM auto-encoders.

Method      Mean RI   Std
T-LSTM      0.96      0.05
MF1-LSTM    0.85      0.13
LSTM        0.90      0.09

Table 4.2 shows that T-LSTM outperforms the baselines and that the T-LSTM auto-encoder can learn the underlying structure of input sequences with varying elapsed times, such that the representations obtained by the T-LSTM encoder can be clustered. In this example, the performance of MF1-LSTM was found to be better than that of LSTM on average.
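To make the evaluation protocol concrete, the steps above (cluster the learned 2-D representations with k-means and score the result with the Rand index of Equation 4.4.1) can be sketched as follows. This is an illustrative sketch only: the use of scikit-learn and the function names are assumptions rather than the original implementation, and the Rand index is computed with the standard pair-counting convention.

    import numpy as np
    from itertools import combinations
    from sklearn.cluster import KMeans

    def rand_index(labels_true, labels_pred):
        # Pair-counting Rand index: RI = (TP + TN) / (TP + FP + FN + TN).
        tp = tn = fp = fn = 0
        for i, j in combinations(range(len(labels_true)), 2):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            if same_true and same_pred:
                tp += 1
            elif not same_true and not same_pred:
                tn += 1
            elif same_pred:      # clustered together, but truly in different groups
                fp += 1
            else:                # clustered apart, but truly in the same group
                fn += 1
        return (tp + tn) / (tp + tn + fp + fn)

    def cluster_and_score(R, labels_true, k=4, seed=0):
        # R: (num_sequences, 2) representations produced by the T-LSTM encoder.
        labels_pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(R)
        return rand_index(labels_true, labels_pred)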
A visual example of one of the trials is also shown in Figure 4.4, where the 2-dimensional representations obtained by the three approaches are plotted.

Figure 4.4 Illustration of the clustering results. Different colors denote ground-truth assignments of different clusters. The T-LSTM auto-encoder learns a mapping for the sequences such that 4 separate groups of points can be represented in the 2-D space.

In Figure 4.4, different colors denote ground-truth assignments of different clusters. Representations learned by T-LSTM provided more compact groups in the 2-D space, leading to a more accurate clustering result compared to the standard LSTM and MF1-LSTM. The change in the objective values of T-LSTM, MF1-LSTM, and LSTM with respect to the number of epochs is also compared in Figure 4.5 for the trial illustrated in Figure 4.4. It is observed that the modifications related to the time irregularity do not affect the convergence of the original LSTM network in a negative way.

Figure 4.5 Change in the objective values of T-LSTM, MF1-LSTM, and LSTM over 500 epochs. It is observed that the modifications related to the time irregularity do not deteriorate the convergence of the original LSTM network.

4.4.2 Parkinson's Progression Markers Initiative (PPMI) Data

In this section, we present experimental results for a real world dataset. The Parkinson's Progression Markers Initiative (PPMI, www.ppmi-info.org) is an observational, longitudinal clinical study comprising evaluations of people with Parkinson's disease (PD), people at high risk, and those who are healthy [35]. PPMI aims to identify biomarkers of the progression of Parkinson's disease. PPMI data is publicly available and contains clinical and behavioral assessments, imaging data, and biospecimens; therefore, PPMI is a unique archive of PD [35]. As with many EHRs, PPMI is a longitudinal dataset with unstructured elapsed times. In our experiments, we used the pre-processed PPMI data of 654 patients given in [22]. Che et al. [22] collected patients with Idiopathic PD or non-PD, imputed missing values, used a one-hot feature representation for categorical values, and encoded data abnormalities as 1 and 0. As a result, the dataset we used has 15,636 records of 654 patients with an average of 25 sequences (the minimum sequence length is 3). The authors of [22] also categorized the data into features and targets, where the features are related to patient characteristics and the targets correspond to the progression of PD. A total of 319 input features consist of motor symptoms/complications, cognitive functioning, autonomic symptoms, psychotic symptoms, sleep problems, depression symptoms, and the hospital anxiety and depression scale. A total of 82 targets are related to motor signs, motor symptoms, cognition, and other non-motor factors [22]. A summary of the PPMI data characteristics used in this study can be found in Table 4.3. As can be seen in Table 4.3, the elapsed time was measured in months. Gaps from 1 month to nearly 24 months between successive records of patients were encountered in the dataset.

Table 4.3 Details of the PPMI data used in this study. The elapsed time encountered in the data is measured in months and varies between 1 month and nearly 24 months. Here, the elapsed time interval is not the recording interval of the PPMI study, but the elapsed times seen in the records of individual patients.
Number of Patients          654
Elapsed Time Interval       [1, 26]
Average Sequence Length     25
Feature Dimensionality      319
Target Dimensionality       82

The following experiments were conducted on the PPMI data to show the performance of the proposed subtyping approach.

4.4.2.1 Target Sequence Prediction

In this experiment, T-LSTM was used to predict the target sequence of each patient. For this purpose, we divided the data into different train (70%)-test (30%) splits and report the mean square error (MSE) between the original target sequence and the predicted target sequence. Average MSEs of 10 different train-test splits for T-LSTM, LSTM, MF1-LSTM, and MF2-LSTM are given in Table 4.4. The same step size and number of epochs were used for all the methods. The non-increasing heuristic function of the elapsed time was chosen as g(∆t) = 1/log(e + ∆t) for the PPMI data.

Table 4.4 Average mean square error (MSE) over 10 different train-test splits for T-LSTM, LSTM, MF1-LSTM, and MF2-LSTM. T-LSTM yielded a slightly better result than the standard LSTM in the presence of the unstructured time gaps.

Method      Mean MSE   Std
T-LSTM      0.51       0.017
MF1-LSTM    0.53       0.017
MF2-LSTM    0.50       0.018
LSTM        0.51       0.012

We also investigated the target features on which T-LSTM performed the best. The commonly encountered target features for which T-LSTM provided lower MSE than LSTM, MF1-LSTM, and MF2-LSTM are reported in Table 4.5.

Table 4.5 Some common target features from the PPMI dataset on which T-LSTM performed better than LSTM and MF1-LSTM during 10 trials. These target features are mainly related to the effects of Parkinson's disease on muscle control.

Code        Name
NP3BRADY    Global spontaneity of movement
NP3RIGRU    Rigidity - RUE (Right Upper Extremity)
NP3FTAPR    Finger Tapping Right Hand
NP3TTAPR    Toe tapping - Right foot
NP3PRSPR    Pronation-Supination - Right Hand
NP3HMOVR    Hand movements - Right Hand
NP3RIGN     Rigidity - Neck
NP2DRES     Dressing
PN3RIGRL    Rigidity - RLE (Right Lower Extremity)
DFBRADYP    Bradykinesia present and typical for PD
NP3RTARU    Rest tremor amplitude - RUE
NP3PTRMR    Postural tremor - Right Hand
MCATOT      MoCA Total Score

The main observation about the target features in Table 4.5 is that they are related to the effects of Parkinson's disease on muscle control, such as finger tapping, rigidity, and hand movements. In addition, T-LSTM predicted the target values of Bradykinesia, which encompasses several of the problems related to movement, and the MoCA (Montreal Cognitive Assessment) Total Score, which assesses different types of cognitive abilities, with lower error than the other methods. This result showed that the reported target features are sensitive to elapsed time irregularities and that discounting the short-term effects by the subspace decomposition of the memory cell helps to alleviate this sensitivity.

4.4.2.2 Patient Subtyping of PPMI Data

In the next experiment, the T-LSTM auto-encoder was used to obtain subtypes of the patients in the PPMI dataset. The T-LSTM encoder was used to learn a representation from the input feature sequence of each patient, and the T-LSTM decoder generated the target sequence. Parameters of the auto-encoder were learned to minimize the squared error between the original target sequence and the predicted target sequence. The learned representations were used to cluster the patients by the k-means algorithm, as discussed before. Since we did not know the ground truth for the clustering, a statistical analysis was conducted to assess the subtyping performance.
For this purpose, the clustering results were statistically analyzed at the 6-year follow-up of the PPMI study. Features including demographics, motor severity measures such as the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) and Hoehn and Yahr staging (H&Y), non-motor manifestations such as depression, anxiety, cognitive status, and sleep disorders, imaging assessments such as DaTScan, as well as cerebrospinal fluid (CSF) biomarkers were taken into account. In order to interpret the clustering results in terms of subtyping, clusters were compared using the Chi-square test for categorical features, the F-test for normally distributed continuous features, the Kruskal-Wallis test for non-normal continuous features, and Fisher's exact test for high-sparsity features. According to previous Parkinson's disease studies, if the p-values of the aforementioned features are less than 0.05, a significant group effect is considered for the associated features [43]. Thus, if a method obtains many features with small p-values, it indicates that the method provides a more sensible patient subtyping result. We tried several k values for the k-means algorithm. We often observed that there were two main clusters; therefore, we report the clustering results for k = 2. We conducted several tests with different parameters. According to our observation, LSTM did not provide an adequate number of features with p-values less than 0.05, and most of the patients were generally grouped into one cluster. In Table 4.6, features with small p-values and the cluster means of those features are presented for T-LSTM, MF1-LSTM, and MF2-LSTM.

Table 4.6 Results of the statistical analysis for T-LSTM, MF1-LSTM, and MF2-LSTM. DaTScan1 corresponds to DaTScan SBR-CAUDATE RIGHT, DaTScan2 is DaTScan SBR-CAUDATE LEFT, and DaTScan4 is DaTScan SBR-PUTAMEN LEFT.

T-LSTM
  Feature          P-Value         Cluster1 Mean   Cluster2 Mean
  BJLO             9.51 x 10^-8    16.5            24.7
  MoCA             0.001           40.0            41.2
  DaTScan1         0.042           2.29            2.07
  DaTScan2         0.027           2.31            2.08
  DaTScan4         0.001           1.4             1.1
MF1-LSTM
  CSF-Total tau    0.007           87.9            46.72
  MoCA             2.16 x 10^-17   47.5            41.05
  SDM              0.005           58.5            41.5
MF2-LSTM
  HVLT-Retention   0.03            0.84            0.83
  SDM              0.007           36.61           41.68

As can be seen from the table, T-LSTM had more discriminative features than MF1-LSTM and MF2-LSTM. In Table 4.6, a high cluster mean indicates that the symptoms of the corresponding feature are more severe for that cluster, and PD patients have a lower cluster mean for the DaTScan features. Note that one of the observed features of T-LSTM in Table 4.6 is MoCA, which was also predicted better by T-LSTM in the target sequence prediction experiment. Finally, we illustrate the patient subtyping results of T-LSTM with a heat map in Figure 4.6. In this figure, shades of red represent cluster means that are higher than the overall mean of the patients, and shades of blue show lower mean values for the corresponding feature, with p-value < 0.05. Subtypes and the features which are significant for each subtype can be observed from the heat map. For instance, DaTScan features were found to be significant for subtype I, whereas subtype II was defined by the BJLO (Benton Judgement Line Orientation) and MoCA features. Note that the dataset contains healthy subjects as well.

Figure 4.6 Heat map illustration of the patient subtyping results of T-LSTM for two clusters. Light red represents a cluster mean that is higher than the overall mean of the patients, and shades of blue show lower mean values for the corresponding feature, with p-value < 0.05.
It is known that PD patients have lower DaTScan SBR values than healthy subjects [125]. Hence, we can conclude from Figure 4.6 that subtype II can be considered PD patients. We can also observe from Figure 4.6 that the cluster means of BJLO and MoCA are very low (darker shades of blue) for subtype I compared to subtype II.

4.5 Summary

In this study, time-aware LSTM (T-LSTM) was proposed to improve LSTM performance when there is elapsed time irregularity in the temporal sequence, whereas the traditional LSTM implicitly assumes a uniform time gap throughout the sequence. T-LSTM does not make any assumption about the elapsed time unit, so the time gap does not have to be measured in days or years; thus, it can be adopted by other domains dealing with different types of sequences. T-LSTM adjusts the previous memory content of an LSTM unit by a decaying function of the elapsed time, in such a way that the longer the time lapse, the smaller the influence of the previous memory content on the current output. The proposed T-LSTM was tested for supervised and unsupervised tasks on synthetic and real world datasets. Patient subtyping, which can be defined as clustering sequential patient records, was analyzed on a publicly available real world dataset called the Parkinson's Progression Markers Initiative (PPMI). For subtyping, the T-LSTM auto-encoder was used to learn powerful representations of the temporal patient data, and the learned representations were used to cluster the patient population.

Chapter 5 Decoupled Memory Gated Recurrent Network

In Chapter 4, a modified long short-term memory (LSTM) architecture was introduced to address the patient subtyping task. The subtyping task is essentially a patient clustering problem, where the data points are historical medical records. In this problem, the temporal pattern in a patient's medical records needs to be efficiently encoded into a vector such that the time sequences of patients can be clustered into different groups using an off-the-shelf clustering algorithm. In addition to the temporal pattern, elapsed time irregularities in health records play an important role in clinical decision making. For this reason, in Chapter 4, we proposed to integrate the elapsed time into the standard LSTM unit. In particular, the cell memory of the LSTM unit was decomposed into long- and short-term components, and the short-term memory was discounted using a time-decaying weight. Empirical analysis of T-LSTM provided evidence that further modifications of the internal memory of LSTM can improve the predictive performance for irregular time series data. This observation is particularly important for healthcare tasks, since a patient's medical history constitutes an irregular time series. In this chapter, the internal memory of RNNs and external memory networks are further investigated. Moreover, a new gated recurrent model with a modified internal memory is proposed to extend the T-LSTM approach proposed in Chapter 4. The proposed model, the decoupled memory gated recurrent network (DM-GRN), addresses an important potential issue of the T-LSTM model. When the actual elapsed time between consecutive time steps is not known, T-LSTM does not modify the cell memory and behaves like a standard LSTM model. On the other hand, the absence of elapsed time information does not necessarily indicate a uniform elapsed time. The proposed DM-GRN introduces separate short- and long-term memory components to address potential time irregularities without utilizing the actual elapsed time.
In particular, the new architecture aims to alleviate the vague distinction between the long- and short-term memories in the standard LSTM unit while employing an internal attention mechanism to update the long-term memory component. Experiments on synthetic and real world data are conducted for quantitative and qualitative evaluations of the proposed architecture. Applications of DM-GRN to healthcare and traffic prediction are discussed.

5.1 Introduction

Artificial intelligence and machine learning have been in the limelight, and the interest continues to grow. Companies prevalently prefer data-driven models that offer effective ways to harness their data and, more importantly, provide predictive capability to improve their services. As a consequence, data has become an important asset for designing reliable learning models. Depending on the application domain, data is obtained via various kinds of acquisition techniques. For instance, traffic speed data is collected with GPS-equipped cars, click-stream data is recorded via e-commerce web pages, transactional data is collected by financial service providers, and digital patient data is stored via electronic health record (EHR) systems. The data acquisition technique shapes the characteristics of the data. One of the most common characteristics is the temporal dependency between consecutive measurements or records. For this reason, a number of applications require the ability to deal with temporal data. In finance and healthcare, in particular, it is imperative to capture the temporal patterns present in datasets, since the temporal dynamics reveal important insight about a subject's behavior. For instance, temporal analysis of a patient's medical history can facilitate a more reliable risk prediction, which consequently assists in clinical decision making.

Figure 5.1 Comparison between some of the auto-regressive models and deep learning techniques for time series analysis.

There are various time series analysis techniques based on different stochastic processes. Prominent traditional time series analysis techniques are autoregressive, integrated, and moving average models, and their combinations (e.g., autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA)) [5]. The traditional linear models are simple to implement and do not require a large amount of data for training. However, their model complexity is not adequate to analyze complex multivariate time series data with elapsed time irregularities. The Hidden Markov Model (HMM), on the other hand, represents the time series as a Markov process, where the future state depends only upon the current state. Consequently, this assumption of the HMM results in a memoryless system. Although the HMM is used for various time series applications, such as speech recognition [98], the Markov property is a very strong assumption that real world datasets often do not follow. As a result, a data-driven, non-linear model with fewer assumptions is necessary to tackle challenging real world time series datasets, such as EHRs. With the advancements in hardware and optimization techniques, deep learning has been a flourishing family of data-driven models yielding impressive performance in many domains. Recurrent neural networks (RNNs) and their variants have been developed to tackle time series data. In addition to RNNs, multi-layer perceptrons [62], convolutional neural networks [62], and deep autoregressive [53] models are utilized for time series analysis.
Properties of traditional time series analysis approaches and deep models used for time series data are summarized in Figure 5.1. The fundamental differences between RNNs and the traditional time series analysis models are their non-linearity and memory properties. RNNs, in theory, can capture long-term dependencies without any Markov or linearity assumptions. However, as discussed in Chapter 4, RNNs are not robust against vanishing gradients due to the non-linear activation functions and back-propagation through time (BPTT). For this reason, variants of the RNN architecture, such as long short-term memory (LSTM) [58] and gated recurrent unit (GRU) [31] networks, are more prevalent. Both architectures introduce gate mechanisms to regulate the information flow from past to future time steps and extract features (e.g., the input, forget, and output gates in LSTM). On the other hand, the key improvement LSTM and GRU have over the traditional RNN is the additive memory component that prevents gradient values from vanishing. LSTM and GRU networks implicitly assume that the time sequence has regular elapsed times. Example time series with regular and irregular elapsed times are shown in Figure 5.2. As seen in Figure 5.2a, the sine wave has a uniform sampling rate. However, in many applications, the elapsed time is not uniform due to several factors, such as missing values and an irregular sampling rate. For instance, the time gap between consecutive patient records varies depending on the frequency of hospital visits or admissions. Since the sampling rate of patient records is not known, EHR data cannot be easily imputed to make the temporal sequence regular. To integrate the elapsed time into the standard LSTM architecture, a modified LSTM architecture, named time-aware LSTM (T-LSTM) [11], was introduced in Chapter 4. T-LSTM, given in Figure 4.2, decomposes the cell memory of the LSTM unit into two components, which permits modifying the memory content without deteriorating the overall temporal pattern stored in the memory.

Figure 5.2 Regular and irregular time series data example: (a) regular time series, (b) irregular time series.

In theory, LSTM proposes two kinds of memory, namely the cell memory and the hidden state. The cell memory aims to maintain long-term patterns, while the hidden state captures short-term changes. However, the mathematical model of LSTM poses a coupled memory, where the hidden state is obtained from the cell memory weighted by the output of a non-linear gate function. Besides, we empirically presented the benefits of decomposing the cell memory into long- and short-term components in T-LSTM. On the other hand, T-LSTM might face an issue when the elapsed time is not available. In such cases, the elapsed time is assumed to be uniform although time irregularity exists due to various reasons, such as noise. Under the uniform elapsed time assumption, T-LSTM simplifies to the standard LSTM unit. In this chapter, a new gated architecture with decoupled memory, named the decoupled memory gated recurrent network (DM-GRN), is proposed as an alternative approach to the T-LSTM model. In the DM-GRN architecture, the memory is constructed as a summation of two components that are designed to focus on different dynamics of the input time series. For this purpose, a new gated unit is designed with two sigmoid gates to extract features from the input and the recurrent component, and two tanh memory layers to construct the short- and long-term memories.
Gate outputs are used to assign weights to the short- and long-term memory components before they are assembled to form the overall memory, or hidden state. The difference between consecutive elements of the time series is used as the input to the forget gate, facilitating an automatic way to capture time irregularities. As a result, the proposed network can still regulate the memory without any elapsed time or sampling rate information. The long-term memory component of the proposed DM-GRN architecture is computed via an attention mechanism used to weight the past hidden states. Inspired by external memory networks [52, 118], a limited number of past time steps are stored in a memory matrix to compute the weighted sum. The main contributions of the DM-GRN architecture are as follows.

• Decoupling the internal memory into long- and short-term memories provides more insight into the different dynamics in the time series data compared with coupled-memory networks.

• Using the difference between consecutive elements as one of the inputs to the network aims to capture the time irregularities inherently, even when the elapsed time is not explicitly provided.

• An internal attention mechanism is proposed to compute the long-term memory component. In particular, a weighted sum of the past hidden states is used instead of considering only the previous time step. A fixed-length time window is used for attention to reduce the complexity of training.

Several synthetic and real-world experiments are conducted, and the results are compared with those of several other gated architectures to empirically validate the benefits of the decoupled memory. In the next sections, a literature review and further details on internal and external memory networks are presented.

5.2 Literature Review

LSTM networks were proposed to tackle the vanishing gradient problem of the standard RNN architecture [58]. The vanishing gradient is avoided by the gated recurrent connections and an extra memory unit, named the cell memory. The main advantage of the LSTM network over standard RNNs is the additive structure of the memory. The cell memory is obtained by a weighted sum, and the weights are determined by the gates. In the literature, gated architectures are reported to perform better than standard RNNs. However, there are also studies discussing the reliability of the gated architectures for tasks with specific properties. Russell et al. [103] investigated and compared the memory properties of RNN, LSTM, and the gated recurrent unit (GRU). One of the important conclusions drawn from their empirical evaluations is the unreliability of LSTM and GRU on fixed delay recall tasks. On the other hand, gated architectures perform better when the task requires a memory to store and update information. This result indicates that the memory due to the feedback loop of the standard RNN is not adequate for writing and deleting information across different time steps. Ablation studies have been conducted to further analyze the importance of each LSTM gate and the cell memory. Levy et al. [78] investigated LSTM networks from a different perspective. The authors argued that the gates offer powerful representations, which stems from the fact that LSTMs are combinations of two recurrent models. It is also empirically shown for several natural language processing (NLP) tasks (e.g., language modeling, question answering, machine translation) that the output gate and the candidate memory have minor contributions compared with the input and forget gates.
The analysis in that paper demonstrates that the main advantage of LSTM is the element-wise weighted sum rather than the non-linear recurrent transitions. In addition, a degradation in performance is reported when the cell memory is ablated. On the other hand, the multiplicative recurrent connections in the candidate memory and in the gates do not have a significant contribution to the representational power. In the literature, there are also efforts to visualize the gate and memory contents of gated architectures to better comprehend the underlying mechanism. For this purpose, Tang et al. [109] proposed a memory visualization approach to observe the behaviors of LSTM and GRU units for automatic speech recognition. In particular, activation patterns can be observed to investigate the ways different RNN architectures encode information. Based on the experimental results, the authors argued that the information encoded by GRU is more distributed than that encoded by LSTM. In addition, the activation values of the GRU hidden state concentrate in [−1, 1], unlike the activation values of the LSTM cell memory. The authors suggested that the constrained activation values of GRU can facilitate model training. It is also reported, based on the visualization of the temporal trace, that LSTM tends to remember longer sequences than GRU. Long- and short-term changes in time series can also be analyzed with signal-level decomposition. One of the popular methods that analyzes the frequency content of a time series is the Fourier transform. On the other hand, the Fourier transform does not learn the frequencies in the time series, but utilizes predefined frequencies. For this reason, it is not well suited to forecasting applications. Neural networks have also been used for signal-level decomposition, which can be learned from the signal. For instance, Godfrey et al. introduced a neural decomposition technique to improve the generalization performance of time series forecasting [49]. The proposed method performs a decomposition similar to the inverse discrete Fourier transform (iDFT); however, the frequencies are learned using sinusoidal activation functions. Compared with RNNs, the neural decomposition of time series is better at handling unevenly sampled data.

5.3 Methodology

In this section, background information about the memory mechanisms in gated recurrent models and external memory architectures is presented, and the proposed approach is introduced.

5.3.1 Memory in Recurrent Networks

Memory is a vital property of the human brain and an indispensable component of computers. Being able to remember past experiences, adding new knowledge, and discarding irrelevant information are the building blocks of memory. Memory is crucial not only for humans and computers, but also for dynamic systems. In particular, the memory in learning problems involving temporal inputs facilitates inferring the time dependencies between consecutive elements, and thus enhances the predictive performance. RNN and its variants offer such a memory due to their feedback loop. In the literature, LSTM and GRU are the most commonly employed RNN architectures for temporal problems. Different variants of LSTM are also available to address specific tasks. For instance, we proposed T-LSTM to address irregular elapsed times in Chapter 4. Mathematical models of LSTM, GRU, T-LSTM, and a T-LSTM variant proposed by Yang et al. [124] are compared in Table 5.1.
The models summarized in the table are known as gated architectures due to their multiple neural layers with sigmoid activation. Common components of the gated architectures are the additive memory and the sigmoid gates that regulate the information flow. The prominent differences between gated architectures are often the number of gates and the definition of the memory. For instance, T-LSTM [11] and its variant [124] modify the cell memory using the elapsed time to apply a time decay to the short-term memory component. T-LSTM decomposes the cell memory using a single non-linear neural layer, whereas the T-LSTM variant represents the short- and long-term memories with two different non-linear neural layers. T-LSTM applies the time decay in a multiplicative fashion, whereas the T-LSTM variant adds the time decay as another neural layer. In summary, memory in recurrent networks can be modified in various ways; however, the crucial point is updating the memory in an additive way. In the following section, memory in gated architectures is investigated in detail.

Table 5.1 Mathematical models of the gated RNN architectures covered in this chapter (Equations 5.3.1-5.3.29).

LSTM [58]:
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (5.3.1)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (5.3.2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (5.3.3)
C̃ = tanh(W_c x_t + U_c h_{t-1} + b_c)   (5.3.4)
C_t = f_t * C_{t-1} + i_t * C̃   (5.3.5)
h_t = o_t * tanh(C_t)   (5.3.6)

GRU [31]:
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)   (5.3.7)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)   (5.3.8)
h̃_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)   (5.3.9)
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t   (5.3.10)

T-LSTM [11]:
C^S_{t-1} = tanh(W_d C_{t-1} + b_d)
Ĉ^S_{t-1} = C^S_{t-1} * g(∆t)
C^L_{t-1} = C_{t-1} - C^S_{t-1}
C*_{t-1} = C^L_{t-1} + Ĉ^S_{t-1}
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
C̃ = tanh(W_c x_t + U_c h_{t-1} + b_c)
C_t = f_t * C*_{t-1} + i_t * C̃
h_t = o_t * tanh(C_t)

T-LSTM Variant [124]:
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
c^short_{t-1} = tanh(W_shortc c_{t-1} + w_shortt ∆t + b_shortc)
c^long_{t-1} = tanh(W_longc c_{t-1} + w_longt ∆t + b_longc)
c^shortnew_{t-1} = tanh(w_shrink ∆t + b_shrink) * c^short_{t-1}
c^new_{t-1} = tanh(w^new_short c^shortnew_{t-1} + w^new_long c^long_{t-1} + b_merge)
c_t = f_t * c^new_{t-1} + i_t * g_t
h_t = o_t * tanh(c_t)

5.3.1.1 Internal Memory of Recurrent Networks

RNN and its variants depict a family of networks that have internal memories. The existence of the memory indicates that this type of network does not follow the Markov property. Thus, recurrent networks can be more flexible and versatile in inferring complicated relationships between the elements of a time sequence. Consequently, the memory component facilitates forecasting and predictive tasks on complicated time series data. The success of RNN models is mainly driven by the memory carried by the recurrent hidden states through time. The hidden state of the current time step of the standard RNN model is computed as follows; note that the bias terms are omitted for simplicity.

h_t = σ(W x_t + U h_{t-1}),   y_t = f(V h_t)   (5.3.30)

where W ∈ R^{r×d}, U ∈ R^{r×r}, and V ∈ R^{c×r} are the network weights, and d, r, and c are the input, hidden, and output dimensions, respectively. The non-linear activation function, f(·), is chosen based on the task (e.g., sigmoid for classification, linear for regression tasks).
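As a small illustration (a sketch under assumed parameter names and shapes, not the implementation used in this thesis), the recurrence in Eq. 5.3.30 can be unrolled over a sequence as follows; the hidden state h is the internal memory that is carried forward through time.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_forward(X, W, U, V, f=lambda z: z):
        # Vanilla RNN of Eq. 5.3.30 with bias terms omitted.
        # X: (T, d) input sequence; W: (r, d); U: (r, r); V: (c, r);
        # f is the task-dependent output activation.
        h = np.zeros(W.shape[0])             # initial hidden state
        outputs = []
        for x_t in X:
            h = sigmoid(W @ x_t + U @ h)     # the hidden state carries past information forward
            outputs.append(f(V @ h))
        return np.array(outputs), h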
This architecture mainly focuses on the short term changes since the output at any time step is obtained using the hidden state of the previous time step. On the other hand, learning to store information about long sequences is not feasible with this architecture due to the decaying error flow back through time. The LSTM network alleviates training on long sequences by introducing an extra memory component, called the cell memory. The cell memory is designed to behave as a long term memory, while the hidden state captures the short term changes. The mathematical model of the standard LSTM can be found in Table 5.1. Each equation from 5.3.1 to 5.3.4 comprises an input and a recurrent layer with a non-linear activation function, similar to the vanilla RNN in Eq. 5.3.30. The hidden state at each time step, in Eq. 5.3.6, is directly mapped from the updated cell memory controlled by the output gate. Thus, long and short term memory components are tightly coupled in LSTM networks. The cell memory is updated using a weighted sum of the candidate and previous memories, as given in Eq. 5.3.5. The weights, f_t and i_t in Eq. 5.3.5, are named the forget and input gates. The purpose of the multiplicative input gate is to protect the memory content stored in the previous time step from perturbations [58]. The additive component in Eq. 5.3.5 prevents the gradient values from vanishing due to the Constant Error Carousel (CEC). CEC occurs when the values of the forget gate and input gate are nearly 1; in other words, information flows to the next time step unchanged. In this case, the derivative does not have a severe decaying effect when the error back-propagates. It is also due to the fact that the expression in Eq. 5.3.5 does not have a non-linear activation function with flat regions.

Figure 5.3 Standard LSTM unit.

In a standard LSTM unit, shown in Figure 5.3, the three gates, forget, input, and output, regulate the information flow to the next time steps. On the other hand, the purpose of these gates can also be considered as feature extraction, since each gate has an input layer. However, the value of each gate is computed using the same formulation; the only difference is that the gate weights are not shared. Even though a different set of weights is learned for each gate, the contribution of each individual gate to the predictive power is not always significant. For instance, in the literature, it is often discussed that the output gate may not be necessary [78]. Similarly, some studies propose to use a single forget gate and derive the input gate as 1 − f. For multivariate time series, using all three gates can provide the complexity required to encode the underlying patterns in the features and the temporal sequence. Otherwise, the LSTM architecture is prone to overfitting due to its high number of parameters. GRU was proposed to alleviate the overfitting in LSTM by eliminating one of the gates. The two gates given in Eq. 5.3.7 and Eq. 5.3.8 are called the update and reset gates, respectively. The update gate learns how much of the information contained in the hidden state needs to be discarded. In that sense, the update gate behaves as the forget gate in LSTM. Similarly, the reset gate determines how much information from the previous hidden state is added to the memory of the current time step. Thus, the reset gate corresponds to the input gate in LSTM. In conclusion, the fundamental difference between LSTM and GRU is the number of parameters; otherwise, the memory is computed with a similar approach.
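The LSTM and GRU updates discussed above (Eqs. 5.3.1-5.3.10 in Table 5.1) can be written as two short step functions. The sketch below uses NumPy only to keep the example self-contained (the thesis models are implemented in TensorFlow), and the parameter dictionary p, keyed by the symbols in the equations, is an assumed layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM step following Eqs. 5.3.1-5.3.6."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output gate
    C_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate memory
    C = f * C_prev + i * C_tilde          # additive cell memory update (Eq. 5.3.5)
    h = o * np.tanh(C)                    # hidden state coupled to the cell memory
    return h, C

def gru_step(x, h_prev, p):
    """One GRU step following Eqs. 5.3.7-5.3.10."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])        # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde
```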
In the literature, the performance of LSTM is often reported to be better than that of GRU. GRU is mostly preferred when there is not enough data. In addition to LSTM and GRU, there are various other gated units [47, 127] aiming to reduce the number of parameters and ease training. On the other hand, the memory is commonly coupled in the aforementioned models, and none of these techniques reports a significant performance improvement over LSTM.

RNN and its variants originally assume a fixed sampling rate, namely regular elapsed time. As discussed in Chapter 4, this assumption may not be reasonable for real world problems, especially for healthcare applications. The proposed T-LSTM [11] architecture attempts to decompose the cell memory into short and long-term memories. The main motivation behind T-LSTM is to incorporate the elapsed time in such a way that the contribution of the information of the previous state is discounted when there is a large time gap. The memory decomposition aims to ensure that the important memory content is not altered while applying the time decay. Recently, a modified T-LSTM [124] was proposed, where the cell memory is decomposed using two different tanh layers. Both approaches attempt to modify the cell memory while the original gates of the LSTM unit are preserved. In the following section, we discuss a different approach, named external memory networks, to utilize memory in dynamic systems. External memory networks combine a standard neural network (e.g., LSTM) with an additional external memory.

5.3.2 External Memory Architectures

The goal of the RNN architectures is essentially memorizing the temporal pattern of the input time sequence. For this purpose, the information observed at each time step is encoded into a hidden state (e.g., hidden state and cell memory for LSTM). The hidden state is a dense vector; thus, the temporal knowledge is condensed into a single vector. As a result, the capacity of the memory maintained by recurrent networks is considered limited [52]. To address the limited memory capacity of RNN and its variants, deep networks with external memories have been proposed in the literature. Due to the increased memory capacity, the external memory models are suitable for modeling very long sequences. This family of networks commonly implements an external memory matrix with a read/write mechanism based on attention. Namely, a weighted sum of different memory locations is used, where the weights are computed based on the input content.

5.3.2.1 Neural Turing Machine

One of the popular deep architectures with external memory is the Neural Turing Machine (NTM) [52]. NTM, shown in Figure 5.4, comprises a memory matrix, a controller, and a selective mechanism to read and write the memory. The controller behaves as an interface between the memory and the input, and interacts with the memory through an addressing mechanism. The attention weights used in reading and writing are learned using the addressing mechanism. Two of the addressing approaches discussed by Graves et al. [52] are addressing by content similarity and by location. In content-based addressing, the attention weights are determined based on the similarity between the content and the memory locations, whereas in location-based addressing, the controller decides which previous memory values to keep and discard. The writing mechanism comprises erase and add steps, similar to the forget and input gates in LSTM, respectively. On the other hand, the reading mechanism computes a weighted sum of the memory locations.
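As a concrete illustration of content-based addressing, the following is a minimal NumPy sketch of a single content-based read: attention weights are obtained from the cosine similarity between a query key and the memory rows, sharpened with a softmax, and the read vector is the resulting weighted sum. This is a simplified sketch of the idea described above (the key strength beta follows the convention in [52]), not a full NTM implementation.

```python
import numpy as np

def content_based_read(memory, key, beta=1.0):
    """Read from an external memory by content similarity.

    memory : (N, M) array, N memory locations of width M
    key    : (M,) query key emitted by the controller
    beta   : key strength that sharpens the attention (as in [52])
    """
    eps = 1e-8
    # Cosine similarity between the key and every memory location.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sims
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # soft attention over the N locations
    read_vector = weights @ memory      # differentiable weighted sum of locations
    return read_vector, weights
```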
As opposed to a standard Turing machine, where a single memory location is accessed, reading the memory as a weighted sum of the memory locations yields a differentiable model. As a result, iterative gradient descent based optimization techniques can be used to learn the NTM parameters.

Figure 5.4 Neural Turing Machine [52]

5.3.2.2 Memory Networks

Another common external memory approach is the memory network proposed by Weston et al. [118]. The memory network, shown in Figure 5.5, contains a memory and four components, namely the input feature map, generalization, output feature map, and response. The input feature map projects the original input into a latent feature representation. The generalization component updates the memory units using the new input. The output feature map reads the most relevant memory locations, guided by the current memory and the new input, to produce the new output. Finally, the response component converts the output into a desired format depending on the task. The aforementioned components are usually learnable, and a suitable machine learning model can be used to implement the procedure. However, this approach is fully supervised and needs to iterate over the entire memory. End-to-end memory networks [108] offer less dependence on supervision by replacing the hard argmax selection with a soft attention mechanism. In other words, the original memory network retrieves the most relevant memory locations, whereas the end-to-end memory network computes a weighted sum over all the memory locations. Memory networks have much simpler memory reading and writing mechanisms compared with NTM. In addition, the external memory in memory network architectures mainly focuses on retrieving the relevant information from the memory rather than updating it.

Figure 5.5 Memory Networks [118]

5.3.2.3 Disadvantages of External Memory

Due to their increased memory capacity, NTM and memory networks perform better than LSTM for long sequences (e.g., > 200) [52, 118]. However, the performance analysis of the aforementioned external memory models is usually discussed for straightforward tasks and limited domains. For instance, the original NTM study [52] reports better performance than LSTM for abstract tasks, such as priority sort, where the network sorts a sequence of binary vectors according to a scalar priority weight. On the other hand, the memory network [118] and the end-to-end memory network [108] are evaluated on textual reasoning tasks, such as question answering. Memory networks yield slightly better performance compared with LSTM and NTM on language modeling tasks. Moreover, the aforementioned memory networks incorporate feedforward and recurrent architectures (e.g., RNN and LSTM) to control the read and write mechanisms. For this reason, combining standard deep models with an external memory increases the model complexity. The increased model complexity may be justified for certain tasks, but it degrades the generalizability of the external memory models to a wide variety of tasks with different levels of difficulty. The disadvantages of memory networks and NTM are summarized below.

• The number of learnable parameters is comparatively higher than in RNN architectures.
• They are hard to parallelize due to the sequential memory access and update mechanisms.
• Training is difficult, and employing an external memory can be impractical.
• Numerical instability is more prominent than in RNN architectures.

5.3.3 Decoupled Memory Gated Recurrent Network (DM-GRN)

DM-GRN is proposed to address two potential problems in the T-LSTM model.
When the elapsed time information is not available, T-LSTM assumes uniform elapsed time and behaves like a standard LSTM network. Even though the actual elapsed time is not known, irregularities due to different sampling rates and noise may still exist in the time series. The second potential issue is due to the coupled memory discussed in the previous sections. Since the T-LSTM model follows the same procedure as LSTM after the cell memory modification, the long and short term memories are tightly coupled. Although the performance of LSTM and its variants is often superior to the vanilla RNN model, the coupled memory may become a limitation when analyzing time series with different sampling rates and frequencies. A long term memory is usually expected to store the global temporal pattern, while the purpose of a short term memory is to capture local changes in the time series signal. LSTM's cell memory is considered a long term memory; however, as can be seen in Eq. 5.3.4, the candidate used in the cell memory update contains a recurrent layer of only the last time step. Therefore, the distinction between the long and short term memories is not clear.

To address the aforementioned issues, we propose DM-GRN, a recurrent neural model with an internal attention mechanism. The main hypothesis of the proposed approach is that the decoupled short and long term memories focus on different components of the input time sequences. Since the purpose is not a frequency domain analysis, the short and long term memories are not expected to explicitly capture the high and low frequencies in the time series. For this reason, the hypothesis does not assume a perfect distinction between the short and long term memories in terms of frequencies. On the other hand, the long term memory is assumed to capture the overall trend, and the short term memory should be more receptive to rapid changes in the time sequence. The proposed approach, whose mathematical model is given below, fuses the weighted long and short term memories to obtain a total memory. The total memory is later used to predict the future values of the time sequence using a task-specific output layer.

Forget gate:        f_t = σ(W_f Δx_t + U_f M_{t−1} + b_f) (5.3.31)
Input gate:         i_t = σ(W_i x_t + b_i) (5.3.32)
Short term memory:  M^short_t = tanh(W_short x_t + U_short M_{t−1} + b_short) (5.3.33)
Long term summary:  M*_{t−1} = Σ_{ℓ=1}^{L} w_ℓ M_{t−ℓ} (5.3.34)
Long term memory:   M^long_t = tanh(W_long x_t + U_long M*_{t−1} + b_long) (5.3.35)
Total memory:       M_t = f_t ∗ M^long_t + i_t ∗ M^short_t (5.3.36)

where {W_f, W_i, W_short, W_long} are d_x × d_h dimensional input layer weights, {U_f, U_short, U_long} are d_h × d_h dimensional recurrent layer weights, M_t ∈ R^{d_h} is the total memory at time t, Δx_t = (x_t − x_{t−1}), and {w_ℓ}_{ℓ=1}^{L} are the attention weights computed as below:

w_ℓ = softmax(v^T tanh(W_x x_t + W_h M^long_{t−1})) (5.3.37)

where softmax(y_i) = e^{y_i} / Σ_j e^{y_j}, and {v ∈ R^{d_h}, W_x ∈ R^{d_x×d_h}, W_h ∈ R^{d_h×d_h}} are the softmax layer weights.

Due to the recurrent state transitions, the current memory is considered a function of the previous time steps, M_t = f(x_1, x_2, · · · , x_t). On the other hand, the contribution of each time step's input is not equal. According to the theoretical analysis conducted by Le et al. [74], the contributions of the past inputs tend to gradually decay. For this reason, the effect of the past time steps on the current memory is challenging to leverage. Therefore, we propose to use a weighted sum of the past memory components while updating the long term memory.
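To make the update above concrete, the following is a minimal NumPy sketch of a single DM-GRN step following Eqs. 5.3.31-5.3.37. It is an illustration of the equations rather than the thesis implementation (which uses TensorFlow), the parameter dictionary layout is an assumption, and, since Eq. 5.3.37 writes the attention score in terms of M^long_{t−1} only, scoring each of the L lagged memories separately is one possible reading adopted here so that the weights differ across lags.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def dm_grn_step(x, x_prev, M_prev, M_hist, p, L=5):
    """One DM-GRN step (Eqs. 5.3.31-5.3.36) with the attention of Eq. 5.3.37.

    x, x_prev : current and previous inputs
    M_prev    : previous total memory M_{t-1}
    M_hist    : list [M_{t-1}, ..., M_{t-L}] of the last L total memories
    p         : dict of weights named after the symbols in the equations (assumed layout)
    """
    dx = x - x_prev                                                         # Delta x_t
    f = sigmoid(p["Wf"] @ dx + p["Uf"] @ M_prev + p["bf"])                  # Eq. 5.3.31
    i = sigmoid(p["Wi"] @ x + p["bi"])                                      # Eq. 5.3.32
    M_short = np.tanh(p["Wshort"] @ x + p["Ushort"] @ M_prev + p["bshort"]) # Eq. 5.3.33

    # Attention weights over the last L memories; each lagged memory is scored
    # separately (an assumption, see the lead-in) and normalized with a softmax.
    scores = np.array([p["v"] @ np.tanh(p["Wx"] @ x + p["Wh"] @ M_hist[l])
                       for l in range(L)])
    w = softmax(scores)                                                     # {w_l}, Eq. 5.3.37

    M_star = sum(w[l] * M_hist[l] for l in range(L))                        # Eq. 5.3.34
    M_long = np.tanh(p["Wlong"] @ x + p["Ulong"] @ M_star + p["blong"])     # Eq. 5.3.35
    M_total = f * M_long + i * M_short                                      # Eq. 5.3.36
    return M_total, M_long, M_short
```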
In Eq. 5.3.34, the long term summary is defined as the weighted sum of the L previous time steps, where L can be determined by considering the sequence length and using cross-validation. The differences between the proposed DM-GRN, visualized in Figure 5.6, and T-LSTM are summarized below.

• DM-GRN discards the recurrent layer in the input gate. In this case, the input layer focuses only on extracting features from the current input.
• In the forget gate, Δx_t = (x_t − x_{t−1}) is used as the input along with a recurrent layer over the previous memory. The purpose of utilizing Δx_t is to encourage the forget gate to put more emphasis on the change in the signal since the last time step, rather than on its instantaneous value. Thus, the elapsed time irregularity is implicitly incorporated in the model.
• As opposed to LSTM and T-LSTM, the short and long term memories are decoupled in DM-GRN. Two separate tanh layers, comprising the current input and the previous memory, are utilized to decouple the total memory. The two memory components are then weighted using the outputs of the input and forget gate filters and aggregated to obtain the updated total memory.
• DM-GRN eliminates the output gate, based on the evidence in the literature that the contribution of the output gate is not significant.
• An internal attention mechanism is adopted to leverage the information stored in the previous time steps. Since the information stored in the memory of past time steps may be discarded and overwritten, the effect of the input at previous time steps is not expected to be completely preserved. For this reason, attention is used to increase the capacity of the long term memory component.

In the next section, the behavior of the proposed architecture for different tasks is investigated, and the performances of the baseline approaches and DM-GRN are compared.

Figure 5.6 Proposed decoupled memory gated recurrent network (DM-GRN) unit.

5.4 Experiments

In this section, experimental results are presented and interpreted. The performance of the proposed model is investigated and compared to several other gated recurrent architectures. Synthetic data experiments are designed to analyze and visualize the behavior of the memory components. Real world data experiments are conducted to compare the predictive performance of the proposed model and the baselines. To make a fair comparison, the number of epochs, the hidden dimensionality, and the data used in the training and test phases were fixed for all the baselines and the proposed approach. Since better performance is usually observed with 5 previous time steps, the parameter L in DM-GRN is fixed to 5 for all the experiments. All the models are implemented with TensorFlow and Python.

Figure 5.7 Synthetically generated input signal and its corresponding target.

5.4.1 Synthetic Data

Synthetic data experiments are conducted to investigate and visualize the behavior of the proposed model in comparison with baseline methods. The predictive power of the models is also tested on the synthetic EHR dataset [69] used in Chapter 4.

5.4.1.1 Periodic Signal

In this experiment, a periodic signal with short term changes was created. A total of 5,000 samples with a sequence length of 200 were generated from a normal random distribution. To analyze the performance of the proposed approach, the first 150 time steps are used as the input to the recurrent models and the last 50 time steps are predicted. The input and corresponding target of one of the samples are shown in Figure 5.7.
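The data split described above can be reproduced approximately with a few lines of code. The sketch below is only illustrative: the exact generating process (base period, noise level) is not stated in the text, so those parameters are assumptions; only the overall recipe, 5,000 samples of length 200 containing a periodic trend plus short term fluctuations, split into a 150-step input and a 50-step target, follows the description.

```python
import numpy as np

def make_periodic_dataset(n_samples=5000, seq_len=200, input_len=150, seed=0):
    """Generate periodic sequences with short term changes and split each one
    into a 150-step model input and a 50-step prediction target."""
    rng = np.random.default_rng(seed)
    t = np.arange(seq_len)
    # Slow periodic trend (long term component) with a random phase per sample.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, 1))
    trend = np.sin(2.0 * np.pi * t / 50.0 + phase)      # assumed period of 50 steps
    # Short term changes drawn from a normal random distribution.
    noise = 0.3 * rng.standard_normal((n_samples, seq_len))
    signal = trend + noise
    return signal[:, :input_len], signal[:, input_len:]

x, y = make_periodic_dataset()
print(x.shape, y.shape)   # (5000, 150) (5000, 50)
```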
A single layer DM-GRN is trained with a 32-dimensional hidden state, and the prediction performance after 50 epochs is observed. The final prediction is computed using the total memory; however, predictions using only the short term and only the long term memory components are also generated to investigate the behavior of the decoupled memory. Prediction results are visualized for two of the samples in the synthetic dataset in Figure 5.8. According to our observation, the values predicted by the short term memory can follow the temporal changes, but do not fit the target perfectly. On the other hand, the long term memory component demonstrates the overall trend in the sequence and complements the short term memory. Thus, the final prediction perfectly superimposes on the original target. This result is consistent with our initial hypothesis: the short term memory component is expected to capture the temporal changes between consecutive time steps, while the long term memory component is expected to remember the past information and complement the short term memory. In this problem, the elapsed time is uniform. For this reason, we also investigate the behavior of the cell memory in the standard LSTM architecture. The predictions obtained using the hidden state and the cell memory of the LSTM unit are given in Figure 5.9. According to our observation, the information contained in the cell memory does not directly reflect the predictive power. We also observe the memory activations of LSTM and DM-GRN to visualize the memory content of both architectures. In Figure 5.10, heatmap visualizations of the short and long term memory components of DM-GRN and LSTM are given. In the figure, the short and long term memories of LSTM correspond to the hidden state and the cell memory, respectively. The periodic nature of the signal and the short term changes are more observable in the DM-GRN memory components than in the LSTM hidden states and cell memory. This result provides evidence that the difference between the long and short term memories in the proposed architecture is more observable and informative compared to LSTM.

5.4.1.2 Synthetic EHR

In this experiment, the synthetically generated EHR dataset (http://www.emrbots.org/) used in Chapter 4 was utilized to evaluate the proposed approach for multivariate time series. We considered the same task as in Chapter 4, which is to classify patients as diabetic or healthy. Diabetes Mellitus patients and patients without diabetes were sampled, resulting in 6,730 subjects in total. Input features were 529-dimensional multi-hot vectors representing the common diagnoses in the dataset. Classification accuracy (ACC) and area under the ROC curve (AUC) are reported for 5 different train and test data splits. The dimensionality of the hidden states and the fully connected layer were set to 1,024 and 64, respectively. Binary classification results of single layer DM-GRN, T-LSTM, LSTM, T-LSTM Variant [124], and GRU are compared after 100 epochs in Table 5.2. In this problem, the elapsed time between consecutive records of patients is available, and it is not uniform, as discussed in Chapter 4. T-LSTM and its variant [124] utilize the elapsed time to modify the LSTM memory, while DM-GRN tackles the irregular elapsed time with a decoupled memory architecture.
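The multi-hot inputs described above can be built in a few lines. The following sketch is generic and not the thesis code: the diagnosis vocabulary and the per-visit code lists are placeholders, and in the experiment the vocabulary would contain the 529 diagnoses kept in the study.

```python
import numpy as np

def multi_hot_visits(visit_codes, vocabulary):
    """Encode each visit's diagnosis codes as a multi-hot vector.

    visit_codes : list of visits, each a list of diagnosis codes (e.g., ICD-9 strings)
    vocabulary  : list of all diagnosis codes kept in the study
    Returns an array of shape (num_visits, len(vocabulary)).
    """
    index = {code: j for j, code in enumerate(vocabulary)}
    X = np.zeros((len(visit_codes), len(vocabulary)), dtype=np.float32)
    for i, codes in enumerate(visit_codes):
        for code in codes:
            if code in index:        # diagnoses outside the vocabulary are ignored
                X[i, index[code]] = 1.0
    return X

# Example with placeholder ICD-9 codes:
vocab = ["250.00", "401.9", "272.4"]
visits = [["250.00", "401.9"], ["272.4"]]
print(multi_hot_visits(visits, vocab))
```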
Table 5.2 Diabetes classification experiment using the synthetic EHR dataset. Average AUC and ACC are reported for 5 different train and test data splits.

Method                 | AUC (std)    | ACC (std)
DM-GRN                 | 0.952 (0.03) | 0.932 (0.02)
T-LSTM                 | 0.923 (0.02) | 0.897 (0.02)
LSTM                   | 0.911 (0.02) | 0.878 (0.03)
GRU                    | 0.895 (0.02) | 0.866 (0.03)
T-LSTM Variant [124]   | 0.933 (0.02) | 0.888 (0.03)

In this experiment, the proposed DM-GRN performs better than the baseline gated architectures in terms of classification accuracy. T-LSTM and its variant also demonstrate higher accuracy than LSTM and GRU. This observation indicates that addressing the elapsed time irregularity along with memory modifications improves the predictive performance on the synthetic EHR dataset.

5.4.2 Traffic Speed Prediction

The application of the proposed approach is not limited to healthcare. Time series data with short and long-term changes can be encountered in many domains, such as intelligent transportation systems. Timely and accurate prediction of traffic speed and flow facilitates traffic management, travel scheduling, and trip advisory systems. In particular, ride-sharing companies are interested in speed forecasting applications to avoid travel delays and improve their service quality during rush hours. Traffic speed prediction aims to infer future traffic speed values on a road segment given its past traffic speed information. In this section, we used a real world traffic speed dataset provided by DIDI (a Chinese ride-sharing company), collected in the city of Chengdu, China, shown in Figure 5.11, during November 2016 [34]. The raw data contains GPS files comprising driver ID, order ID, time stamp, longitude, and latitude. Pre-processing steps, such as coordinate transformation, map matching, and trajectory-to-speed transformation, are required to be able to utilize the traffic data in speed forecasting tasks. In this study, we utilized pre-processed traffic data, which includes the speed time series of the road segments in a sub-region, for instance the red bounding box in Figure 5.11. Some of the roads have an insufficient amount of speed information due to missing values. Such roads are eliminated, and a total of 3,432 road segments are used in the experiments. Each road has a time sequence of 144 elements corresponding to speed values aggregated every 10 minutes over 24 hours, for 30 days. The temporal change of the speed values for one of the roads during a week in November 2016 is given in Figure 5.12. Here, we tackle a univariate time series analysis problem. As shown in Figure 5.12, approximately the first 7 hours of the day demonstrate lower variance compared with the change in speed values during the rest of the day. However, this pattern is not assumed to be consistent across all the road segments in the sub-region. Due to missing values and noise, forecasting future speed values becomes a challenging task. Moreover, road segments do not usually follow a prominent pattern in this dataset. The main goal of the traffic prediction problem is to learn the temporal pattern of the target road segment so that the future speed predictions will be reliable. In addition, spatial dependencies between neighboring road segments may also affect the temporal trend in the speed time series. Traffic data can be considered as a road network, shown in Figure 5.11, where the nodes are intersections and the edges are the roads. As a result, graph analysis techniques are beneficial for detecting the spatial dependencies. On the other hand, the connectivity of the nodes on the graph does not necessarily reflect reality.
Even though the incoming and outgoing traffic flows are different, some nodes can appear connected on the graph. Such anomalies degrade the performance of spatial models. For this reason, only temporal dependencies are taken into account in this experiment.

Table 5.3 Mean absolute error (MAE) and root mean square error (RMSE) are reported for the traffic prediction task.

Method                 | MAE   | RMSE
DM-GRN                 | 1.064 | 1.465
T-LSTM                 | 1.030 | 1.437
LSTM                   | 1.021 | 1.425
GRU                    | 0.966 | 1.347
T-LSTM Variant [124]   | 1.077 | 1.489

To evaluate the traffic prediction performance of the proposed model and the baselines, the speed information over 10-minute intervals is aggregated further to a sequence length of 24. Thus, each time step represents an hour of the day, facilitating interpretation of the prediction results. The aggregated speed time series of one of the road segments is illustrated in Figure 5.13. The input and output time series are prepared in the following way: all the time sequences for a given day of the week in November, except the last one (e.g., there are 4 Sundays in November 2016), are concatenated to form the input time series, and the remaining sequence is used as the target to predict. As a result, a 72-step input time series is used to predict a 24-step target sequence. The task was posed as a regression problem, where the last hidden state is mapped to the 24-dimensional target. In this experiment, consecutive speed values of some road segments are missing. As a result, we still need to tackle irregular elapsed time even though the speed values are aggregated. We used 80% of the road segments, with all the available days, for training and the rest of the road segments for testing. Mean absolute error (MAE) and root mean square error (RMSE) are computed for each day, and then the average error is reported in Table 5.3. Traffic prediction is a challenging problem due to the fact that the speed time series of the road segments are very irregular. For this reason, the error is usually high for all the methods compared in this study. LSTM and GRU perform slightly better than the proposed DM-GRN, whereas DM-GRN performs better than the T-LSTM Variant. The prediction results can also be observed in Figure 5.14. The short and long term memory components of the proposed DM-GRN demonstrate a behavior similar to the univariate synthetic data experiment. In other words, the long term prediction follows the global trend of the time series, while the short term component is more susceptible to the short term changes.

5.4.3 Diabetes Data

In this section, a binary classification experiment is conducted to evaluate the performance of the proposed approach on multi-variate time series data. The dataset, also used in Chapter 2, comprises encounter data (emergency, outpatient, inpatient), demographics, diagnoses, and in-hospital procedures (e.g., laboratory and pharmacy data) [107]. Detailed information about the dataset is provided in Table 5.4. In Chapter 2, the dataset was used for the patient phenotyping task, which is an unsupervised problem aiming to group patients based on common diagnoses. For this reason, we had mainly focused on diagnoses and demographics information. However, in this experiment, the goal is to predict the readmission of a patient given the clinical information collected during each of the patient's previous hospital admissions.

Table 5.4 Diabetes dataset statistics.
Time span                                         1999 - 2008
Total number of records                           101,766
Total number of patients                          71,518
Average number of admissions                      14
Number of patients with more than 3 admissions    3,011
Total number of records of 3,011 patients         16,169

Readmission prediction is a supervised problem, where the temporal patterns in the medical history of patients play an important role. In the dataset, ground-truth readmission information is available for each hospital admission of a patient: readmission after more than 30 days, readmission after less than 30 days, or no readmission. In biomedical informatics, a 30-day time window is usually preferred for the readmission prediction problem, since the 30-day criterion is usually considered by funding agencies [107]. Thus, the records of each patient are labeled as readmission if the patient was readmitted (before or after 30 days), and as no readmission if there is no information about readmission. As a result, the problem becomes predicting the binary label at each time step of a sequential multi-variate time series. The deep recurrent models discussed in this chapter are applicable to the problem described above. However, some patients have very few hospital admissions in the dataset. In this case, it is not reasonable to use a recurrent model since there will not be a prominent temporal pattern to capture. To avoid this situation, we eliminated patients with fewer than 3 hospital admissions in our experiments, resulting in 16,169 records from 3,011 patients. In this dataset, patient records are ordered; however, the actual time stamp of a record is not provided. As a result, the elapsed time between consecutive records is not known. As discussed in Chapter 4, consecutive records in EHR datasets often have irregular elapsed times. For this reason, we cannot simply assume uniform elapsed time in this dataset. T-LSTM could be utilized if the elapsed time were provided. We used 70% of the patients for training and 30% for testing. Each record is represented with an 85-dimensional feature vector, and the feature vectors are normalized to unit length. Each feature vector contains time in hospital, number of diagnoses, and a total of 25 medications related to diabetes and other diagnoses. Nominal features (e.g., medications) are represented with one-hot vectors. Each model is trained for 300 epochs with 512-dimensional hidden layers and one 128-dimensional fully connected layer. The readmission prediction performances of the LSTM, GRU, T-LSTM Variant, and the proposed DM-GRN models are given in Table 5.5.

Table 5.5 Readmission prediction performance on the diabetes dataset. Average accuracy (ACC) and area under ROC curve (AUC) are reported for 5 different train and test splits. Standard deviation is given inside the parentheses.

Method                 | ACC (std)    | AUC (std)
DM-GRN                 | 0.941 (0.03) | 0.944 (0.03)
LSTM                   | 0.891 (0.05) | 0.915 (0.04)
GRU                    | 0.906 (0.05) | 0.934 (0.04)
T-LSTM Variant [124]   | 0.912 (0.01) | 0.911 (0.02)

The proposed approach, DM-GRN, yields the best readmission prediction performance in terms of classification accuracy (ACC) and area under the ROC curve (AUC). This result indicates that even though the elapsed time is not explicitly available, there are time irregularities in the dataset that the proposed model can address better than the baselines.
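The preprocessing described in this experiment can be summarized in a short, hedged sketch. The column names ('patient_id', 'readmitted', and the rest) are placeholders and not the thesis code; only the steps mirror the description above: keep patients with at least 3 admissions, binarize the readmission label, one-hot encode nominal features, and normalize each record's feature vector to unit length.

```python
import numpy as np
import pandas as pd

def prepare_readmission_sequences(df, min_admissions=3):
    """Sketch of the preprocessing described above, under assumed column names.

    Assumes df has one row per hospital admission with hypothetical columns
    'patient_id', 'readmitted' (values '<30', '>30', 'NO'), nominal medication
    columns, and numeric columns such as 'time_in_hospital' and 'num_diagnoses'.
    Records are assumed to be ordered per patient.
    """
    # Keep only patients with at least `min_admissions` admissions.
    counts = df.groupby("patient_id")["patient_id"].transform("size")
    df = df[counts >= min_admissions].copy()

    # Binary label: readmission (before or after 30 days) vs. no readmission.
    df["label"] = (df["readmitted"] != "NO").astype(np.float32)

    # One-hot encode nominal features (e.g., medications); numeric ones stay as-is.
    nominal_cols = [c for c in df.columns
                    if c not in ("patient_id", "readmitted", "label")
                    and df[c].dtype == object]
    features = pd.get_dummies(df, columns=nominal_cols)
    feature_cols = [c for c in features.columns
                    if c not in ("patient_id", "readmitted", "label")]

    # Build per-patient sequences and normalize each feature vector to unit length.
    sequences, labels = [], []
    for _, visits in features.groupby("patient_id", sort=False):
        x = visits[feature_cols].to_numpy(dtype=np.float32)
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        x = x / np.maximum(norms, 1e-8)
        sequences.append(x)                        # (num_visits, feature_dim)
        labels.append(visits["label"].to_numpy())  # one label per visit
    return sequences, labels
```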
5.5 Summary

In this chapter, we investigate the role of memory in gated recurrent networks and propose a new model to address the tightly coupled memory in LSTM architectures. The proposed model, DM-GRN, decouples the memory into short and long-term memory components and eliminates redundant gates and layers present in the standard LSTM. Based on empirical evaluations, the short and long term components are observed to capture different dynamics in the input time series signals. Even though the predictive power of the proposed approach is not always superior to the standard LSTM for univariate time series data, the difference between its long and short term memories is more interpretable. Furthermore, it is often observed that the memory decoupling in DM-GRN performs better than the T-LSTM Variant. As a result, when the elapsed time is not explicitly provided, the proposed approach can offer a solution to capture different dynamics in the time series data.

Figure 5.8 Prediction performance on the synthetic data: (a) Sample 1, (b) Sample 2. When only the short term memory is used to predict the target, the prediction error is high, but the predicted values can follow the temporal changes. On the other hand, the long term memory component demonstrates the overall trend and complements the short term memory. Thus, the final prediction perfectly superimposes on the original target.

Figure 5.9 Predictions obtained using the hidden state and the cell memory of the standard LSTM unit. The cell memory is assumed to behave as a long term memory. According to our observation, the information contained in the cell memory does not directly reflect the predictive power.

Figure 5.10 Long and short term memory contents of DM-GRN, and hidden state and cell memory contents of LSTM for the synthetic data: (a) DM-GRN Short Term, (b) DM-GRN Long Term, (c) LSTM Short Term, (d) LSTM Long Term. LSTM short term and long term memories correspond to the hidden states and the cell memory, respectively. Heatmaps are used to visualize the memory activations. The periodic nature of the signal and the short term changes are more observable in the DM-GRN memory components than in the LSTM hidden states and cell memory.

Figure 5.11 Road network of a subregion in the city of Chengdu, China. Intersections are represented with nodes, and the roads are represented by the edges. Due to noise and errors during the map matching procedure, roads connected on the graph might not be connected in real life.

Figure 5.12 Speed time series of one road over 4 days of the same week in November 2016. Traffic data suffers from missing values. Roads with relatively few missing values are kept and the missing values are interpolated.

Figure 5.13 Speed time series for one road for 4 days in November 2016. Each time step, in the 24-step sequence, represents the aggregated speed value of one hour.

Figure 5.14 Traffic prediction performance: (a) DM-GRN, (b) LSTM, (c) T-LSTM, (d) T-LSTM Variant. The long term memory of DM-GRN underestimates the actual speed values; however, the long term prediction follows the global trend of the time series, while the short term component is more susceptible to the short term changes.

Chapter 6

Summary and Suggestions for Future Work

6.1 Summary

In this thesis, several machine learning and deep learning models are developed to address different challenges encountered in biomedical informatics tasks. More specifically, the temporal, distributed, large scale, and high dimensional nature of digital patient data (e.g., EHR) is taken into account. The contributions of this thesis are summarized below.
1. A convex SPCA approach, along with a stochastic gradient descent framework, is introduced to obtain interpretable principal components. The proposed Cvx-SPCA method [9] is applied to the patient phenotyping problem, and a hierarchical visualization of the clinical phenotypes is presented.

2. An asynchronous distributed multi-task learning framework, AMTL [12], is proposed to address the challenges posed by distributed EHR data. AMTL transfers only the model vectors to alleviate privacy and bandwidth limitations. The proposed method can also be applied to a wide variety of supervised predictive tasks.

3. A new LSTM architecture, T-LSTM [11], is proposed for the analysis of time series data with unevenly sampled time steps, such as EHRs. T-LSTM is designed based on the hypothesis that the longer the time gap, the smaller the effect of the previous memory on the current output. A T-LSTM auto-encoder is proposed to encode a patient's longitudinal medical records in a single vector representation. The learned representations are used to cluster patients to assist in the patient subtyping task.

4. An extension to T-LSTM is proposed to address short and long-term changes in time series data. The designed model, DM-GRN, comprises a gated recurrent network with a decoupled memory. Decoupling the internal memory of the recurrent network facilitates more robust prediction for time series with irregularities, even when the actual elapsed time is not available.

6.2 Suggestions for Future Work

The algorithms and models designed in this thesis are not limited to healthcare applications. Distributed and temporal datasets are encountered in many other domains. In this section, potential future research directions for the proposed methods are discussed. The research presented in this thesis can be extended in the following directions.

• The stochastic sparse principal component analysis (SPCA) approach, proposed in Chapter 2, provides an interpretable dimensionality reduction and visualization tool. Due to its sparsity, SPCA automatically provides information about which input features are more important for obtaining the principal components. For this reason, one potential extension of the proposed SPCA approach is to integrate a feature selection scheme [79], which is a commonly used technique for gene expression analysis.

• The distributed asynchronous multi-task learning (AMTL) approach proposed in Chapter 3 offers a practical and robust method to develop predictive models for distributed datasets. The proposed AMTL model was posed as a distributed optimization problem with standard classification and regression objective functions. This framework can be extended to work with different input structures, such as graphs. The relatedness between distributed tasks can be defined with a weighted graph [116], and thus the proposed distributed MTL formulation can be modified to handle graph structured input. When the physical graph structure of the distributed network is consistent with the relationships between tasks, the network communication cost can be further reduced by focusing only on the communication between neighboring task nodes.

• Healthcare, finance, and intelligent transportation systems are some of the fields where time series data is the main source of information. The time series analysis approaches proposed in Chapters 4 and 5 can be extended to different application domains.
For instance, time series forecasting is one of the prominent methods to tackle traffic speed and flow estimation tasks in intelligent transportation systems. In addition to the temporal dependency in traffic variables, spatial relationships in a city graph may also effect the forecasting performance. For this reason, T-LSTM and DM-GRN, proposed in Chapter 4 and 5, respectively, can be extended to incorporate the spatial dependencies between data points. 142 BIBLIOGRAPHY 143 BIBLIOGRAPHY [1] Clinical phenotypes of copd: Identification, definition and implications for guidelines. Archivos de Bronconeumologa (English Edition), 48(3):86–98, 2012. [2] The exponential growth of data. https://insidebigdata.com/2017/02/16/the-exponential- growth-of-data/, 2017. [3] Retailrocket recommender system dataset. https://www.kaggle.com/retailrocket/ ecommerce-dataset, 2017. [4] Unified new york city taxi and uber data. https://github.com/toddwschneider/nyc-taxi-data, 2017. [5] Ratnadip Adhikari and R. K. Agrawal. An Introductory Study on Time Series Modeling and Forecasting. LAP LAMBERT Academic Publishing, 2013. [6] Metin Akay, Dimitrios I. Fotiadis, Konstantina S. Nikita, and Robert W. Williams. Guest editorial: Biomedical informatics in clinical environments. IEEE Journal of Biomedical and Health Informatics, 19(1):149–150, 2015. [7] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task fea- ture learning. Machine Learning, 73(3):243–272, 2008. [8] Necdet Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradi- ent method for composite convex optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2454–2462, 2015. [9] Inci M. Baytas, Kaixiang Lin, Fei Wang, Anil K. Jain, and Jiayu Zhou. Phenotree: Interac- tive visual analytics for hierarchical phenotyping from large-scale electronic health records. IEEE Transactions on Multimedia, 18(11):2257–2270, 2016. [10] Inci M. Baytas, Kaixiang Lin, Fei Wang, Anil K. Jain, and Jiayu Zhou. Stochastic convex sparse principal component analysis. EURASIP Journal on Bioinformatics and Systems Biology, 2016(1):15, 2016. [11] Inci M. Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K. Jain, and Jiayu Zhou. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD Inter- national Conference on Knowledge Discovery and Data Mining, pages 65–74, 08 2017. [12] Inci M. Baytas, Ming Yan, Anil K. Jain, and Jiayu Zhou. Asynchronous multi-task learning. In 2016 IEEE 16th International Conference on Data Mining, pages 11–20, 2016. [13] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Science, 2(1):183–202, 2009. [14] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. arXiv:1206.5538v3[cs.LG], 2014. 144 [15] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. [16] Elmer V. Bernstam, Jack W. Smith, and Todd R. Johnson. What is biomedical informatics? Journal of Biomedical Informatics, 43(1), 2010. [17] Mike Bostock. Radial reingoldtilford tree. http://bl.ocks.org/mbostock/4063550, 2017. [18] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed op- timization and statistical learning via the alternating direction method of multipliers. 
Foun- dations and Trends in Machine Learning, 3(1):1–122, 2011. [19] Matthew Brand. Fast online SVD revisions for lightweight recommender systems. In Pro- ceedings of SIAM International Conference on Data Mining, pages 37–46, 2003. [20] Jian-Feng Cai, Emmanuel J Cand`es, and Zuowei Shen. A singular value thresholding algo- rithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010. [21] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. [22] Chao Che, Cao Xiao, Jian Liang, Bo Jin, Jiayu Zhou, and Fei Wang. An rnn architecture In with dynamic temporal matching for personalized predictions of parkinson’s disease. Proceedings of the 2017 SIAM International Conference on Data Mining, 2017. [23] Zhengping Che, David Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In Proceedings of the 21th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, pages 507–516, 2015. [24] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Re- current neural networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865, 2016. [25] Daqing Chen, Sai Laing Sain, and Kun Guo. Data mining for the online retail industry: A case study of rfm model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3):197–208, 2012. [26] Yun Kuen Cheung and Richard Cole. Amortized analysis on asynchronous gradient descent. arXiv preprint arXiv:1412.0159, 2014. [27] Kyunghyun Cho, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statis- tical machine translation. arXiv:1406.1078v3[cs.CL], 2014. [28] Edward Choi, Mohammad Taha Bahadori, Andy Schuetzy, Walter F. Stewarty, and Doctor ai: Predicting clinical events via recurrent neural networks. Jimeng Sun. arXiv:1511.05942v11 [cs.LG], 2016. 145 [29] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation In Proceedings of the 22nd ACM SIGKDD International learning for medical concepts. Conference on Knowledge Discovery and Data Mining, 2016. [30] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3504–3512, 2016. [31] Junyoung Chung, C¸ aglar G¨ulc¸ehre, KyungHyun Cho, and Yoshua Bengio. Empirical eval- uation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. [32] Patrick L Combettes and Val´erie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005. [33] Alexandre d’aspremont, Laurent E. Ghaoui, Michael I. Jordan, and Gert R. Lanckriet. A direct formulation for sparse pca using semidefinite programming. In Advances in Neural Information Processing Systems 17, pages 41–48. MIT Press, 2005. [34] DIDI. Gaia open dataset. https://outreach.didichuxing.com/research/opendata/en/, 2018. [35] Ivo D. Dinov, Ben Heavner, Ming Tang, Gustavo Glusman, Kyle Chard, Mike Darcy, Ravi K. Madduri, Judy Pa, Cathie Spino, Carl Kesselman, Ian T. Foster, Eric W. Deutsch, Nathan D. 
Price, John Darrell Van Horn, Joseph Ames, Kristi Clark, Leroy Hood, B. M. Hampstead, William T. Dauer, and Arthur W. Toga. Predictive big data analytics: A study of parkinsons disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations. In PloS one, 2016. [36] Francesco Dinuzzo, Gianluigi Pillonetto, and Giuseppe De Nicolao. Client server multitask learning from distributed datasets. IEEE Transactions on Neural Networks, 22(2):290–303, 2011. [37] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v4[cs.CV], 2016. [38] Brent Edmunds, Zhimin Peng, and Wotao Yin. Tmac: A toolbox of modern async-parallel, coordinate, splitting, and stochastic methods. arXiv preprint arXiv:1606.04551, 2016. [39] Cristobal Esteban, Oliver Staeck, Yinchong Yang, and Volker Tresp. Predicting clini- cal events by combining static and dynamic information using recurrent neural networks. arXiv:1602.02685v1 [cs.LG], 2016. [40] evariant. What is healthcare big data? https://www.evariant.com/faq/what-is-healthcare- big-data, 2018. 146 [41] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Pro- ceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004. [42] Maryam Fazel, Haitham Hindi, and Stephen P Boyd. A rank minimization heuristic with ap- plication to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734–4739, 2001. [43] Seyed-Mohammad Fereshtehnejad, Silvia Ros-Romenets, Julius B. M. Anang, and Ronald B. Postuma. New clinical subtypes of parkinson disease and their longitudinal progression: A prospective cohort comparison with other phenotypes. JAMA Neurology, 72(8):863–873, 2015. [44] Centers for Medicare and Medicaid Services. codes: Abbreviated and full code titles. ICD9ProviderDiagnosticCodes/codes.html, 2014. Icd-9-cm diagnosis and procedure https://www.cms.gov/Medicare/Coding/ [45] Norbert Funke. Improving healthcare data management with emc isilon? think holis- tic, not in separated storage islands. https://blog.dellemc.com/en-us/improving-healthcare- data-management-with-emc-isilon-think-holistic-not-in-separated-storage-islands/, 2016. [46] Mengxuan Gao, Hideyoshi Igata, Aoi Takeuchi, Kaoru Sato, and Yuji Ikegaya. Machine learning-based prediction of adverse drug effects: An example of seizure-inducing com- pounds. Journal of Pharmacological Sciences, 133(2):70–78, 2017. [47] Yuan Gao and Dorota Glowacka. Deep gate recurrent neural network. In JMLR: Workshop and Conference Proceedings, pages 350–365. ACML, 2016. [48] Dan Garber and Elad Hazan. Fast and simple pca via convex optimization. arXiv:1509.05647v4 [math.OC], 2015. [49] Luke B. Godfrey and Michael S. Gashler. Neural decomposition of time-series data for effective generalization. CoRR, abs/1705.09137, 2017. [50] David Gotz, Fei Wang, and Adam Perer. A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data. Journal of Biomedical Informatics, 48:148–159, 2014. [51] Alex Graves, Abdel rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. arXiv:1303.5778[cs.NE], 2013. [52] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014. 
[53] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1242–1250, Jun 2014. 147 [54] Kristiina H¨ayrinen, Kaija Saranto, and Pirkko Nyk¨anen. Definition, structure, content, use and impacts of electronic health records: a review of the research literature. International Journal of Medical Informatics, 77(5):291–304, 2008. [55] Matthias Hein and Thomas B¨uhler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse pca. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 847–855. Curran Associates, Inc., 2010. [56] Jette Henderson, Joyce C. Ho, Abel N. Kho, Joshua C. Denny, Bradley A. Malin, Jimeng Sun, and Joydeep Ghosh. Granite: Diversified, sparse tensor factorization for electronic health record-based phenotyping. In 2017 IEEE International Conference on Healthcare Informatics, pages 214–223, 2017. [57] Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. Marble: High-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 115–124, 2014. [58] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. [59] Chih-Wei Huang, Richard Lu, Usman Iqbal, Shen-Hsien Lin, Phung Anh Nguyen, Hsuan- Chia Yang, Chun-Fu Wang, Jianping Li, Kwan-Liu Ma, Yu-Chuan Li, and Wen-Shan Jian. A richly interactive exploratory data analysis and visualization tool using electronic medical records. BMC Medical Informatics and Decision Making, 15(1), 2015. [60] Wei Huang, Shuru Zeng, Min Wan, and Guang Chen. Medical media analytics via ranking and big learning: A multi-modality image-based disease severity prediction study. Neuro- computing, 204:125–134, 2016. [61] Trevor Hastie Hui Zou and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006. [62] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre- Alain Muller. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, Mar 2019. [63] Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous dis- In tributed optimization using a randomized alternating direction method of multipliers. 52nd IEEE Conference on Decision and Control, pages 3671–3676, 2013. [64] Shuiwang Ji and Jieping Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 457–464, 2009. [65] Xi Jin, Ping Luo, Fuzhen Zhuang, Jia He, and He Qing. Collaborating between local and global learning for distributed online multiple tasks. In Proceedings of the 24th ACM Inter- national Conference on Information and Knowledge Management, pages 113–122, 2015. 148 [66] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Moham- mad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific Data, 3, 2016. [67] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. 
In Advances in Neural Information Processing Systems 26, pages 315– 323, 2013. [68] Michel Journee, Yurii Nesterov, Peter Richtarik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Re- search, 11(3):517–553, 2010. [69] Uri Kartoun. A methodology to generate virtual patient repositories. arXiv:1608.00570 [cs.CY], 2016. [70] Jessica Kent. Big data to see explosive growth, challenging healthcare organiza- https://healthitanalytics.com/news/big-data-to-see-explosive-growth-challenging- tions. healthcare-organizations, 2018. [71] Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert M¨uller, Alexander Zien, and S¨oren Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, pages 997–1005, 2009. [72] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [73] Richard C Larson and Amedeo R Odoni. Urban Operations Research. Prentice Hall PTR, 1981. [74] Hung Le, Truyen Tran, and Svetha Venkatesh. Learning to remember more with less mem- orization. In International Conference on Learning Representations, 2019. [75] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [76] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014. [77] Jorge Nocedal Leon Bottou, Frank E. Curtis. Optimization methods for large-scale machine learning. CoRR, abs/1606.04838, 2017. [78] Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. Long short-term memory as a dynamically computed element-wise weighted sum. CoRR, abs/1805.03716, 2018. [79] Yi Li and Zhenyu He. Robust principal component analysis via feature self-representation. In 2017 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), pages 94–99, Dec 2017. 149 [80] Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzell. Learning to diag- nose with lstm recurrent neural networks. arXiv:1511.03677v6 [cs.LG], 2016. [81] Ji Liu and Stephen J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal of Optimization, 25(1):351–376, 2015. [82] Yves A Lussier and Yang Liu. Computational approaches to phenotyping: high-throughput phenomics. Proceedings of the American Thoracic Society, 4(1):18–25, 2007. [83] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Infor- mation Retrieval. Cambridge University Press, 2008. [84] Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsu- pervised pattern discovery in electronic health care data using probabilistic clustering mod- els. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398. ACM, 2012. [85] David Mateos-Nunez and Jorge Cortes. Distributed optimization for multi-task learning via nuclear-norm approximation. IFAC Conference Paper Archieve, 48(22):64–69, 2015. [86] Nikhil Naikal, Allen Y. Yang, and Sastry. S. Shankar. Informative feature selection for object recognition via sparse pca. In 2011 International Conference on Computer Vision, pages 818–825, 2011. [87] Yurii Nesterov. 
Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. [88] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Ad- vances in Neural Information Processing Systems 27, pages 1574–1582. Curran Associates, Inc., 2014. [89] Desmond L. Nuttall, Harvey Goldstein, Robert Prosser, and Jon Rasbash. Differential school effectiveness. International Journal of Educational Research, 13(7):769–776, 1989. [90] Office of the National Coordinator for Health Information Technology. Office-based physi- https://dashboard.healthit.gov/quickstats/pages/ cian electronic health record adoption. physician-ehr-adoption-trends.php, 2016. [91] Pariwat Ongsulee. Artificial intelligence, machine learning and deep learning. International Conference on ICT and Knowledge Engineering, 2017. In 15th [92] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, Series 6, 2(11):559–572, 1901. [93] Zhimin Peng, Tianyu Wu, Yangyang Xu, Ming Yan, and Wotao Yin. Coordinate-friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1:57–119, 2016. 150 [94] Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. ARock: An algorithmic frame- work for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851–A2879, 2016. [95] Adam Perer, Fei Wang, and Jianying Hu. Mining and exploring care pathways from elec- tronic medical records with visual analytics. Journal of Biomedical Informatics, 56:369– 378, 2015. [96] Trang Pham, Truyen Tran, Dinh Phung, and Svetha Vankatesh. Deepcare: A deep dynamic memory model for predictive medicine. arxiv:1602.00357v1 [stat.ML], 2016. [97] Carmen C. Y. Poon, May D. Wang, Paolo Bonato, and David A. Fenstermacher. Edito- rial: Special issue on health informatics and personalized medicine. IEEE Transactions on Biomedical Engineering, 60(1):143–146, 2013. [98] M. Saadeq Rafieee and Ali Akbar Khazaei. A novel model characteristics for noise-robust automatic speech recognition based on hmm. In 2010 IEEE International Conference on Wireless Communications, Networking and Information Security, pages 215–218, 2010. [99] Dharavath Ramesh, Pranshu Suraj, and Lokendra Saini. Big data analytics in healthcare: A survey approach. In 2016 International Conference on Microelectronics, Computing and Communications, pages 1–6, 2016. [100] Grand View Research. Electronic health records market size worth $33.41 billion by 2025. https://www.grandviewresearch.com/press-release/global-electronic-health-records- market, 2017. [101] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671, 2012. [102] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and International Journal of Li Fei-Fei. Computer Vision, 115(3):211–252, 2015. Imagenet large scale visual recognition challenge. [103] Arthur J. Russell, Emmanouil Benetos, and Arthur d’Avila Garcez. On the memory proper- ties of recurrent neural models. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2596–2603, May 2017. [104] Ohad Shamir. 
[105] Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, PP(99):1–1, 2017.
[106] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv:1502.04681v3 [cs.LG], 2016.
[107] Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014.
[108] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
[109] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng, and Shiyue Zhang. Memory visualization for gated recurrent neural networks in speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2736–2740, March 2017.
[110] Tractica. Artificial intelligence software market to reach $105.8 billion in annual worldwide revenue by 2025. https://www.tractica.com/newsroom/press-releases/artificial-intelligence-software-market-to-reach-105-8-billion-in-annual-worldwide-revenue-by-2025/, 2018.
[111] Russell P. Tracy. Deep phenotyping: Characterizing populations in the era of genomics and systems biology. Current Opinion in Lipidology, 19(2):151–157, 2008.
[112] Volker Tresp, J. Marc Overhage, Markus Bundschus, Shahrooz Rabizadeh, Peter A. Fasching, and Shipeng Yu. Going digital: A survey on digitalization and large-scale data analytics in healthcare. Proceedings of the IEEE, 104(11):2180–2206, 2016.
[113] V. N. Slee. The international classification of diseases: Ninth revision (ICD-9). Annals of Internal Medicine, 88(3):424–426, 1978.
[114] Chun-Fu Wang, Jianping Li, Kwan-Liu Ma, Chih-Wei Huang, and Yu-Chuan Li. A visual analysis approach to cohort study of electronic patient records. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 521–528. IEEE, 2014.
[115] Jialei Wang, Mladen Kolar, and Nathan Srebro. Distributed multi-task learning with shared representation. arXiv:1603.02185v1, 2016.
[116] Weiran Wang, Jialei Wang, Mladen Kolar, and Nathan Srebro. Distributed stochastic multi-task learning with graph regularization. CoRR, abs/1802.03830, 2018.
[117] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, 2015.
[118] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
[119] Stephen J. Wright, Robert D. Nowak, and Mário A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
[120] Ting Xiang, Debajyoti Ray, Terry Lohrenz, Peter Dayan, and P. Read Montague. Computational phenotyping of two-person interactions reveals differential neural response to depth-of-thought. PLoS Computational Biology, 2012.
[121] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[122] Liyang Xie, Inci M. Baytas, Kaixiang Lin, and Jiayu Zhou. Privacy-preserving distributed multi-task learning with asynchronous updates. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1195–1204, 2017.
[123] Xiaolin Yang, Seyoung Kim, and Eric P. Xing. Heterogeneous multitask learning with joint sparsity constraints. In Advances in Neural Information Processing Systems, pages 2151–2159, 2009.
[124] Zhi Yang, Yusi Zhang, Binghui Guo, Ben Y. Zhao, and Yafei Dai. Deepcredit: Exploiting user clickstream for loan risk prediction in P2P lending. In Proceedings of the 12th International AAAI Conference on Web and Social Media, 2018.
[125] Yu Zhang, I-Wei Wu, Duygu Tosun, Eric Foster, and Norbert Schuff. Progression of regional microstructural degeneration in Parkinson's disease: A multicenter diffusion tensor imaging study. PLoS ONE, 2016.
[126] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 94–108, 2014.
[127] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3):226–234, June 2016.
[128] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 702–710. Curran Associates, Inc., 2011.
[129] Jiayu Zhou, Jianhui Chen, and Jieping Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011.
[130] Jiayu Zhou, Zhaosong Lu, Jimeng Sun, Lei Yuan, Fei Wang, and Jieping Ye. FeaFiner: Biomarker identification from medical data through feature generalization and selection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1034–1042, 2013.
[131] Jiayu Zhou, Fei Wang, Jianying Hu, and Jieping Ye. From micro to macro: Data driven phenotyping by densification of longitudinal electronic medical records. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144, 2014.
[132] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 814–822, 2011.
[133] Xiaoqiang Zhou, Baotian Hu, Qingcai Chen, and Xiaolong Wang. An auto-encoder for learning conversation representation using LSTM. In Proceedings of the 22nd International Conference on Neural Information Processing, pages 310–317, 2015.