STATISTICAL AND COMPUTATIONAL METHODS FOR BIOLOGICAL DATA

By

Yuning Hao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics — Doctor of Philosophy
Computational Mathematics, Science and Engineering — Dual Major

2019

ABSTRACT

STATISTICAL AND COMPUTATIONAL METHODS FOR BIOLOGICAL DATA

By

Yuning Hao

This dissertation develops machine learning and statistical methods for biological data. In immunotherapy, gene-expression deconvolution is used to quantify different types of cells in a mixed population. It provides a highly promising way to rapidly characterize the tumor-infiltrating immune landscape and identify cold cancers. A major challenge, however, is that gene-expression data are frequently contaminated by many outliers that decrease estimation accuracy. It is therefore imperative to develop a robust deconvolution method that automatically decontaminates the data by reliably detecting and removing outliers. We developed an algorithm, the adaptive Least Trimmed Square (aLTS), that identifies outliers in regression models, allowing us to effectively detect and remove them and to obtain robust estimates of the coefficients. We also include theoretical results that guarantee its convergence and parameter recovery.

A second topic is the association of phenotype responses with intricate patterns in the transcription factor binding sites of DNA sequences. To address this question, we developed a deep learning framework. It uses convolution and pooling layers to capture regulatory motifs, and position embedding together with multi-head self-attention layers to model long-range dependencies among motifs. We further improved the model's overall performance by integrating transfer learning and multi-task learning. Finally, to identify confirmed and novel transcription factor binding motifs (TFBMs), along with their internal relationships, we provide interpretations of our DNA quantification model.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor and committee chair, Professor Yuying Xie. He encouraged me and led me into the area of statistical machine learning. This dissertation would not have been possible without his excellent guidance and continuous support during my PhD research. I would also like to thank my committee members, Professor Yuehua Cui, Professor Grace Hong, and Professor Ming Yan, for their insightful comments and feedback on my research. Thanks to my dear friends, Hongnan Wang, Ningyu Sha, Jieqian He, Binbin Huang, Qi Liang, and Minzhi Liu, for their constant help and company; I greatly value their friendship, and they are treasures I will keep for my whole life. Most importantly, I would like to thank my parents, Jianhua Hao and Peilan Ning, for loving me so much and supporting me at every stage of my life. Many thanks to my boyfriend Yang and my little friend Lucky for always standing by my side and being my constant source of strength.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Overview
Chapter 2  Adaptive Least Trimmed Square
  2.1 Introduction
  2.2 Model formulation
  2.3 Convergence property and recovery guarantee
  2.4 Parameter tuning for ALTS
  2.5 Simulation study
  2.6 Discussion
Chapter 3  Fast and Robust Deconvolution of Tumor Infiltrating Lymphocyte from Expression Profiles
  3.1 Introduction
  3.2 Materials and methods
  3.3 Evaluation metrics
  3.4 Simulation study
    3.4.1 In silico simulation with varied error types
    3.4.2 In silico simulation based on leukocyte gene signature matrix file
    3.4.3 Synthetic dataset
    3.4.4 Synthetic dataset with added unknown content
  3.5 Real-data application
    3.5.1 Deconvolution performance on immune-cell-rich datasets
    3.5.2 Deconvolution performance on RNA-seq datasets
  3.6 Survival analysis
    3.6.1 Lung Squamous Cell Carcinoma and Ovarian Serous Cystadenocarcinoma datasets
  3.7 Discussion
Chapter 4  Deep learning framework on DNA quantification
  4.1 Materials and Methods
    4.1.1 Features and data
    4.1.2 The DeepCAT model design
    4.1.3 Training of the model
    4.1.4 Transfer learning
    4.1.5 Clustered multi-task learning
    4.1.6 Interpreting the model
  4.2 Results
    4.2.1 Random initialization for convolution kernel weight
    4.2.2 Transfer learning with human data
    4.2.3 Transfer learning with known motif database
    4.2.4 Performance of clustered multi-task learning
    4.2.5 Transcription factor binding motifs and their interactions
Chapter 5  Conclusion and Future work
APPENDICES
  APPENDIX A  Supplementary text for Chapter 2
  APPENDIX B  Supplementary text for Chapter 3
  APPENDIX C  Supplementary text for Chapter 4
BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Tuned k for ALTS with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

Table 2.2: Parameters of ALTS. To show that ALTS is not sensitive to different values of α1 and α2 within an appropriate range given a tuned value of k, we simulated a dataset with sample size n = 500, p = 20 predictors, normally distributed errors, and 20% outliers using the setting of Section 2.5. We then ran ALTS as follows and recorded the number of detected outliers and the true and false positive rates: (1) take α2 = 1.1, change α1 from 0.1 to 0.5 by 0.05, and tune k using BIC*; (2) take α1 = 0.1, change α2 from 1.1 to 2 by 0.1, and tune k using BIC*; (3) take α1 = 0.1, α2 = 1.5, and change k from 1 to 10 by 0.1. The accuracy of the result stays stable with a well-tuned k.

Table 3.1: Performance of the FARDEEP, CIBERSORT, NNLS, PERT, and DCQ methods under different simulation settings with outliers.

Table 3.2: The outlier genes detected by FARDEEP and their average expression values in 7 ovarian cancer cell lines for the TCGA OV dataset. Listed are all genes with a removal frequency above 50% among the 514 samples.

Table B.1: Results from CIBERSORT and FARDEEP on two mixtures (MIX1 and MIX2) with HCT116 and MDA-MB-435, respectively. FARDEEP (relative prop) was obtained by normalizing the default outputs to sum to 1.

Table B.2: Tuned k for FARDEEP with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

Table B.3: Correlated errors simulated with different proportions of outliers for FARDEEP, with true and false positive rates computed to evaluate the tuning result.

LIST OF FIGURES

Figure 2.1: Comparing the performance of ALTS and LTS. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) with both normally distributed and heavy-tailed random errors. The two approaches are compared by the sum of squared errors of their estimated coefficients.

Figure 3.1: SSE of coefficients for different approaches. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure 3.2: Estimation accuracy of different deconvolution approaches (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β. (A) Random error with standard normal distribution; (B) random error with t-distribution.
Figure 3.3: Applying different deconvolution approaches to the gene expression data of IM-9, Jurkat, Raji, and THP-1 and to mixtures of these four immune cell lines with known proportions (MixA, MixB, MixC, MixD). All mixtures were performed and measured in triplicate. (A) SSE of coefficients for FARDEEP, CIBERSORT, NNLS, PERT, and DCQ. (B) Abundance of cell lines estimated by the different deconvolution approaches vs. the abundance of cell lines truly mixed; the R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. Correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.4: Performance comparison between CIBERSORT and FARDEEP on mixtures with unknown content. (A) SSE for various noise levels and tumor-content abundances on 4 different mixtures. (B) Abundance of cell lines estimated by CIBERSORT and FARDEEP vs. the abundance of cell lines truly mixed; the R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. Correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.5: Performance assessment on Microarray data with SSE, R², and R. The follicular lymphoma dataset is evaluated for (A) overall performance and (B) individual cell subsets. (C) Normal tonsil dataset with purified B cells and T cells. Blood samples from pediatric renal transplant patients are evaluated for (D) overall performance and (E) individual cell subsets. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.6: Gene-expression deconvolution performance of FARDEEP and CIBERSORT on RNA-seq data. (A) Overall and (B) individual cell subset results for 8 PBMC samples collected from two vaccinated donors at different time points. (C) Overall and (D) individual cell subset results for lymph node bulk samples. The correlation coefficient R and the corresponding p-value are missing for CIBERSORT on NK cells because the CIBERSORT estimates are all zero. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.7: Kaplan-Meier survival curves based on ESTIMATE-, FARDEEP-, and CIBERSORT-assisted TIL profiling. The log-rank test was applied to data classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subsets (CIBERSORT and FARDEEP) were above or below the median. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.8: Kaplan-Meier survival curves based on CIBERSORT (absolute mode)-assisted TIL profiling. Patients were classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subsets were above or below the median; the log-rank test was applied to obtain the p-value. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.
Figure 3.9: FARDEEP scores for the TCGA OV and LUSC datasets, calculated as the summation of the 22 TIL subset scores, show highly negative correlations with the consensus measurement of purity estimations (CPE).

Figure 4.1: Model architecture. The convolution layer acts as a motif scanner and produces an output of dimension (sequence length − kernel size + 1) × (number of kernels). The max-pooling layer reduces the output dimension by extracting the maximum value within each pooling kernel. The special token and position encoding matrix are then added to the output. Two Transformer encoder layers consider the interactions between the tokens of the output sequence from the previous layers and re-interpret it. Finally, a sigmoid nonlinear transformation produces the prediction scores.

Figure 4.2: Performance comparison among DeepCAT, DanQ, SVM, and Random Forest through (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses.

Figure 4.3: The effect of applying pre-trained weights from human data to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The line y = x represents equality between randomly initialized DeepCAT and DeepCAT with transfer learning based on human data.

Figure 4.4: The effect of applying known TFBMs to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The line y = x represents equality between randomly initialized DeepCAT and DeepCAT with transfer learning based on the known TFBM database.

Figure 4.5: Hierarchical clustering with Jaccard's distance for 57 stress responses.

Figure 4.6: Performance comparison among DanQ, SVM, Random Forest, DeepCAT, DeepCAT with known TFBMs, and multi-task DeepCAT with known TFBMs through (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses.

Figure 4.7: Three examples of TFBMs aligned by the TOMTOM algorithm. The upper figures are the known TFBMs and the lower ones are generated from convolution kernel weights trained by DeepCAT. The p-values are 7.46e−11, 1.22e−13, and 3.27e−11, respectively.

Figure 4.8: Three examples of identified novel TFBMs. The upper figures come from the initialized random motifs and the lower ones are generated from convolution kernel weights trained by DeepCAT.

Figure B.1: Effect of Z-score normalization. The red dashed line indicates the true absolute abundance of each cell. SVR is common support vector regression without normalization, and SVR.znormalization is the method used in CIBERSORT.
Figure B.2: The y-axis shows the sorted expression values of the HCT116 and MDA-MB-435 cell lines for 584 probesets. Probes detected as outliers by FARDEEP are marked in red.

Figure B.3: SSE of coefficients for different approaches based on simulations with X-related outliers. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure B.4: Estimation accuracy of different deconvolution approaches for simulations with X-related outliers (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure B.5: SSE of coefficients for different approaches based on simulations with correlated random errors. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP.

Figure B.6: Estimation accuracy of different deconvolution approaches for simulations with correlated random errors (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β.

Figure C.1: Sequence logo figures validating the transformation between convolution kernels and motifs. (A) Original TFBMs; (B) converted TFBMs.

Chapter 1

Overview

The analysis of biological data relies heavily on machine learning and statistical methods. In immunotherapy, gene-expression deconvolution is extensively used to quantify the different types of cells in a mixed population. Rapidly emerging evidence suggests that the tumor immune microenvironment not only predisposes cancer patients to diverse treatment outcomes but also represents a promising source of biomarkers for better patient stratification. Unlike the immunohistochemistry-based scoring practice, which focuses on a few selected marker proteins, immune deconvolution pipelines offer a previously untapped means to comprehensively reveal the tumor-infiltrating immune landscape. However, outliers frequently contaminate gene expression data and decrease the accuracy of the estimates. These outliers are genes with abnormal expression values, which may be caused by measurement error, environmental effects, expression from non-immune cells, or natural fluctuations in certain types of immune cells. Thus, to automatically detect and remove such outliers, we developed a robust deconvolution method, the adaptive Least Trimmed Square (aLTS), for data decontamination.
This addresses a key shortcoming of the classical Least Trimmed Square, which relies on a roughly pre-estimated corruption ratio for outlier elimination: although that choice yields a high breakdown point for the regression model, it may cause a significant loss of data. Our theoretical results establish the convergence property and a parameter recovery guarantee.

Applying aLTS to gene-expression deconvolution involves three modifications. First, because cell abundances are always non-negative, we replace ordinary least squares (OLS) with non-negative least squares (NNLS). Second, we solve the NNLS problem with the Lawson-Hanson algorithm (Lawson and Hanson, 1995). Third, we alter the tuning (penalty) parameters to make the method appropriate for log-normally distributed errors. We named the resulting machine learning tool for gene-expression deconvolution FARDEEP (Fast And Robust DEconvolution of Expression Profiles); it itemizes the immune cell subsets present in whole tumor tissue samples. In numerical simulations and real datasets, FARDEEP provides better approximations of the coefficients than current methods and is less vulnerable to outliers. In addition to relative percentages, the estimates produced by FARDEEP reflect the absolute quantity of each immune cell subset. FARDEEP thus serves as a novel robust algorithm that supplements the current toolkit for characterizing the tissue-infiltrating immune cell landscape.

Another area we tackled is the identification of complex patterns in the transcription factor binding sites of DNA sequences, together with the investigation of their association with phenotype responses. Here, the emphasis of our study is to predict gene expression levels in plants under different stress environments using information extracted from transcription factor binding motifs (TFBMs). In contrast to traditional machine learning approaches, which require manual feature engineering, we developed a deep learning model, DeepCAT, that generates predictions of gene expression directly from raw DNA sequence information. Moreover, in addition to motif scanning with convolution and pooling layers, DeepCAT takes the long-range interactions between motifs into account using multi-head attention layers. In this way, our model can be widely applied to learn phenotype-related motifs, along with their interactions, regardless of sequence length.

Chapter 2

Adaptive Least Trimmed Square

2.1 Introduction

Robust regression and outlier detection methods aim to obtain reliable information from contaminated data, since outliers are pervasive in most datasets: Hampel et al. (2005) suggest that a routine dataset usually contains 1% to 10% outliers. Outliers are abnormal points that come from other populations whose distributions differ from that of the original dataset, and they can seriously influence the results of a study. For example, ordinary least squares (OLS), the most common statistical modeling approach, is very sensitive to outliers: it can be broken down by a very small group of abnormal points.
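To make this sensitivity concrete, the following toy sketch (ours, for illustration only; all variable names are hypothetical) fits OLS to a clean sample and to the same sample with a single corrupted response:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)            # true slope is 2

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

y_bad = y.copy()
y_bad[0] = 1e4                               # corrupt a single response
beta_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)

print(beta_clean)  # intercept near 0, slope near 2
print(beta_bad)    # both estimates badly distorted by one outlier
```

A single corrupted observation out of one hundred is enough to move the fitted line far from the truth, which motivates trimming-based estimators such as the one discussed next.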
To address this problem, Rousseeuw (1984) and Rousseeuw and Leroy (1987) proposed the Least Trimmed Square (LTS) framework, which finds the h observations with the smallest residuals and estimates the coefficients on that subset of the data,

  \hat{\beta} = \arg\min_{\beta} \sum_{i \in S} (y_i - x_i \beta)^2,

where S indexes the h observations with the smallest squared residuals. However, LTS is known to be an NP-hard problem (Bernholt, 2006), and many studies have reformulated it and proposed effective solution methods (Rousseeuw and Driessen, 2006; Nguyen and Welsch, 2010; Alfons et al., 2013; Shen et al., 2013; Bhatia et al., 2015; Shen and Sanghavi, 2019). Nevertheless, most of these approaches require a fixed ratio of outliers that must be roughly estimated at the outset. This can give a suboptimal fit or an inaccurate estimate of the outliers: using the information in only a fixed proportion of the data reduces the power of the estimator, because the true amount of contamination cannot be presumed and may be either small or large. On the other hand, several studies used soft thresholding to obtain parameter recoverability from densely corrupted observations (Wright and Ma, 2009; She and Owen, 2011; Nguyen and Tran, 2011). However, some of these methods lack theoretical guarantees, and others impose severe restrictions on the data.

To improve on the above approaches, we propose in this study an Adaptive Least Trimmed Square (ALTS) that automatically adjusts the estimated number of outliers at each iteration based on the information in the residuals. It both identifies outliers and produces robust estimates of the coefficients, and we provide theoretical guarantees for its convergence and parameter recovery properties.

2.2 Model formulation

We consider the linear regression model

  y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),

where X ∈ R^{n×p} is the fixed predictor matrix, y ∈ R^n is the observed response vector, β ∈ R^p is the coefficient vector, and ε ∈ R^n is a vector of random errors with zero mean and variance σ²I. To incorporate outliers, we propose the following model:

  y = X\beta + \gamma + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),  (2.1)

where γ = (γ_1, ..., γ_n)′ is a sparse vector in R^n, with γ_i ≠ 0 indicating that the ith observation is an outlier. Under formulation (2.1), we want to estimate β with all outliers removed, so that

  \hat{\beta} = (X_{S^*}^\top X_{S^*})^{-1} X_{S^*}^\top y_{S^*},

the OLS solution computed on the subset S*, where S* is the set of good points, i.e., those with zero entries in γ. Let H = X (X_{S^*}^\top X_{S^*})^{-1} X_{S^*}^\top be the projection matrix; the residuals r = (r_1, ..., r_n) can then be written as

  r = y - X\hat{\beta} = \gamma + \varepsilon - H\varepsilon_{S^*}.  (2.2)

From (2.2), a residual r_i whose corresponding γ_i ≠ 0 deviates from zero, which suggests that the set of outliers can be identified by thresholding:

  E = \{ i : |r_i| > k \cdot r_{\mathrm{med}} \},  (2.3)

where E is the set of detected outliers, k is a tuning parameter controlling the sensitivity of the model, and r_med is the median of {|r_i|}, i = 1, ..., n.
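As an illustration, the thresholding rule (2.3) takes only a few lines of numpy; this is a hedged sketch with hypothetical names, not a packaged implementation:

```python
import numpy as np

def detect_outliers(X, y, k):
    """Return indices flagged by rule (2.3): |r_i| > k * median(|r|)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # plain OLS fit
    r = np.abs(y - X @ beta)                       # absolute residuals
    return np.where(r > k * np.median(r))[0]
```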
Let N* denote the number of true outliers in the data, i.e., the cardinality of {i : γ_i ≠ 0}. First, we can fit least squares and apply rule (2.3) to obtain a rough estimate of the set of outliers; denote its cardinality by N̄. Since the model at this stage is still contaminated by outliers, N̄ overestimates N*, and we use it to form an underestimate N = ⌈α1 N̄⌉ with α1 ∈ (0, 1). With N in hand, we can then update the least squares fit after removing the N samples with the largest absolute residuals, obtaining an improved estimate of E and the corresponding N̄. Repeating this procedure improves the model, but we must also inflate the underestimate N by a factor α2 > 1 at each iteration to force N and N̄ to converge.

In sum, we initialize the algorithm with the OLS solution,

  \hat{\beta}^{(0)} = (X^\top X)^{-1} X^\top y, \qquad r^{(0)} = y - X\hat{\beta}^{(0)}.

For the jth iteration, we update N̄^(j) by

  \bar{N}^{(j)} = \bigl| \{ i : |r_i^{(j)}| > k \cdot r_{\mathrm{med}}^{(j)} \} \bigr|, \quad j = 0;
  \bar{N}^{(j)} = \min\bigl( \bigl| \{ i : |r_i^{(j)}| > k \cdot r_{\mathrm{med}}^{(j)} \} \bigr|,\; \bar{N}^{(j-1)} \bigr), \quad j \ge 1,  (2.4)

where the min(·,·) operator guarantees that N̄^(j), an overestimate of N*, is non-increasing. Similarly, we update N^(j) through

  N^{(j)} = \lceil \alpha_1 \bar{N}^{(j)} \rceil, \quad j = 0;
  N^{(j)} = \min\bigl( \lceil \alpha_2 N^{(j-1)} \rceil,\; \bar{N}^{(j)} \bigr), \quad j \ge 1,  (2.5)

where ⌈x⌉ denotes the ceiling of x ∈ R, α1 ∈ (0, 1) is used to obtain a lower bound for N* in the first step, α2 > 1 guarantees the monotonicity of N^(j), and the min(·,·) operator guarantees that N^(j) stays below N̄^(j). We then update β̂ and r after removing the N^(j) likely outliers:

  \hat{\beta}^{(j+1)} = (X_{S_j}^\top X_{S_j})^{-1} X_{S_j}^\top y_{S_j}, \qquad r^{(j+1)} = y - X\hat{\beta}^{(j+1)},

where X_{S_j} ∈ R^{(n−N^(j))×p} and y_{S_j} ∈ R^{n−N^(j)} are the submatrix and subvector of X and y with row indices in S_j = {π(i) : i ≤ n − N^(j)}; here π is a permutation on [n] such that |r_{π(1)}^(j)| ≤ |r_{π(2)}^(j)| ≤ ··· ≤ |r_{π(n)}^(j)|. We repeat this procedure until N and N̄ converge, and the elements of the set Ê at that point are the final estimated outliers. Hence, we hereby report a novel approach, coined Adaptive Least Trimmed Square (ALTS), that automatically detects and removes contaminating outliers.
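The complete iteration can be sketched as follows; this is a minimal numpy rendition of (2.4)-(2.5) under the default α1 = 0.1 and α2 = 1.5, with hypothetical names, and is not the reference implementation:

```python
import numpy as np
from math import ceil

def alts(X, y, k, alpha1=0.1, alpha2=1.5, max_iter=100):
    """Adaptive Least Trimmed Square: a sketch of iterations (2.4)-(2.5)."""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS initialization
    r = np.abs(y - X @ beta)
    N_bar = int(np.sum(r > k * np.median(r)))           # overestimate of #outliers
    N = ceil(alpha1 * N_bar)                            # deliberate underestimate
    for _ in range(max_iter):
        keep = np.argsort(r)[: n - N]                   # drop N largest residuals
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        r = np.abs(y - X @ beta)
        N_bar = min(int(np.sum(r > k * np.median(r))), N_bar)   # update (2.4)
        N_next = min(ceil(alpha2 * N), N_bar)                   # update (2.5)
        if N_next == N:                                 # N has met N_bar
            break
        N = N_next
    outliers = np.where(r > k * np.median(r))[0]
    return beta, outliers
```

Because α2 > 1 strictly inflates N while (2.4) keeps N̄ non-increasing, the stopping test N_next == N fires exactly when the two estimates meet.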
In the next section, we provide theoretical results on the convergence property and the parameter recovery guarantee of ALTS.

2.3 Convergence property and recovery guarantee

The proof of convergence relies on two properties of the algorithm: at each iteration, the OLS solution is optimal on the current active subset, and the points selected for the next active subset are those with the smallest residuals. We first show that the objective function decreases at each iteration.

Theorem 2.3.1. The Adaptive Least Trimmed Square (ALTS) iteration sequence satisfies L(β^(j+1)) ≤ L(β^(j)) for any j ≥ 0, with objective function

  L(\beta) = \sum_{i=1}^{n} (y_i - x_i \beta)^2 \, \mathbf{1}\{\gamma_i = 0\}.

In addition, the algorithm is fast and guaranteed to converge within finitely many steps, as summarized in the following theorem.

Theorem 2.3.2. The Adaptive Least Trimmed Square (ALTS) stops in no more than j* steps, where

  j^* = \left\lfloor \frac{-\log \alpha_1}{\log \alpha_2} \right\rfloor.

Here ⌊x⌋ denotes the largest integer less than or equal to x.

To further show that ALTS has a linear convergence rate together with a parameter recovery guarantee, we first define the subset strong convexity and smoothness constants of X, following Bhatia et al. (2015).

Definition 2.3.1. For a matrix X ∈ R^{n×p}, its subset strong convexity constant c⁻_τ and subset smoothness constant c⁺_τ at level τ satisfy

  \min_{S \in \mathcal{S}_\tau} \lambda_{\min}(X_S^\top X_S) \ge c^-_\tau, \qquad \max_{S \in \mathcal{S}_\tau} \lambda_{\max}(X_S^\top X_S) \le c^+_\tau,

where S_τ is the collection of all subsets of size ⌊nτ⌋ and 0 ≤ τ ≤ 1.

Theorem 2.3.3. Suppose the corruption rate at the jth iteration of ALTS is ρ_j, the outlier vector satisfies ‖γ‖₀ ≤ nρ*, and ρ₀ > ρ*. Let c⁻_τ and c⁺_τ be the subset strong convexity and smoothness constants of X at level τ, and assume that

  \eta = 4 \sqrt{ c^+_{\rho_0} / c^-_{1-\rho_0} } < 1.

Then if log_{α2}(1/α1) = O( log_η ( ϵ√n / ‖γ_{S₀}‖₂ ) ), we have

  \| \beta^* - \hat{\beta}_{\mathrm{ALTS}} \|_2 \le \epsilon + \frac{C}{\sqrt{n}} \|\varepsilon\|_2,

for any tolerance ϵ > 0 and some constant C.

Corollary 2.3.1. Suppose the assumptions of Theorem 2.3.3 hold and the random noise is Gaussian, i.e., ε ~ N(0, σ²I). Then when n > log(1/δ) for some δ > 0, with probability at least 1 − δ,

  \| \beta^* - \hat{\beta}_{\mathrm{ALTS}} \|_2 \le \epsilon + C\sigma,

for ϵ > 0 and some constant C.

Proofs of the above theorems and corollary, which follow Bhatia et al. (2015), are provided in Appendix A. These theoretical properties show that three parameters influence the performance of ALTS; we discuss their tuning in the next section.

2.4 Parameter tuning for ALTS

There are three tuning parameters in ALTS: k, α1, and α2. As introduced in the last section, relatively small values of α1 and α2 are preferred for accurate estimation. In practice, ALTS is not very sensitive to α1 and α2 when they are taken from an appropriate range, so we set them to 0.1 and 1.5 by default. However, k controls the number of detected outliers and is critical to the performance of ALTS; we therefore tune k on a case-by-case basis for each dataset to preserve meaningful information. The effects of the different tuning parameters are shown in Table 2.2. Because a held-out test fold may itself contain outliers that distort the tuning result, cross-validation is not advised. Instead, we apply a modified Bayesian Information Criterion (BIC), following the setting of She and Owen (2011):

  \mathrm{BIC}^*(k) = m \log\!\left( \frac{ \sum_{i=1}^{n} \mathbf{1}\{i \notin \hat{E}\} (y_i - \hat{y}_i)^2 }{ m } \right) + b(\log m + 1),  (2.6)

where Ê is the set of estimated outliers, b = N̂ + p + 1 is the number of parameters with N̂ = |Ê| the number of detected outliers, and m = n − N̂. We then choose the value of k associated with the smallest BIC*.

2.5 Simulation study

To test the performance of ALTS, we simulated several datasets with different proportions of outliers, following the settings of She and Owen (2011) and Alfons et al. (2013). The observations were generated from the linear regression model (2.1). The predictor matrix is X = (x_1, ..., x_n)′ = UΣ^{1/2}, where U_ij ~ U(0, 20) and Σ_ij = e^{1{i≠j}} with e = 0.5. Considering outlier proportions f ∈ {5%, 10%, 20%, 30%}, sample size n = 500, and p = 20 predictors, we randomly added outliers to the simulated data as follows:

i) Vertical outliers: we generated an n-dimensional zero vector τ and randomly selected nf of its elements to be outliers drawn from a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of 30.

ii) Leverage points: we took 20% of the contaminated observations in i) as leverage points, replacing the corresponding predictors with samples from N(2 max(X), 1).

The coefficients β_j were sampled from U(0, 1) for j = 1, ..., p, and we simulated both normally distributed and heavy-tailed random noise ε. Based on this framework, the dependent variable was obtained as

  y = X\beta + \tau + \varepsilon.
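This data-generating process can be reproduced with a short script. The sketch below uses a Cholesky factor as one choice of the matrix square root Σ^{1/2} and scipy's non-central t-distribution; it is an assumption-laden illustration, not the exact code behind the reported results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, f, e = 500, 20, 0.20, 0.5

Sigma = np.full((p, p), e)
np.fill_diagonal(Sigma, 1.0)                    # Sigma_ij = e^{1{i != j}}, e = 0.5
U = rng.uniform(0, 20, (n, p))
X = U @ np.linalg.cholesky(Sigma).T             # one choice of square root

beta = rng.uniform(0, 1, p)
tau = np.zeros(n)
idx = rng.choice(n, int(n * f), replace=False)
tau[idx] = stats.nct.rvs(1, 30, size=idx.size, random_state=2)  # vertical outliers

x_max = X.max()                                 # computed before contamination
lev = rng.choice(idx, int(0.2 * idx.size), replace=False)       # leverage points
X[lev] = rng.normal(2 * x_max, 1.0, (lev.size, p))

eps = rng.normal(0, 1, n)                       # or a heavy-tailed alternative
y = X @ beta + tau + eps
```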
We simulated each setting 50 times and compared our estimation results with the classical Least Trimmed Square. Figure 2.1 shows that ALTS outperforms LTS in most cases, since the corruption rate is adjusted automatically by the algorithm for each dataset. For the ALTS results, the parameter k was tuned by the modified BIC in (2.6); we find that k decreases as the amount of outliers grows, while the true positive rates always stay around 1, indicating that the tuning of k is highly effective (Table 2.1).

Table 2.1: Tuned k for ALTS with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

  Percentage of outliers        5%      10%     20%     30%
  Normal
    True positive rate          1       1       1       1
    False positive rate         0.005   0.009   0.02    0.05
    Parameter (mean of k)       3.63    2.43    1.46    1.19
  Heavy-tailed
    True positive rate          1       1       1       1
    False positive rate         0.05    0.05    0.08    0.11
    Parameter (mean of k)       3.03    2.06    1.35    1.14

Figure 2.1: Comparing the performance of ALTS and LTS. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) with both normally distributed and heavy-tailed random errors. The two approaches are compared by the sum of squared errors of their estimated coefficients.

2.6 Discussion

Least Trimmed Square is a highly robust regression method, and Visek (2006a,b) showed that its solution is guaranteed to be consistent. However, it is computationally very expensive, especially for large datasets, and in practice polynomial-time algorithms are preferred. To the best of our knowledge, most efficient algorithms based on the idea of LTS are hard to prove statistically consistent; the same holds for ALTS, although we established its parameter recovery guarantee in the previous section. Additionally, although a high breakdown point improves the regressor's tolerance to bad observations, it may cause a loss of information when the data are not heavily contaminated. A fixed outlier ratio is unhelpful for outlier detection across different datasets: a roughly estimated ratio may inflate either the false positive rate or the false negative rate. For example, in gene-expression deconvolution, since the signature matrices are measured in a pure environment, normal tissue usually contains far fewer outliers than tumor tissue, and biologically meaningful genes would be removed if the outlier ratio were set the same for both cases. We therefore proposed ALTS, which is computationally efficient and automatically detects and removes outliers before estimating the coefficients. It can be applied in many areas with different types of data; we show a further application to cancer data in the next chapter.
Table 2.2: Parameters of ALTS. To show that ALTS is not sensitive to different values of α1 and α2 within an appropriate range given a tuned value of k, we simulated a dataset with sample size n = 500, p = 20 predictors, normally distributed errors, and 20% outliers using the setting of Section 2.5. We then ran ALTS as follows and recorded the number of detected outliers and the true and false positive rates: (1) take α2 = 1.1, change α1 from 0.1 to 0.5 by 0.05, and tune k using BIC*; (2) take α1 = 0.1, change α2 from 1.1 to 2 by 0.1, and tune k using BIC*; (3) take α1 = 0.1, α2 = 1.5, and change k from 1 to 10 by 0.1 (the tuned value by BIC* is k = 1.8). The accuracy of the result stays stable with a well-tuned k.

(1) α1 from 0.1 to 0.5 by 0.05 (α2 = 1.1):
  α1                    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50
  Number of outliers    101     102     100     100     100     101     101     103     102
  True positive rate    1       1       1       1       1       1       1       1       1
  False positive rate   0.0025  0.0050  0.0000  0.0000  0.0000  0.0025  0.0025  0.0075  0.0050

(2) α2 from 1.1 to 2.0 by 0.1 (α1 = 0.1):
  α2                    1.1     1.2     1.3     1.4     1.5     1.6     1.7     1.8     1.9     2.0
  Number of outliers    107     101     101     101     101     107     101     101     101     107
  True positive rate    1       1       1       1       1       1       1       1       1       1
  False positive rate   0.0175  0.0025  0.0025  0.0025  0.0025  0.0175  0.0025  0.0025  0.0025  0.0175

(3) k from 1 to 10 by 0.1 (α1 = 0.1, α2 = 1.5); the columns up to the tuned value k = 1.8 are shown:
  k                     1.0     1.1     1.2     1.3     1.4     1.5     1.6     1.7     1.8
  Number of outliers    249     220     199     176     156     136     115     107     101
  True positive rate    1       1       1       1       1       1       1       1       1
  False positive rate   0.3725  0.3000  0.2475  0.1900  0.1400  0.0900  0.0375  0.0175  0.0025

For k > 1.8 the false positive rate is identically 0, while the number of detected outliers, and hence the true positive rate, declines gradually (from 95 detections at k = 1.9 down to a single detection near k = 10).

Chapter 3

Fast and Robust Deconvolution of Tumor Infiltrating Lymphocyte from Expression Profiles

3.1 Introduction

Immune checkpoint blockade has revolutionized the rational design of neoadjuvant cancer therapies. Compelling evidence suggests that a favorable tumor immune microenvironment underpins better clinical responses to radiotherapy, chemotherapy, and immunotherapy (Deng et al., 2014; Nagarsheth et al., 2017; Corrales et al., 2016a). Immunohistochemistry (IHC)-based immunoscores, which quantify the number of CD8+ cytotoxic T lymphocytes and CD45RO+ memory T cells, show better prognostic potential than conventional pathological methods in colon cancer patients (Galon et al., 2006; Mlecnik et al., 2016). Hence, harnessing the composition of intra-tumoral immune cell infiltration is a highly promising approach to stratify tumors (Balermpas et al., 2014, 2016; Nguyen et al., 2016; Pages et al., 2009; Wolf et al., 2015; Lei et al., 2016). The current IHC immunoscoring approach has two limitations. First, the interpretation of immune cell subsets varies among pathologists and institutions, so the scoring practice lacks a consistent standard. Second, only a limited number of biomarkers can be assessed simultaneously, which prevents a comprehensive annotation of the immune contexture in the tumor microenvironment (TME). Hence, robust methods for genome-data-informed cell type quantitation are urgently needed.

Immunogenomics presents an unprecedented opportunity to resolve the intra-tumoral immune landscape. Cell type deconvolution using leukocyte signature gene expression profiling is a highly promising approach to estimate the global immune cell composition from whole tumor gene expression data (Newman et al., 2015; Aran et al., 2017; Finotello et al., 2017; Tosolini et al., 2016, 2017; Vallania et al., 2018).
However, a significant technical obstacle is that the efficacy and accuracy of gene expression deconvolution are limited by the large number of outliers frequently observed in tumor gene expression datasets (Jiang et al., 2004). The first step towards enhancing gene deconvolution algorithms overall is to improve methods for outlier identification and processing. These outliers include genes with abnormal expression values, which may be caused by measurement error, environmental effects, expression from non-immune cells, or natural fluctuations in certain types of immune cells. Notably, the current immune deconvolution gene signature matrix relies on profiling genes that are differentially expressed among the immune subsets; frequent contamination by transcript reads from cancer cells may therefore significantly bias the algorithms.

In this study, we report a novel FAst and Robust DEconvolution of Expression Profiles (FARDEEP) method that significantly improves the estimation of the coefficients. Let y_i be the observed expression value of the ith gene; x_i, a p-dimensional vector, the expected expression of the ith gene in each of the p cell types; and X = [x_1, ..., x_n]′ the signature matrix. The gene-expression deconvolution problem can then be formulated as

  y_i = x_i' \beta + \varepsilon_i,  (3.1)

where β ∈ R^p is an unknown parameter vector corresponding to the compositions of the p cell types, and ε_i is a noise term with mean 0.
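For reference, the non-negative baseline fit for (3.1) is available directly in scipy; a minimal sketch with hypothetical variable names:

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve_nnls(X, y):
    """Fit y ~ X @ beta with beta >= 0 via the Lawson-Hanson algorithm."""
    beta, _residual_norm = nnls(X, y)
    return beta

# X: n_genes x p signature matrix; y: mixture expression vector
# beta_hat = deconvolve_nnls(X, y)
```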
Based on 17 the CIBERSORT framework, several extensions have been proposed to overcome limitations such as platform inconsistency between signature and mixture matrices and low estimation accuracy for γδ T cell (Tosolini et al., 2017; Vallania et al., 2018; Tosolini et al., 2016). However, the quantitative information of cell proportions of these two approaches is built on CIBERSORT whose performance may be challenged by frequent outliers in whole tumor tissue transcriptomes. To reduce the dependence on the signature matrix, xCell utilizes the concept of single-sample gene set enrichment analysis (ssGSEA) to calculate an immune cell score which could predict the enrichment of immune cells (Aran et al., 2017). Despite its robustness, xCell relies much on the ranking of gene expression value which leads to subop- timal solution for the estimation accuracy. Overall, a robust method that determines both the distribution and absolute volume of tumor-infiltrating lymphocytes (TILs) will further improve the current gene deconvolution pipeline. To handle the heavily contaminated gene expression data and provide absolute cell abun- dance estimation, we developed a robust method based on the Least Trimmed Square (LTS) framework (Rousseeuw, 1984; Rousseeuw and Leroy, 1987). LTS finds h observations with smallest residuals, and the estimator ˆβ is the least squares fit over these h observations. LTS is an NP-hard problem, and Rousseeuw and Driessen (Rousseeuw and Driessen, 2006) proposed a stochastic FAST-LTS algorithm. Nevertheless, it may give a suboptimal fitting result and get much slower when the sample size and dimension of variables become larger and higher since its accuracy relies on the initial random h-subsets and the number of initial subsets. When n is the sample size and p is the number of coefficients, h is suggested to be the smallest integer that is not less than (n + p + 1)/2 to remove as many outliers as possible while keeping an unbias result. Using the information of only half of the data reduces the power of the estimator because the amount of outliers in the real case cannot be presumed 18 and can be small. Xu et al. (Xu et al., 2017) proposed an adaptive least trimmed square which is not limited to the randomness of initial subset but only applied the binary dataset. In this study, we extend the adaptive least trimmed square to introduce a model-free method, which could find the number of outliers automatically based on LTS. FARDEEP provides a flexible framework which is suitable for both Microarray and RNA-seq data using LM22 and Immunostate signature matrices respectively. As evidence of high fidelity and robustness, FARDEEP exhibits superior performance in simulated and real-world datasets. 3.2 Materials and methods Because the abundance of cell types are always non-negative, we replaced the OLS regression in the ALTS procedure with non-negative least square regression (NNLS). By applying the modified ALTS to the deconvolution model, the objective function at each iteration becomes, ˆβ = argmin β (cid:107)yS − XSβ(cid:107)2 2 , subject to β ≥ 0 (3.2) where S is the subset of genes with smallest residuals, and (3.2) can be solved using Lawson- Hanson algorithm (Lawson and Hanson, 1995). The ˆβ from FARDEEP, denoted as TIL subset score, is the direct estimate of the linear model without any normalization and hence reflects the absolute abundance of TILs. 
In addition, we can derive the relative TIL abundance from the TIL subset scores through

  \tilde{\beta}_j = \frac{ \hat{\beta}_j }{ \sum_{k=1}^{p} \hat{\beta}_k },  (3.3)

where β̂_j is the jth TIL subset score. In practice, the TIL subset score provides important information about a patient's tumor-infiltrating immune landscape, and we include a discussion in Appendix B. For the tuning parameters, we assume that the errors in gene expression datasets follow a log-normal rather than a normal distribution, as suggested by Beal (2017). The modified BIC formula from (2.6) then becomes

  \mathrm{BIC}^*(k) = m \log\!\left( \frac{ \sum_{i=1}^{n} \mathbf{1}\{i \notin \hat{E}\} \log^2 (y_i - \hat{y}_i)^2 }{ m } \right) + b(\log m + 1).  (3.4)

With all the features above, we developed FARDEEP as a robust tool for cellular deconvolution.

3.3 Evaluation metrics

To test the performance of FARDEEP, we compared our approach with existing methods using numerical simulations and real datasets; the outlier genes detected by FARDEEP in the real datasets are listed in the supplementary table. We use the LM22 signature matrix, containing 22 types of hematopoietic cells, for Microarray data, and the quanTIseq signature matrix, containing 10 immune cell types, for RNA-seq data. To compare the performance of the different methods, we report the sum of squared errors (SSE), the coefficient of determination, denoted R-squared (R²), and the Pearson correlation (R), defined as follows:

  \mathrm{SSE} = \sum_{j=1}^{p} (\beta^*_j - \hat{\beta}_j)^2,

  R^2 = 1 - \sum_{j=1}^{p} (\beta^*_j - \hat{\beta}_j)^2 \Big/ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)^2,

  R = \frac{ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)(\hat{\beta}_j - \bar{\hat{\beta}}) }{ \sqrt{ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)^2 \sum_{j=1}^{p} (\hat{\beta}_j - \bar{\hat{\beta}})^2 } },

where β̄* = (1/p) Σ_j β*_j and β̄̂ = (1/p) Σ_j β̂_j are the respective means, β* is the ground truth, and β̂ is the estimate.
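These three metrics are straightforward to compute; a small helper, in the same hedged spirit as the earlier sketches:

```python
import numpy as np

def evaluate(beta_true, beta_hat):
    """Return SSE, R-squared against the line beta_hat = beta_true, and Pearson R."""
    sse = float(np.sum((beta_true - beta_hat) ** 2))
    r2 = 1.0 - sse / float(np.sum((beta_true - beta_true.mean()) ** 2))
    r = float(np.corrcoef(beta_true, beta_hat)[0, 1])
    return sse, r2, r
```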
3.4 Simulation study

3.4.1 In silico simulation with varied error types

To test the robustness of FARDEEP under different error conditions, we simulated datasets with normally distributed and heavy-tailed errors in the same way as in Section 2.5. Here we set U_ij ~ U(0, 20), outlier proportions f ∈ {5%, 10%, 20%, 30%}, sample size n = 500, and p = 20 predictors, and added random errors and outliers to the simulated data as follows:

- Random errors: we generated the random error vector from i) the standard normal distribution and ii) a t-distribution with 3 degrees of freedom.
- Vertical outliers: we generated an n-dimensional zero vector τ and randomly selected nf of its elements to be outliers drawn from a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of 30.
- Leverage points: we took 20% of the contaminated observations as leverage points, replacing the corresponding predictors with samples from N(2 max(X), 1).

The non-negative coefficients β_j were sampled from U(0, 1), j = 1, ..., p. We replicated each setting 50 times. As shown in Figures 3.1 and 3.2, FARDEEP outperforms the other methods, as evidenced by the SSE, R², and R values. In Appendix B, we also include another outlier construction scheme with X-related outliers, as well as a simulation setting with correlated responses. In both scenarios, FARDEEP dominates the other methods in terms of SSE, R², and R.

3.4.2 In silico simulation based on leukocyte gene signature matrix file

Following a similar procedure to Newman et al. (2015), we randomly generated the abundance of each cell type from the interval [0, 1]; notably, the cell abundances need not sum to 1. The measurement errors were sampled from 2^{N(0, (0.1 log₂ s)²)}, where s is the standard deviation of the original mixtures. To incorporate outliers, we randomly selected i/50 of the data, with i = 1, 2, ..., 25, and replaced them with values drawn from 2^{N(10, (0.3 log₂ s)²)}. We repeated the procedure nine times and estimated the cell type abundances using FARDEEP, CIBERSORT (without converting to percentages), NNLS, PERT, and DCQ. As shown in Table 3.1, the SSE for FARDEEP ranges from 1.51×10⁻⁷ to 1.47×10⁻⁴, and its R² and R remain 1 regardless of the number of outliers, while the other methods show significantly larger SSE and smaller R² and R.

3.4.3 Synthetic dataset

We used the cell line dataset GSE11103, generated by Abbas et al. (2009), which contains gene expression profiles of four immune cell lines (Jurkat, IM-9, Raji, and THP-1) and four mixtures (MixA, MixB, MixC, and MixD) with various ratios of cells. Before analysis, we quantile normalized the mixture data over 54,675 probesets and downloaded the immune gene signature matrix with 584 probesets from the CIBERSORT website. We then applied five deconvolution methods (FARDEEP, CIBERSORT without converting to percentages, DCQ, NNLS, and PERT) and calculated the sum of squared errors of the estimated abundances of the four immune cell lines. FARDEEP gives the smallest SSE and the largest R², indicating the most accurate result (Figure 3.3).

3.4.4 Synthetic dataset with added unknown content

Since both CIBERSORT and FARDEEP are robust deconvolution methods and show advantages on the datasets above, we next sought to compare their performance on mixtures with unknown content. We followed the simulation setting proposed by Newman et al. (2015) and downloaded the signature gene file from the CIBERSORT website. The mixture file was constructed from the four immune cell line datasets mentioned in the previous section and a colon cancer cell line, HCT116 (average of GSM269529 and GSM269530 in GSE10650). Cancer cells were mingled into the immune cells at ratios {0%, 30%, 60%, 90%}. Noise was added by sampling from the distribution 2^{N(0, (f log₂ s)²)}, with f ∈ {0%, 30%, 60%, 90%} and s the standard deviation of the original mixtures. Applying FARDEEP and CIBERSORT (without converting to percentages) to the 64 mixtures, we found that FARDEEP retains an accurate estimation, while the tumor contents skew the results of CIBERSORT with larger deviations from the ground truth (Figure 3.4).
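To make the mixture-construction recipe of Sections 3.4.2 and 3.4.4 concrete, here is a hedged numpy sketch; the placeholder signature matrix S and all names are hypothetical stand-ins, and the additive form of the 2^N noise term is our reading of the recipe:

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.uniform(0, 100, (584, 22))            # placeholder for a real signature matrix
n_genes, p = S.shape

frac = rng.uniform(0, 1, p)                    # cell abundances; need not sum to 1
mix = S @ frac
s = mix.std()                                  # std of the original mixture

f = 0.3                                        # noise level, as in {0%, 30%, 60%, 90%}
mix_noisy = mix + 2.0 ** rng.normal(0, f * np.log2(s), n_genes)

out = rng.choice(n_genes, n_genes // 50, replace=False)   # ~2% outlier genes
mix_noisy[out] = 2.0 ** rng.normal(10, 0.3 * np.log2(s), out.size)
```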
3.5 Real-data application

3.5.1 Deconvolution performance on immune-cell-rich datasets

To evaluate the performance of FARDEEP in immune-cell-rich settings that are less affected by outliers, we downloaded and analyzed two gene expression datasets (GSE65135 and GSE20300) (Newman et al., 2015; Shen-Orr et al., 2010) generated on the Affymetrix Microarray, the same platform used to generate the signature matrix LM22. The GSE65135 dataset consists of (i) surgical lymph node biopsies from 14 follicular lymphoma patients and (ii) purified B and T cells from the tonsils of 5 healthy controls, and the GSE20300 dataset includes 24 blood samples from pediatric renal transplant patients. The flow cytometry or Coulter counter data in these studies, which are presented on relative scales, are treated as the ground truth; we therefore normalized the estimated parameters of each method to sum to 1 before comparison. As shown in Figures 3.5A - 3.5B for case (i) of GSE65135 and Figures 3.5D - 3.5E for GSE20300, FARDEEP outperformed CIBERSORT in terms of R², R, and SSE, consistent with our findings on simulated datasets. For case (ii) of GSE65135, we estimated the immune cell composition of purified B and T cells with purity levels exceeding 95% and 98%, respectively. For purified B cells, CIBERSORT tends to return non-zero estimates for T cells and a large proportion of other cell types, while FARDEEP gave almost all zero estimates for T cells and on average reduced the estimation error by 61%. Similarly, for purified T cells, although CIBERSORT performed better than on purified B cells, FARDEEP still significantly improved the estimation accuracy, reducing the estimation error by 48% on average (Figure 3.5C). Furthermore, as shown in the supplementary table, FARDEEP detected the genes CD79A and BCL2A1 as outliers in most samples of case (i) of GSE65135; these two genes are known to be highly expressed in follicular lymphoma (B-cell lymphoma) cells (Eray et al., 2003). Overall, even in specimens that are rich in immune cells and free of contamination by non-hematopoietic malignancy, FARDEEP still outperforms CIBERSORT in immune cell deconvolution.

3.5.2 Deconvolution performance on RNA-seq datasets

In addition to effectively handling Microarray data, FARDEEP can also deconvolve TILs from RNA-seq data when we replace the signature matrix LM22 with quanTIseq, a signature matrix generated from RNA-seq data containing ten immune cell types (Finotello et al., 2017). We applied CIBERSORT and FARDEEP with the quanTIseq signature matrix to peripheral blood mononuclear cell (PBMC) mixtures (GSE64655) generated by Hoek et al. (2015), and to lymph node bulk samples of 4 melanoma patients from GSE93722 (Racle et al., 2017). The flow cytometry data in these studies are on a relative scale and are treated as the ground truth, so we normalized the estimated parameters of each method to a relative scale using (3.3) before comparison. RNA-seq data are usually less noisy than Microarray data, and PBMC datasets are usually clean with little unknown content; we therefore expected FARDEEP and CIBERSORT to return comparable results, which is the case in Figures 3.6A and 3.6B. However, when dealing with noisier data containing more outliers, such as the lymph node bulk samples, FARDEEP gains a larger advantage over CIBERSORT, as shown in Figures 3.6C and 3.6D.

3.6 Survival analysis

3.6.1 Lung Squamous Cell Carcinoma and Ovarian Serous Cystadenocarcinoma datasets

The TME of a solid carcinoma differs from a lymph node biopsy or peripheral blood, and the highly variable gene expression of cancer cells challenges the accuracy of immune cell deconvolution. It is well established that the immune infiltration profile serves as a promising prognosticator (Galon et al., 2006; Mlecnik et al., 2016). Hence, we next utilized survival and gene expression data for ovarian cancer (OV) and lung squamous cell carcinoma (LUSC) from The Cancer Genome Atlas (TCGA) database to assess the prognostic relevance of the different deconvolution methods. These two datasets were chosen because only LM22, not the RNA-seq-based signature matrix quanTIseq, includes γδ T cells, and OV and LUSC are the only two TCGA cancer types with Affymetrix microarray data. Using the gene expression data (n = 514 for OV and n = 133 for LUSC), we estimated the immunoscore using ESTIMATE, proposed by Yoshihara et al. (2013); the TIL proportions using CIBERSORT; and the TIL subset scores using CIBERSORT (without converting to percentages) and FARDEEP. Cold tumors typically harbor lower numbers of CD8+ T cells, γδ T cells, M1-like macrophages, and NK cells (Lei et al., 2016; Qiao et al., 2017; Binnewies et al., 2018; Corrales et al., 2016b). Thus, we calculated an anti-tumor immune subsets score as the summation of the CD8+ T cell, γδ T cell, M1-macrophage, and NK cell scores. We then partitioned the patients into two groups of equal size using the median of either the immunoscore (ESTIMATE) or the anti-tumor immune subsets score (CIBERSORT and FARDEEP), and compared the survival curves between the two groups.
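The median split and log-rank comparison can be expressed compactly; this sketch assumes the lifelines package is available and uses hypothetical input arrays:

```python
import numpy as np
from lifelines.statistics import logrank_test

def median_split_pvalue(score, time, event):
    """Split patients at the median anti-tumor score and run a log-rank test.

    score: per-patient sum of CD8 T, gamma-delta T, M1 macrophage, and NK scores
    time, event: follow-up times and death indicators (1 = deceased)
    """
    high = score > np.median(score)
    result = logrank_test(time[high], time[~high],
                          event_observed_A=event[high],
                          event_observed_B=event[~high])
    return result.p_value
```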
Yoshihara et al. (2013), the TIL proportion using CIBERSORT, and TIL subset scores using CIBERSORT (without converting to percentage) and FARDEEP. Cold tumors typically harbor lower numbers of CD8+ T cells, γδ T cells, M1-like macrophages, and NK cells (Lei et al., 2016; Qiao et al., 2017; Binnewies et al., 2018; Corrales et al., 2016b). Thus, we calculated an anti-tumor immune subset score as the sum of the CD8+ T cell, γδ T cell, M1 macrophage, and NK cell scores. Then, we partitioned the patients into two equal-sized groups using the median of either the immunoscore (ESTIMATE) or the anti-tumor immune subset score (CIBERSORT and FARDEEP) and compared the survival curves between the two groups. As shown in Figure 3.7, FARDEEP most effectively separates patients into high- and low-risk groups with the smallest p-values (p = 0.0065 and 0.059 for OV and LUSC, respectively). Recently, the CIBERSORT website began supporting a beta version of an absolute mode for cell deconvolution. We also included CIBERSORT absolute mode in this survival analysis and found that it returned a better result (p = 0.037) than the relative mode on the OV dataset; FARDEEP shows a stronger performance with a smaller p-value under this setting (Figure 3.8). These results demonstrate that the TIL subset scores can provide additional clinically relevant information beyond the relative abundance.

In addition, we expected the sum of these TIL subset scores to correlate negatively with tumor purity. To test this hypothesis, we calculated the sum of the 22 TIL subset scores for both the OV and LUSC datasets and correlated it with the tumor purity estimated from the consensus measurement of purity estimations (CPE) (Aran et al., 2015). Even without taking stromal cells into account, the sum of the TIL subset scores is negatively correlated with tumor purity, as shown in Figure 3.9.

Next, we sought to investigate whether outlier removal reduces contamination by transcripts from cancer cells. We first identified the top outlier-genes that were consistently removed by FARDEEP in the OV dataset and obtained the average expression values of those outlier-genes from OV cell lines in GSE32474 (Wang et al., 2006). As shown in Table 3.2, most of these outlier-genes have high expression in cancer cell lines. For example, the CXCL10 gene encodes an important chemokine that recruits CD8+ T cells and is also highly expressed in ovarian cancer cells. Thus, although some genes in LM22 may play a role in immune cells, they may also be highly expressed and variable among cancer cells. Such cross-contamination likely skews immune deconvolution analysis. As shown in Table 3.2, FARDEEP successfully detected and removed those genes, leading to a more robust and accurate deconvolution analysis.
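The median stratification and log-rank comparison used in this survival analysis can be sketched as follows, assuming per-patient subset scores, survival times, and event indicators are available. The sketch relies on the third-party lifelines package, and all variable names and the toy data are illustrative, not our actual TCGA pipeline.

```python
import numpy as np
from lifelines.statistics import logrank_test

def median_split_logrank(score, time, event):
    """Split patients at the median of an anti-tumor immune subset
    score (e.g., CD8+ T + gamma-delta T + M1 macrophage + NK) and
    compare the two survival curves with a log-rank test."""
    high = score > np.median(score)
    res = logrank_test(time[high], time[~high],
                       event_observed_A=event[high],
                       event_observed_B=event[~high])
    return res.p_value

# Toy data: higher immune score -> longer survival on average.
rng = np.random.default_rng(1)
score = rng.gamma(2.0, 1.0, 200)
time = rng.exponential(30 * (1 + score))
event = rng.random(200) < 0.7          # ~70% of events observed
print(median_split_logrank(score, time, event))
```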
3.7 Discussion

The cancer immune microenvironment has emerged as a critical prognostic dimension that modulates patient responses to neoadjuvant therapy. However, the current clinical TNM staging system does not have a consistent method to stratify cancers based on their immunogenicity. Recent studies show that the RNA-seq datasets of whole tumors contain valuable prognostic information for assessing cancer-immunity interactions (Newman et al., 2015; Robinson et al., 2017). But the current methods for extracting immune signatures are susceptible to the frequent outliers in the datasets, leading to less effective identification of cold tumors.

Based on support vector regression, CIBERSORT is one of the most popular robust deconvolution methods. However, the model does not include an intercept to capture possible contributions from other cell types, and it performs a Z-score normalization on the data before fitting the regression model, which introduces bias into the output. A discussion of the effect of Z-score normalization in CIBERSORT is included in Appendix B. In this study, we developed a new machine learning tool, FARDEEP, to streamline the removal of outliers and increase the robustness of gene-expression profile deconvolution. Robustness is an indispensable feature for a deconvolution method because gene expression data are frequently contaminated by a large number of outliers. FARDEEP solves the deconvolution problem robustly by evaluating all outliers across the dataset and then fitting the true immune gene signature using non-negative regression. This feature is especially useful for analyzing tumors with significant non-hematopoietic tumor components. Interestingly, although FARDEEP and the current robust methods can both handle immune-cell-rich specimens such as lymph nodes and PBMCs, FARDEEP exhibits improved prognostic potential when dealing with more complex datasets containing significant carcinoma cell content.

Although FARDEEP provides a robust computational algorithm to better solve the gene-expression deconvolution problem on noisy datasets, its performance and application rely on the choice of the signature matrix. To avoid estimation bias, it is important to choose a signature matrix derived from the same platform as the mixture matrix: for gene expression data measured by Affymetrix HGU133A we should use LM22, whereas for RNA-seq data the signature matrix quanTIseq is preferred. Overall, we show that FARDEEP is a powerful and rapid machine learning tool that outperforms existing robust methods for gene-expression deconvolution on datasets with significant heavy-tailed noise. FARDEEP provides a new technology to interrogate cancer immunogenomics and more accurately map the immune landscape of solid tumors.

Figure 3.1: SSE of coefficients for different approaches. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of the coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure 3.2: Comparison of the estimation accuracy of different deconvolution approaches (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} percentages of outliers, we computed R² to evaluate how well the estimators fit the line $\hat{\beta} = \beta$. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure 3.3: Applying different deconvolution approaches to the gene expression data of IM-9, Jurkat, Raji, and THP-1 and to mixtures of these four immune cell lines with known proportions (MixA, MixB, MixC, MixD). All mixtures were performed and measured in triplicate. (A) SSE of coefficients for FARDEEP, CIBERSORT, NNLS, PERT, and DCQ. (B) Abundance of cell lines estimated by the different deconvolution approaches vs. abundance of cell lines truly mixed. The R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT.
The correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown in the figures. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.4: Performance comparison between CIBERSORT and FARDEEP on mixtures with unknown content. (A) SSE for various noise levels and abundances of tumor content on 4 different mixtures. (B) Abundance of cell lines estimated by CIBERSORT and FARDEEP vs. abundance of cell lines truly mixed. The R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. The correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown in the figures. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.5: Performance assessment on Microarray data with SSE, R², and R. The follicular lymphoma dataset is evaluated for (A) overall performance and (B) individual cell subsets. (C) Normal tonsil dataset with purified B cells and T cells. The blood samples from pediatric renal transplant patients are evaluated for (D) overall performance and (E) individual cell subsets. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.6: Gene-expression deconvolution performance of FARDEEP and CIBERSORT on RNA-seq data. (A) Overall and (B) individual cell subset results for 8 PBMC samples collected from two vaccinated donors at different time points. (C) Overall and (D) individual cell subset results for lymph node bulk samples. The correlation coefficient R and the corresponding p-value are missing for CIBERSORT on NK cells because the CIBERSORT estimates are all zero. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.7: Kaplan-Meier survival curves based on ESTIMATE-, FARDEEP-, and CIBERSORT-assisted TIL profiling. The log-rank test was applied to data classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subset score (CIBERSORT and FARDEEP) was above or below the median. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.8: Kaplan-Meier survival curves based on CIBERSORT (absolute mode)-assisted TIL profiling. Patients were classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subset score was above or below the median. The log-rank test was applied to obtain the p-value. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.9: FARDEEP scores for the TCGA OV and LUSC datasets, calculated as the sum of the 22 TIL subset scores, show strong negative correlations with the consensus measurement of purity estimations (CPE).
[Figure 3.8 panels: OV, p = 0.037; LUSC, p = 0.78 (CIBERSORT absolute mode). Figure 3.9 panels: OV, R = -0.62; LUSC, R = -0.71 (FARDEEP score vs. CPE).]

Table 3.1: Performance of FARDEEP, CIBERSORT, NNLS, PERT, and DCQ under different simulation settings with outliers, reported as the mean SSE with the standard deviation of SSE in parentheses. FARDEEP's R² and R both range from 1.00 to 1.00, while the other methods' mean SSEs are larger by several orders of magnitude (CIBERSORT roughly 5 to 7, DCQ roughly 3 to 24, PERT and NNLS up to 1.64e2) and their R² ranges extend far below zero (e.g., NNLS: -7.10e2 to 1.00; DCQ: -9.88e1 to 0.45).

Outliers   FARDEEP mean SSE (SD)
2%         1.68e-7 (1.21e-7)
4%         2.20e-7 (1.09e-7)
6%         2.35e-7 (2.17e-7)
8%         2.31e-7 (1.22e-7)
10%        2.18e-7 (2.19e-7)
12%        1.51e-7 (4.86e-8)
14%        2.38e-6 (1.01e-7)
16%        3.70e-7 (3.00e-7)
18%        2.83e-7 (9.38e-8)
20%        4.93e-7 (4.08e-7)
22%        5.99e-7 (3.51e-7)
24%        4.00e-6 (7.16e-6)
26%        1.02e-6 (1.23e-6)
28%        8.14e-7 (9.14e-7)
30%        1.54e-6 (2.04e-6)
32%        5.12e-6 (7.68e-6)
34%        3.03e-6 (3.56e-6)
36%        3.36e-5 (6.32e-5)
38%        4.53e-6 (5.91e-6)
40%        1.01e-5 (1.19e-5)
42%        1.06e-4 (2.96e-4)
44%        3.04e-5 (4.10e-5)
46%        6.44e-6 (7.10e-6)
48%        1.47e-4 (1.78e-4)
50%        5.28e-5 (5.26e-5)

Table 3.2: The outlier genes detected by FARDEEP and their average expression values in 7 ovarian cancer cell lines for the TCGA OV dataset. Listed are all genes with a removal frequency larger than 50% among the 514 samples.
Gene       Frequency  NCI-ADR-RES  OVCAR-3  OVCAR-4  OVCAR-5  OVCAR-8  SK-OV-3  IGROV1
CHI3L1     96%        9.76         37.4     1149.51  4.32     4.28     7.78     28.91
APOBEC3G   84%        654          24.5     32.12    242.22   612.56   98.77    90.38
CXCL10     84%        25.05        66.55    15.25    26.47    46.24    47.85    45.19
IFI44L     79%        37.15        182.44   15.44    58.11    61.26    66.49    4.63
SLC12A8    79%        291.4        426.97   342.54   81.4     432.2    77.93    89.91
RGS1       75%        3.63         3.47     2.51     3.28     10.2     7.74     6.82
C1orf54    71%        80.14        33.86    117      19.46    98.08    46.11    98.27
FZD3       69%        65.76        394.81   216.9    74.34    108.69   477.65   878.79
PNOC       68%        31.96        4.63     28.14    25.08    16.63    40.56    314.87
STXBP6     66%        537.71       2231.15  464.33   217.93   240.62   4.99     6.41
TRIB2      66%        140.69       568.85   108.99   1910.09  113.24   439.38   216.98
GSTT1      63%        749.88       1787.77  19.09    994.12   604.82   518.46   1176.07
MS4A6A     61%        29.42        31.78    16.15    21.76    40.23    23.15    31.04
SPOCK2     59%        117.32       326.4    94.08    213.96   76.59    88.95    145.94
TCF7       57%        63.48        79.55    92.49    238.59   93.24    119.39   80.23
PPFIBP1    56%        391.73       1279.24  819.14   828.22   277.91   415.94   615.33
RSAD2      56%        17.54        46.41    33.83    86.84    36.56    30.13    18.02
PLEKHF1    54%        120.38       2345.08  522.76   167.33   284.82   687.72   322.99
FCGR2B     53%        7.64         30.34    30.8     8.08     14.7     15.7     32.79
HLA-DQA1   52%        15.65        27.75    32.08    6.74     29.95    19.1     36.81
LAMP3      52%        250.12       103.46   1963.43  165.71   1186.43  179.73   366.39
NCF2       51%        183.96       9.12     87.42    317.48   423.67   21.72    237.07

Chapter 4

Deep learning framework on DNA quantification

The ability to modulate gene expression levels enables plants to respond to stressful environments, and such changes in gene expression are among the most prominent forms of regulation in plants. One key mechanism is the binding of transcription factors to specific DNA sequences, such as regulatory elements situated near (cis) or far (trans) from the gene. To further develop our ability to model the adaptive capabilities of various species in different environments across time, we pursued a competitive molecular biology challenge: predicting gene expression while taking diverse environmental conditions into account. Our ability to predict gene expression responses has grown out of efforts to identify both transcription and regulatory factors (Uygun et al., 2017; Wilkins et al., 2016), but how these responses are regulated warrants further research.

One of the first deep learning models in this area was a multi-layer fully connected neural network designed to predict gene expression profiles. Its authors measured expression on a set of 1,000 landmark genes, which captured approximately 80% of the total expression variation in humans (Subramanian et al., 2017; Chen et al., 2016), and used the variation in these features to infer the expression of the remaining non-landmark genes. To predict non-landmark gene expression patterns, Ghasedi Dizaji et al. (2018) and Wang et al. (2018) developed generative adversarial networks whose discriminators operate on landmark gene expression patterns. Deep learning has also been used to predict gene expression patterns from omic information, including histone modification patterns, nucleosome occupancy, chromosome conformation, and transcription factor binding sites (Singh et al., 2016; Read et al., 2019). However, these approaches have two significant drawbacks. First, the generation of the input data is expensive, usually costing more than directly measuring the expression levels themselves; this limits the applicability of these methodologies to model organisms despite the publicly available data.
Second, bias emerges from the choice and handling of predictive factors; more specifically, models trained on landmark gene expression patterns perform poorly on genes with unusual expression patterns. Fortunately, deep learning architectures allow us to predict gene expression directly from raw DNA sequence information, which addresses both limitations. For instance, Washburn et al. (2019) used as input the 1.5 kilobases (kb) of DNA sequence surrounding the transcription start and stop sites of a gene to predict the presence or absence of gene expression and to determine which of two orthologous genes would show a higher expression level. Their investigation relied on a convolutional neural network (CNN), a type of deep learning algorithm that learns kernels, or pattern finders, to detect local patterns and optimize model performance. Applied to DNA sequence, these kernels function as motif finders. To better understand how a model recognizes and learns motifs, probing the trained kernels is necessary for identifying essential known and novel regulatory elements.

Studies over the past decade have clarified that the mere presence or absence of a regulatory motif does not suffice to explain gene regulation. Instead, a so-called regulatory grammar, comprising the number, location, and orientation of regulatory elements along with their co-localization, is at work (Weingarten-Gabbay and Segal, 2014). To model this complex regulatory grammar, several deep learning strategies originally created for natural language processing have been proposed. The model proposed by Quang and Xie (2016) combines a convolutional layer with a bi-directional long short-term memory network (BiLSTM), which enables the model to capture long-range dependencies while maintaining a spatial memory. The resulting model, DanQ, learned complex sequence grammar and predicted sequence features such as chromatin profiles, transcription factor binding, DNaseI sensitivity, and histone marks. However, in RNN-based methods the long-range information must pass through all cells sequentially, and it is easily corrupted when repeatedly multiplied by small values; this is the vanishing gradient problem. Although the LSTM was proposed to address this issue, its gradients still vanish on very long sequences such as DNA sequence data. To address these limitations, Vaswani et al. (2017) developed the Transformer, which abandons the recurrent structure entirely and uses self-attention to identify complex long-range interactions; because self-attention does not process the sequence sequentially, it handles long sequences well.
Motivated by these developments, we propose DeepCAT, a deep learning-based convolutional and self-attention neural network framework. We demonstrate the utility of DeepCAT by predicting gene expression responses to 57 environmental stress conditions from raw sequence data for Arabidopsis thaliana, and we show that DeepCAT can be used to efficiently identify novel TFBMs and their higher-order interactions.

4.1 Materials and Methods

4.1.1 Features and data

The gene expression data of Arabidopsis thaliana were downloaded from the AtGenExpress database and processed as in Uygun et al. (2017). The 57 responses indicate whether an Arabidopsis gene was up-regulated or not in shoot tissue under each of 36 abiotic (e.g., cold, heat, osmotic) and 21 biotic (e.g., Pseudomonas syringae, bacterial flagellin) stress conditions. More specifically, the expression data were processed with Limma (Ritchie et al., 2015) in the R environment, and genes with log2 fold changes greater than 1 were considered up-regulated and labeled as 1.

DNA sequence information was pulled from TAIR10 and included 1 kilobase (kb) upstream and 500 base pairs (bp) downstream of the transcription start site, and 500 bp upstream and 1 kb downstream of the transcription stop site. This sequence information was one-hot encoded into a $3002 \times 4$ binary matrix, with columns corresponding to A, C, G, and T and rows corresponding to each position in the sequence. Genes were randomly assigned to the training (70%), validation (10%), or testing (20%) set. The validation set was used to determine the optimal number of training iterations; model performance on the testing set at that iteration was then evaluated with two metrics, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and the Precision-Recall AUC (PR-AUC).

4.1.2 The DeepCAT model design

DeepCAT is composed of a convolution layer, a pooling layer, Transformer encoder layers (Vaswani et al., 2017), and a fully connected layer, as shown in Figure 4.1. The convolution and pooling layers extract features from the sequence, and the Transformer encoder layers learn the interactions between different features of the sequence. Finally, the fully connected layer aggregates all information and computes the probability for each of the 57 stress factors through a sigmoid activation function.

Formally, suppose the kernel sizes of the convolution and pooling layers are $k_c$ and $k_p$, respectively; then the first two layers of the model can be formulated as

$$Y^{(0)}_{mk} = \mathrm{ReLU}\Big(\sum_{i=1}^{k_c}\sum_{j=1}^{4} W^{(k)}_{ij} X_{m+i-1,j} + b_k\Big),$$
$$Y^{(1)}_{nk} = \max\big\{Y^{(0)}_{(n-1)k_p+1,k},\; Y^{(0)}_{(n-1)k_p+2,k},\; \ldots,\; Y^{(0)}_{(n-1)k_p+k_p,k}\big\},$$

where $X$ is the input $3002 \times 4$ sequence matrix, $Y^{(0)}$ and $Y^{(1)}$ are the outputs of the convolution and pooling layers, $W$ and $b$ are the kernel weight matrix and bias, $k$ is the kernel index, and the activation function is $\mathrm{ReLU}(x) = \max(0, x)$. The output can then be considered as a new sequence with $\frac{3001 - k_c}{k_p}$ tokens, where the length of each token is $k$.
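The architecture outlined so far can be condensed into a short PyTorch sketch. This is an illustrative re-implementation rather than the trained DeepCAT code: it uses torch.nn.TransformerEncoder in place of a hand-written attention stack, the [CLS] construction and positional encoding anticipate the definitions given below, and the toy batch is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_pe(length, dim):
    """Sine/cosine positional encoding as in Vaswani et al. (2017)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / dim)   # (length, dim/2)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class DeepCATSketch(nn.Module):
    """Minimal convolution + self-attention model for one-hot DNA
    input of shape (batch, 3002, 4); k + 1 = 320 is divisible by
    the four attention heads."""
    def __init__(self, k=319, kc=26, kp=13, n_tasks=57, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(4, k, kernel_size=kc)        # motif scanners
        self.pool = nn.MaxPool1d(kernel_size=kp)
        layer = nn.TransformerEncoderLayer(d_model=k + 1, nhead=n_heads,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        cls_tok = torch.zeros(1, 1, k + 1)
        cls_tok[0, 0, -1] = 1.0      # all-zero [CLS] except the padded slot
        self.register_buffer("cls_tok", cls_tok)
        self.head = nn.Linear(k + 1, n_tasks)

    def forward(self, x):                                  # (batch, 3002, 4)
        h = self.conv(x.transpose(1, 2)).relu()            # (batch, k, 2977)
        h = self.pool(h).transpose(1, 2)                   # (batch, 229, k)
        h = F.pad(h, (0, 1))                               # widen tokens to k+1
        h = torch.cat([self.cls_tok.expand(h.size(0), -1, -1), h], dim=1)
        h = h + sinusoidal_pe(h.size(1), h.size(2))
        h = self.encoder(h)
        return torch.sigmoid(self.head(h[:, 0]))          # [CLS] -> 57 probs

# Toy forward pass on a random one-hot batch.
x = torch.eye(4)[torch.randint(0, 4, (2, 3002))]           # (2, 3002, 4)
print(DeepCATSketch()(x).shape)                            # torch.Size([2, 57])
```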
Inspired by BERT (Devlin et al., 2018), we add a special token ([CLS]) at the start of the sequence before feeding it into the Transformer encoder layers, and the final hidden state of this token is used as the aggregate sequence representation. The [CLS] token has $k + 1$ dimensions, with all elements zero except the $(k+1)$th; we also pad the $(k+1)$th element of all other tokens with zero. To inject position information into the vectors, a positional encoding based on sine and cosine functions is added, as described in Vaswani et al. (2017):

$$PE_{(pos,2i)} = \sin\!\big(pos/10000^{2i/(k+1)}\big), \qquad PE_{(pos,2i+1)} = \cos\!\big(pos/10000^{2i/(k+1)}\big),$$
$$Y^{(2)} = \begin{bmatrix} [\mathrm{CLS}] \\ Y^{(1)} \end{bmatrix} + PE,$$

where $pos$ is the position and $i$ is the dimension. The outputs are then fed into two deep Transformer encoders, each containing multi-head self-attention layers and feed-forward neural network layers. The multi-head self-attention layer can be computed as

$$Y^{(3)} = [H_1, \ldots, H_h]\, Z^O,$$

where $h$ is the number of heads (set to four in this study), $Z^O$ is a $(k+1) \times (k+1)$ projection matrix, and

$$H_i = \mathrm{softmax}\!\left(\frac{(Y^{(2)} Z^Q_i)(Y^{(2)} Z^K_i)^T}{\sqrt{d}}\right) Y^{(2)} Z^V_i,$$

with $Z^Q_i$, $Z^K_i$, and $Z^V_i$ the weight matrices for the query, key, and value in the Transformer, each of dimension $(k+1) \times d$ and $d = (k+1)/h$. In addition to the attention sub-layers, each Transformer encoder layer includes a fully connected feed-forward network consisting of two linear projections with weight matrices $W_1$, $W_2$ and biases $b_1$, $b_2$:

$$Y^{(4)} = \mathrm{ReLU}(Y^{(3)} W_1 + b_1) W_2 + b_2.$$

Finally, we treat the final hidden state of the special token ([CLS]) as a summary of the information from the previous layers and feed it into a fully connected layer with a sigmoid activation, which scales the results into the range zero to one for the classification task:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.$$

4.1.3 Training of the model

To train the model, we minimized the average of the multi-task binary cross-entropy loss functions by minibatches of size 50 using the Adam optimizer, with all weights initialized by the Xavier uniform initializer. Dropout with a rate of 0.1 in the attention layers was included to regularize the model. The validation ROC score was used to monitor model convergence. Our model took around 300 epochs to train fully, and the training time for each epoch was 30 seconds. Our implementation uses a PyTorch backend, and an NVIDIA K80 GPU was used for training.

Figure 4.1: Model architecture. The convolution layer acts as a motif scanner and produces an output of dimension (sequence length - kernel size + 1) × (number of kernels). The max-pooling layer reduces the output dimension by extracting the maximum values through each pooling kernel. The special token and position encoding matrix are then added to the output. The two Transformer encoder layers consider the interactions between the tokens of the output sequence from the previous layers and re-interpret it. Finally, a sigmoid nonlinear transformation is performed to get the prediction scores.

4.1.4 Transfer learning

Since the size of our dataset is not quite sufficient, we considered transferring stored knowledge of motifs into DeepCAT so that our model could be trained from a better starting point. This is introduced through the convolution layer: since, as mentioned above, the convolution layer plays the role of scanning motifs, a good initialization of the kernel weights can bring our model closer to the optimal solution. We tried two sources of data. One is the pre-trained kernel weights of DeepCAT from a human dataset (Zhou and Troyanskaya, 2015), which contains 4 million training examples.
The other is the position weight matrices of Arabidopsis thaliana from the DAP-seq (O'Malley et al., 2016) and CIS-BP (Weirauch et al., 2014) databases.

4.1.5 Clustered multi-task learning

In contrast to traditional machine learning methods, the loss function of DeepCAT in Section 4.1.2 is based on all 57 stress responses. However, while some closely related stresses are expected to benefit each other's prediction, others may limit prediction accuracy. Thus, we cluster the stress responses into three groups by k-means clustering and train the model separately for each group. As the results in the next section show, some of the responses with poor performance improved significantly.

4.1.6 Interpreting the model

To visualize the patterns learned by DeepCAT, we convert the $k$ kernel weight matrices into position weight matrices (PWMs), which can also be considered as $k$ motifs, following the approach of DeepBind (Alipanahi et al., 2015):

• After training the model, we obtain $k$ kernel weight matrices from the convolution layer and, for each sequence, the corresponding output $Y^{(0)}_k$.

• For each kernel, find the position $t$ where the motif shows the maximum response for $Y^{(0)}_k$, i.e., $t = \mathrm{argmax}_m Y_{mk}$, and extract the subsequence spanning positions $t$ to $t + k_c - 1$ from the raw sequence.

• Considering only the sequences with non-zero values at some positions of $Y^{(0)}_k$, we stack the subsequences together and count the nucleotide frequencies. The PWMs are then calculated by normalizing the nucleotide counts to one.

Once we obtain the position weight matrices, we align them to the known motif database using the TOMTOM algorithm (Gupta et al., 2007), with a p-value threshold of 0.05 to measure the similarities. We validate this kernel conversion method in Appendix C. In addition, to further measure the importance of each kernel for the different stresses, we individually replace each kernel with all-zero values and track the change in the prediction AUC scores. The kernel producing the sharpest reduction in AUC score is considered the most important one for predicting the corresponding stress response. Finally, we learn the interactions between motifs by extracting the weight matrices from the first self-attention layer, that is, $\mathrm{softmax}\big((Y^{(2)} Z^Q_i)(Y^{(2)} Z^K_i)^T / \sqrt{d}\big)$ for each head $i$.

4.2 Results

4.2.1 Random initialization of the convolution kernel weights

In this section, we compare the performance of our model with DanQ, support vector machines (SVM), and random forests. We first trained our model with parameters $k = 319$, $k_c = 26$, $k_p = 13$, and $h = 4$. Our model took about 3000 epochs to train fully with learning rate $10^{-6}$, and the training time for each epoch was around 10 seconds. For DanQ, we kept its original parameter settings. In summary, DeepCAT shows the best average ROC and PR scores (Figure 4.2).

Figure 4.2: Performance comparison among DeepCAT, DanQ, SVM, and random forest by (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses. [Legend means (ROC / PR): SVM 0.674 / 0.123; random forest 0.679 / 0.120; DanQ 0.691 / 0.116; DeepCAT 0.714 / 0.138.]

4.2.2 Transfer learning with human data

Compared with the training set of about 4 million sequences used for human data in Zhou and Troyanskaya (2015), our data size is not sufficient. Thus, we pre-trained DeepCAT on the human data with the same parameters and transferred its final convolution kernel weights into our model as the initialization.
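A minimal sketch of this kernel-weight transfer, reusing the DeepCATSketch module from the earlier sketch: `source` stands in for the model pre-trained on the human data, and only the convolution weights are copied while every other layer keeps its random initialization.

```python
import torch

def transfer_conv_weights(source, target):
    """Copy pre-trained convolution kernel weights (the motif
    scanners) into a freshly initialized model."""
    with torch.no_grad():
        target.conv.weight.copy_(source.conv.weight)
        target.conv.bias.copy_(source.conv.bias)

src = DeepCATSketch()   # stand-in for the model pre-trained on human data
tgt = DeepCATSketch()
transfer_conv_weights(src, tgt)
```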
Figure 4.3 shows that the performance of DeepCAT improved on both ROC-AUC and PR-AUC scores, indicating that overlapping motif information exists in DNA sequence data across species.

Figure 4.3: The effect of applying pre-trained weights from human data to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The straight line y = x represents equality between the results of randomly initialized DeepCAT and DeepCAT with transfer learning based on human data. [Legend means (ROC / PR): DeepCAT with pre-trained human weights 0.730 / 0.151 vs. DeepCAT 0.714 / 0.138.]

4.2.3 Transfer learning with a known motif database

We showed in the last section that pre-trained kernel weights from a large dataset can efficiently improve model performance, even across species. Thus, we expected further improvement from incorporating prior knowledge of Arabidopsis thaliana into the model. First, we compiled 865 known TFBMs of Arabidopsis thaliana from the DAP-seq (O'Malley et al., 2016) and CIS-BP (Weirauch et al., 2014) databases; we then used hierarchical clustering to pick out the 100 most distinct TFBMs, with lengths from 6 to 23, and placed them in the middle of 100 kernels with randomly generated edges. In addition, we added another 219 random kernels to capture other information. As shown in Figure 4.4, both ROC-AUC and PR-AUC improved significantly. Thus, the known TFBMs can help DeepCAT learn much more efficiently.

Figure 4.4: The effect of applying known TFBMs to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The straight line y = x represents equality between the results of randomly initialized DeepCAT and DeepCAT with transfer learning based on the known TFBM database. [Legend means (ROC / PR): DeepCAT with known TFBMs 0.740 / 0.162 vs. DeepCAT 0.714 / 0.138.]

4.2.4 Performance of clustered multi-task learning

From Figure 4.2 we can see that SVM and random forest perform better on a few groups of the responses, especially those on which the deep learning model scores poorly. The reason is that these "shallow" machine learning methods train a separate model for each response, so each loss function depends on one single response and the responses do not affect each other during training. Thus, the under-performing responses of DeepCAT may be only distantly related to the others and suffer negative effects as a result. To address this problem, we grouped the 57 responses using hierarchical clustering with Jaccard's distance (Figure 4.5). Not surprisingly, the responses related to heat conditions are grouped separately from the others, and most of them are also the under-performing responses.

Figure 4.5: Hierarchical clustering with Jaccard's distance for the 57 stress responses.

Referring to the results in Figure 4.5, we trained separate models for these two clusters. The convolution kernel weight matrices were all initialized with the known TFBMs in the same way as in Section 4.2.3.
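The response-grouping step can be sketched as follows, assuming a binary gene-by-stress label matrix; the toy labels and the average-linkage choice are illustrative, and SciPy's pdist supports the Jaccard distance directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
labels = rng.random((5000, 57)) < 0.1   # toy binary up-regulation labels

# Jaccard distance between the 57 stress responses (columns),
# then hierarchical clustering cut into two groups.
d = pdist(labels.T, metric="jaccard")
groups = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
print(groups)
```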
The results are shown in Figure 4.6: multi-task learning on top of transfer learning clearly improved the performance of DeepCAT, and most of the ROC-AUC and PR-AUC scores for the previously under-performing responses became higher than those of the individually trained SVM and random forest.

Figure 4.6: Performance comparison among DanQ, SVM, random forest, DeepCAT, DeepCAT with known TFBMs, and multi-task DeepCAT with known TFBMs by (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses. [Legend means (ROC / PR): SVM 0.674 / 0.123; random forest 0.679 / 0.120; DanQ 0.691 / 0.116; DeepCAT 0.714 / 0.138; DeepCAT known TFBMs 0.740 / 0.162; DeepCAT known TFBMs multi-task 0.750 / 0.177.]

4.2.5 Transcription factor binding motifs and their interactions

Based on the best results obtained in Section 4.2.4, we aligned the learned motifs to known TFBMs (DAP-seq and CIS-BP) using the TOMTOM algorithm (Gupta et al., 2007). In summary, 196 of the 319 motifs were significantly matched to known TFBMs with E-values less than 0.05 (Figure 4.7). Using the method introduced in Section 4.1.6, we list the most important aligned TFBMs for each of the 57 stress conditions in Appendix C. We consider the top motifs that cannot be aligned to any known TFBM to be identified novel TFBMs that are important for predicting stress responses; some example sequence logos are shown in Figure 4.8.

Figure 4.7: Three examples of TFBMs aligned by the TOMTOM algorithm (Homeodomain ATHB-14, MADS-box 43709, and Myb-SANT MYB15). The upper figures are the known TFBMs, and the lower ones are generated from the convolution kernel weights trained by DeepCAT. The p-values are 7.46e-11, 1.22e-13, and 3.27e-11, respectively.

Figure 4.8: Three examples of identified novel TFBMs. The upper figures come from the initialized random motifs, and the lower ones are generated from the convolution kernel weights trained by DeepCAT.

Chapter 5

Conclusion and Future work

The contributions of our work can be summarized in two respects. On the statistical methodology side, we proposed Adaptive Least Trimmed Square, which adjusts the proportion of outliers at each iteration and automatically identifies and removes them before the coefficients are estimated. This is a hard-threshold-based robust regression and outlier detection method, for which theoretical guarantees are also provided in this study.
From the application perspective, we developed FARDEEP, based on Adaptive Least Trimmed Square, to solve the gene-expression deconvolution problem; the algorithm addresses the challenge that tumor micro-environmental data are very noisy, so that tumor-specific genes may degrade the estimation accuracy. In addition, we proposed a deep learning structure that can identify complex patterns in transcription factor binding sites in DNA sequences and used it to predict the responses of plants in different stressful environments. The combination of convolution layers and self-attention layers helps the model identify both motifs and their long-range dependencies for sequences of any length. In summary, our work contributes to the development of statistical methodologies and machine learning algorithms for solving biological problems.

In the future, we plan to extend our outlier detection and robust regression framework to non-linear models so that it can be widely used in other practical problems. In addition, DeepCAT has so far focused on predicting whether gene expression is up-regulated or not under different stress environments. It would also be interesting to know the magnitude of the fold change in gene expression level; thus, we will further adapt DeepCAT to a regression problem by modifying some of its current layers. These remaining problems will be investigated in our future work.

APPENDICES

APPENDIX A

Supplementary text for Chapter 2

Proof of Theorem 2.3.1

The Adaptive Least Trimmed Square (ALTS) iteration sequence satisfies $L(\beta^{(j+1)}) \leq L(\beta^{(j)})$ for any $j \geq 0$, with objective function

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i\beta)^2 \, \mathbf{1}_{\{\gamma_i = 0\}}.$$

Proof. For the $j$th iteration,

$$L(\beta^{(j)}) = \sum_{i \in S_{j-1}} (y_i - x_i\beta^{(j)})^2 \geq \sum_{i \in S_{j-1}} (y_i - x_i\beta^{(j+1)})^2 \geq \sum_{i \in S_j} (y_i - x_i\beta^{(j+1)})^2 = L(\beta^{(j+1)}).$$

Proof of Theorem 2.3.2

The Adaptive Least Trimmed Square (ALTS) algorithm stops in no more than $j^*$ steps, where

$$j^* = \left\lfloor \frac{-\log \alpha_1}{\log \alpha_2} \right\rfloor.$$

Here $\lfloor x \rfloor$ is the largest integer that is less than or equal to $x$.

Proof. Assume that $j^*$ steps have been taken in ALTS; then

$$\alpha_2^{j^*} \alpha_1 \frac{n}{2} \leq \frac{n}{2}, \qquad \text{i.e.,} \qquad \alpha_2^{j^*} \alpha_1 \leq 1,$$

which leads to the result.

Proof of Theorem 2.3.3

Suppose the corruption rate at the $j$th iteration of ALTS is $\rho_j$, the outlier vector satisfies $\|\gamma\|_0 \leq n\rho^*$, and $\rho_0 > \rho^*$. Let $c^-_\tau$ and $c^+_\tau$ be the subset strong convexity and smoothness constants for $X$ at level $\tau$. We assume that

$$\eta = 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_0}}} < 1.$$

Then if $\log_{\alpha_2}\frac{1}{\alpha_1} = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2,$$

with $\epsilon > 0$ and some constant $C$.

Proof. For the $j$th iteration, we have

$$\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 - \|\gamma_{S_j} + \varepsilon_{S_j}\|_2 \leq \|X_{S_j}(\beta^* - \beta^{(j+1)}) + \gamma_{S_j} + \varepsilon_{S_j}\|_2 = \|y_{S_j} - X_{S_j}\beta^{(j+1)}\|_2$$
$$\leq \|y_{S_j} - X_{S_j}\beta^*\|_2 = \|\gamma_{S_j} + \varepsilon_{S_j}\|_2.$$

Thus,

$$\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 \leq 2\,\|\gamma_{S_j} + \varepsilon_{S_j}\|_2.$$
Since $\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 \geq \sqrt{c^-_{1-\rho_j}}\,\|\beta^* - \beta^{(j+1)}\|_2$, we have

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \frac{2}{\sqrt{c^-_{1-\rho_j}}}\Big(\|\gamma_{S_j}\|_2 + \|\varepsilon_{S_j}\|_2\Big). \tag{A.1}$$

For the $j$th iteration, the residuals are ordered as $|r^{(j)}_{\pi(1)}| \leq |r^{(j)}_{\pi(2)}| \leq \cdots \leq |r^{(j)}_{\pi(n)}|$, where $\pi$ is a permutation of $n$ elements, and we define $\tilde{S}_j = \{\pi(i),\, i \leq n(1 - \rho_0)\}$. Then we have

$$\|y_{\tilde{S}_{j+1}} - X_{\tilde{S}_{j+1}}\beta^{(j+1)}\|_2 \leq \|y_{S^*} - X_{S^*}\beta^{(j+1)}\|_2,$$

i.e.,

$$\|X_{\tilde{S}_{j+1}}(\beta^* - \beta^{(j+1)}) + \gamma_{\tilde{S}_{j+1}} + \varepsilon_{\tilde{S}_{j+1}}\|_2 \leq \|X_{S^*}(\beta^* - \beta^{(j+1)}) + \varepsilon_{S^*}\|_2.$$

Define the sets $E_{j+1} = \tilde{S}_{j+1} \setminus S^*$ and $T_{j+1} = S^* \setminus \tilde{S}_{j+1}$; then

$$\|X_{E_{j+1}}(\beta^* - \beta^{(j+1)}) + \gamma_{E_{j+1}} + \varepsilon_{E_{j+1}}\|_2 \leq \|X_{T_{j+1}}(\beta^* - \beta^{(j+1)}) + \varepsilon_{T_{j+1}}\|_2,$$

so that

$$\|\gamma_{E_{j+1}}\|_2 \leq \|X_{E_{j+1}}(\beta^* - \beta^{(j+1)})\|_2 + \|\varepsilon_{E_{j+1}}\|_2 + \|X_{T_{j+1}}(\beta^* - \beta^{(j+1)})\|_2 + \|\varepsilon_{T_{j+1}}\|_2 \leq 2\sqrt{c^+_{\rho_0}}\,\|\beta^* - \beta^{(j+1)}\|_2 + 2\,\|\varepsilon\|_2.$$

Since $\|\gamma_{\tilde{S}_{j+1}}\|_2 = \|\gamma_{E_{j+1}}\|_2$, we apply (A.1) and get

$$\|\gamma_{\tilde{S}_{j+1}}\|_2 \leq 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}}\,\|\gamma_{S_j}\|_2 + \left(4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}} + 2\right)\|\varepsilon\|_2 \leq 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}}\,\|\gamma_{S_j}\|_2 + 3\,\|\varepsilon\|_2. \tag{A.2}$$

Then we apply (A.2) to (A.1):

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \frac{2}{\sqrt{c^-_{1-\rho_j}}}\Big(\|\gamma_{S_j}\|_2 + \|\varepsilon_{S_j}\|_2\Big) \leq \frac{2}{\sqrt{c^-_{1-\rho_0}}}\Big(\|\gamma_{\tilde{S}_j}\|_2 + \|\varepsilon\|_2\Big) \leq \frac{2\eta^j}{\sqrt{c^-_{1-\rho_0}}}\,\|\gamma_{S_0}\|_2 + \frac{3(1 - \eta^j)}{1 - \eta}\left(\frac{2}{\sqrt{c^-_{1-\rho_0}}} + 1\right)\|\varepsilon\|_2.$$

Since $\sqrt{c^-_{1-\rho_0}} \geq O(\sqrt{n})$ for large enough $n$, when $j = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2.$$

Thus, if $\log_{\alpha_2}\frac{1}{\alpha_1} = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2.$$
Proof of Corollary 2.3.1

Suppose the assumptions are the same as in Theorem 2.3.3 and the random noises follow a Gaussian distribution, i.e., $\varepsilon \sim N(0, \sigma^2 I)$. Then when $n > \log\frac{1}{\delta}$ for some $\delta > 0$, with probability at least $1 - \delta$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + C\sigma,$$

with $\epsilon > 0$ and some constant $C$.

Proof. If $\varepsilon \sim N(0, \sigma^2 I)$, then $\|\varepsilon\|_2^2 / \sigma^2 \sim \chi^2_n$. Following Lemma 1 of Laurent and Massart (2000),

$$P\left(\|\varepsilon\|_2^2 - n\sigma^2 > 2\sigma^2\sqrt{n\log\tfrac{1}{\delta}} + 2\sigma^2\log\tfrac{1}{\delta}\right) \leq \delta,$$

so that

$$P\left(\|\varepsilon\|_2^2 \leq n\sigma^2 + 2\sigma^2\sqrt{n\log\tfrac{1}{\delta}} + 2\sigma^2\log\tfrac{1}{\delta}\right) \geq 1 - \delta.$$

For any $n > \log\frac{1}{\delta}$, $P\big(\|\varepsilon\|_2 < \sqrt{5n}\,\sigma\big) > 1 - \delta$. Thus, combining this result with Theorem 2.3.3, we can get

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + C\sigma,$$

with probability at least $1 - \delta$.

APPENDIX B

Supplementary text for Chapter 3

Effect of Z-score normalization

CIBERSORT employs a Z-score normalization on both $y$ (the gene expression profile of the tumor) and $X$ (the expression values of the signature genes) to stabilize the Support Vector Regression algorithm. To assess the effect of Z-score normalization, we performed a brief mathematical derivation together with several simulations, as follows. Consider

$$y = X\beta + \epsilon, \tag{B.1}$$

where $\epsilon$ is the noise. First, CIBERSORT applies a Z-score transformation to both $y$ and $X$ to obtain

$$y^* = \frac{y - \bar{y}\mathbf{1}}{s_y}, \qquad X^* = \frac{X - \bar{X}\mathbf{1}\mathbf{1}^T}{s_X},$$

where $\bar{y} = \sum_{i=1}^n y_i / n$, $\bar{X} = \sum_{i=1}^n \sum_{j=1}^p x_{ij} / (np)$, and $s_y$ and $s_X$ are the sample standard deviations of $\{y_i\}_{i=1}^n$ and $\{x_{ij}\}$, respectively. Then, CIBERSORT applies Support Vector Regression (SVR) to $y^*$ and $X^*$ to obtain the coefficients. Note that in practice $s_y$ and $\bar{y}$ differ from $s_X$ and $\bar{X}$, respectively. Thus, the SVR coefficients based on $y^*$ and $X^*$ are much less likely to equal those based on $y$ and $X$.

To demonstrate the point, we simulated several mixtures based on the first 9 columns of the LM22 signature matrix with $\beta \in \mathbb{R}^9 = (2, 1.5, 1.2, 1, 0.8, 0.6, 0.5, 0.3, 0.2)'$ and random noise $\epsilon \sim 2^{\mathcal{N}(0,\,(0.3\log_2 s)^2)}$, where $s$ is the standard deviation of the original mixtures. As shown in Figure B.1, the estimates from SVR with Z-score normalization are skewed away from the truths, which are denoted by the red dashed lines in the figures. Thus, the Z-score normalization in CIBERSORT prevents it from recovering the absolute cell abundance. Also, although SVR without Z-score normalization performed well in this simulation, it is very slow and may not converge for a high-dimensional signature matrix, and variables with greater numeric ranges will dominate those with smaller numeric ranges, as discussed in Chang and Lin (2011) and Hsu et al. (2003). This is also the reason we include only the first 9 columns of LM22 in this simulation.

Figure B.1: Effect of Z-score normalization. The red dashed line indicates the true absolute abundance of each cell type. SVR is common Support Vector Regression without normalization, and SVR.znormalization is the method used in CIBERSORT. [Panel legends: SVR, SSE = 0.09; SVR.znormalization, SSE = 691.07.]

Importance of TIL subset scores

The TIL subset score is a good indicator of the absolute abundance of TILs and is an important scale for evaluating the immune level of cancer patients.
To demonstrate this, we simulated two mixtures without noise using the pure immune cell line (Jurkat, IM-9, Raji, and THP-1) expression profiles from GSE11103 and two different cancer cell line (colon cancer: HCT116; melanoma: MDA-MB-435) expression profiles from GSE10650 and GSE32474, respectively, as follows:

MIX1 = 0.2 cancer cell line + 0.2 Jurkat + 0.2 IM-9 + 0.2 Raji + 0.2 THP-1
MIX2 = 0.8 cancer cell line + 0.05 Jurkat + 0.05 IM-9 + 0.05 Raji + 0.05 THP-1

where the cancer cell line is either HCT116 or MDA-MB-435. We then applied CIBERSORT and FARDEEP with the signature matrix generated from Jurkat, IM-9, Raji, and THP-1, which can be downloaded from the CIBERSORT website. The outputs are shown in Table B.1. CIBERSORT gives only relative proportions of around 0.25 for both mixtures, even though mixture 2 has less immune content than mixture 1. In contrast, FARDEEP accurately estimates the exact abundance of the immune cells through the TIL subset scores, because they are direct estimates from the linear model. We can also derive the relative abundance (fraction) among immune cells from the TIL subset scores by

$$\tilde{\beta}_j = \frac{\hat{\beta}_j}{\sum_{k=1}^p \hat{\beta}_k},$$

where $\hat{\beta}_j$ is the $j$th TIL subset score. As shown in Table B.1, the relative abundance results derived from the TIL subset scores still outperform CIBERSORT because:

a. although most of the probesets in the signature matrix have high expression in the immune cell lines and low expression in the cancer cell lines (HCT116 and MDA-MB-435), several signature probesets are highly expressed in HCT116 or MDA-MB-435 (Figure B.2), which likely introduces an estimation bias. FARDEEP automatically removes those outlier probesets before estimating the abundance of the immune cells, as shown in Figure B.2, hence delivering a better estimate;

b. compared to CIBERSORT, FARDEEP includes an extra intercept in the regression model, which can capture the background effect of non-immune content such as tumor cells.

Interestingly, even with different ratios of cancer cells in MIX1 and MIX2, FARDEEP detected the same set of outliers from the two mixtures within each cancer type. There are 137 and 151 outliers for the HCT116 and MDA-MB-435 mixtures, respectively, of which 90 are shared. This result suggests that different cancer cell lines have distinct highly expressed genes, and the outlier-genes strongly depend on the sample type.

Table B.1: Results from CIBERSORT and FARDEEP on two mixtures (MIX1 and MIX2) with HCT116 and MDA-MB-435, respectively. FARDEEP (relative prop) was obtained by normalizing the default outputs to sum to 1.

HCT116
Cell line  Truth MIX1/MIX2  CIBERSORT MIX1/MIX2  FARDEEP MIX1/MIX2  FARDEEP rel. MIX1/MIX2
Jurkat     0.2 / 0.05       0.2512 / 0.2682      0.2004 / 0.0515    0.2505 / 0.2588
IM-9       0.2 / 0.05       0.2493 / 0.2404      0.1995 / 0.0479    0.2494 / 0.2410
Raji       0.2 / 0.05       0.2493 / 0.2387      0.1999 / 0.0498    0.2500 / 0.2502
THP-1      0.2 / 0.05       0.2503 / 0.2527      0.1999 / 0.0497    0.2500 / 0.2500

MDA-MB-435
Jurkat     0.2 / 0.05       0.2507 / 0.2801      0.2013 / 0.0552    0.2511 / 0.2667
Raji       0.2 / 0.05       0.2480 / 0.2153      0.2000 / 0.0509    0.2494 / 0.2459
IM-9       0.2 / 0.05       0.2514 / 0.2649      0.2002 / 0.0498    0.2497 / 0.2409
THP-1      0.2 / 0.05       0.2499 / 0.2398      0.2002 / 0.0510    0.2498 / 0.2465

Figure B.2: The y-axes are the sorted expression values of the HCT116 and MDA-MB-435 cell lines for the 584 probesets. The probesets detected as outliers by FARDEEP are marked in red.
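A small numeric sketch of the mixture construction and the score-to-fraction conversion above; the expression vectors are random stand-ins for the GSE11103/GSE10650 profiles, and the printed example uses the FARDEEP MIX2 scores from Table B.1.

```python
import numpy as np

rng = np.random.default_rng(3)
cancer, jurkat, im9, raji, thp1 = rng.lognormal(5.0, 1.0, size=(5, 584))

mix1 = 0.2 * cancer + 0.2 * (jurkat + im9 + raji + thp1)
mix2 = 0.8 * cancer + 0.05 * (jurkat + im9 + raji + thp1)

def to_fractions(beta):
    """Convert absolute TIL subset scores beta_j into relative
    abundances beta_j / sum_k beta_k."""
    beta = np.asarray(beta, dtype=float)
    return beta / beta.sum()

# FARDEEP-style MIX2 scores renormalize to ~0.25 each.
print(to_fractions([0.0515, 0.0479, 0.0498, 0.0497]))
```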
X-related outliers

To evaluate the performance of FARDEEP under a more complex outlier construction, we simulated datasets 50 times following the setting in the section "in silico simulation with varied error types", but let the outliers follow a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of $10 \cdot \max(x_i)$ for the $i$th gene, where $x_i$ is the $i$th row of $X$. In this way, the amplitude of the outliers for the $i$th gene depends on $x_i$. The results in Figure B.3, Figure B.4, and Table B.2 show that FARDEEP dominates the performance of the other methods and retains good outlier-detection accuracy.

Table B.2: Tuned $k$ for FARDEEP with the adjusted BIC. We simulated normally distributed errors and heavy-tailed errors respectively for different proportions of outliers and computed the true positive rate and false positive rate to evaluate the tuning result.

Percentage of outliers        5%     10%    20%    30%
Normal:
  True positive rate          1      1      1      1
  False positive rate         0.007  0.02   0.05   0.06
  Parameter (mean of k)       2.53   1.32   1.16   1.09
Heavy-tailed:
  True positive rate          1      1      1      1
  False positive rate         0.06   0.06   0.13   0.17
  Parameter (mean of k)       2.05   1.24   1.10   1.04

Figure B.3: SSE of coefficients for the different approaches based on simulations with X-related outliers. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of the coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure B.4: Comparison of the estimation accuracy of the different deconvolution approaches for simulations with X-related outliers (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} percentages of outliers, we computed R² to evaluate how well the estimators fit the line $\hat{\beta} = \beta$. (A) random error with standard normal distribution, (B) random error with t-distribution. [Panel legends report R² and R for each method at each outlier level; FARDEEP maintains R² ≥ 0.97 and R ≥ 0.99 throughout, while, e.g., CIBERSORT's R² falls below -40 once the outlier proportion reaches 10%.]
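A hedged sketch of this X-dependent outlier scheme using scipy.stats.nct; the matrix shapes, the clean mixture, and the function name are illustrative rather than the actual simulation code.

```python
import numpy as np
from scipy.stats import nct

rng = np.random.default_rng(4)
X = rng.lognormal(2.0, 1.0, size=(584, 22))   # toy signature matrix
y = X @ np.full(22, 0.05)                     # toy clean mixture

def x_dependent_outliers(y, X, frac=0.10):
    """Replace a fraction of genes with draws from a non-central
    t-distribution (1 df) whose non-centrality is 10 * max(x_i),
    tying outlier amplitude to the gene's signature row."""
    y = y.copy()
    idx = rng.choice(y.size, size=int(frac * y.size), replace=False)
    for i in idx:
        y[i] = nct.rvs(df=1, nc=10.0 * X[i].max(), random_state=rng)
    return y

noisy = x_dependent_outliers(y, X)
```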
Discussion of correlated genes

The gene-expression deconvolution problem can be expressed as the linear regression model (B.1). Sometimes the expression values of different genes are highly correlated. To evaluate the performance of FARDEEP when the assumption of independent errors is violated, we simulated datasets 50 times with different proportions of outliers, exactly following the simulation setting in the section "in silico simulation with varied error types" of the paper, except that the error term $\epsilon \sim N(0, \Sigma)$, where

$$\Sigma_{ij} = \begin{cases} 1 & i = j, \\ 0.7 & 1 \leq i, j \leq 20 \text{ and } i \neq j, \\ 0.5 & 21 \leq i, j \leq 40 \text{ and } i \neq j, \\ 0 & \text{otherwise}. \end{cases}$$
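The block covariance Σ defined above, and a draw of correlated errors from N(0, Σ), can be written directly as below; the dimension 584 mirrors the number of signature probesets used in the earlier simulations.

```python
import numpy as np

p = 584
Sigma = np.zeros((p, p))
Sigma[:20, :20] = 0.7          # first correlated block
Sigma[20:40, 20:40] = 0.5      # second correlated block
np.fill_diagonal(Sigma, 1.0)   # unit variances on the diagonal

rng = np.random.default_rng(5)
eps = rng.multivariate_normal(np.zeros(p), Sigma)   # one correlated error draw
```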
[Figure B.5 panels: log2(SSE) versus proportion of outliers for each method under correlated random errors.]

Figure B.6: Comparison of the estimation accuracy of different deconvolution approaches for simulations with correlated random errors (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimates fit the straight line $\hat{\beta} = \beta$.

[Figure B.6 panels: estimated versus true coefficients at 5%, 10%, 20%, and 30% outliers. FARDEEP attains R² ≥ 0.97 and R = 0.99 at every outlier level; CIBERSORT declines from R² = 0.96 at 5% outliers to 0.60 at 30%, while NNLS, DCQ, and PERT have negative R² throughout.]

APPENDIX C

Supplementary text for Chapter 4

Transformation between kernels and motifs

We introduced the approach for converting between convolution kernel weights and position weight matrices (PWMs) in Section 4.1.6. Here, we use the known transcription factor binding motif (TFBM) database of Arabidopsis thaliana to validate this method: we compare each known PWM with its converted counterpart using sequence logo figures. We ran all 865 TFBMs and randomly display ten of them in Figure C.1. The results indicate that our transformation approach is valid.

Figure C.1: Sequence logo figures validating the transformation between convolution kernels and motifs. (A) original TFBMs, (B) converted TFBMs.
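For orientation, a minimal sketch of one common kernel-to-PWM scheme is given below. The exact conversion is the one defined in Section 4.1.6 (not reproduced here); the softmax normalization and the information-content formula used for logo heights are assumptions of this sketch, not necessarily the formulas used in the dissertation.

```python
import numpy as np

def kernel_to_pwm(kernel: np.ndarray) -> np.ndarray:
    """Convert a convolution kernel (4 x width; rows = A, C, G, T) to a PWM
    by treating each column as unnormalized log-probabilities (softmax)."""
    w = kernel - kernel.max(axis=0, keepdims=True)  # stabilize the exponentials
    p = np.exp(w)
    return p / p.sum(axis=0, keepdims=True)

def logo_heights(pwm: np.ndarray) -> np.ndarray:
    """Per-position information content in bits, i.e. the column heights of
    a sequence logo (maximum 2 bits for the 4-letter DNA alphabet)."""
    eps = 1e-9
    entropy = -(pwm * np.log2(pwm + eps)).sum(axis=0)
    return 2.0 - entropy

# Quick check on a random kernel of width 36, the logo width in Figure C.1.
kernel = np.random.default_rng(2).normal(size=(4, 36))
pwm = kernel_to_pwm(kernel)
assert np.allclose(pwm.sum(axis=0), 1.0)  # every column is a valid probability vector
print(logo_heights(pwm).round(2))
```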
[Figure C.1 (cont'd): paired sequence logos, 36 positions wide with column heights in bits; in each pair, (A) is the original TFBM and (B) its converted counterpart.]

Most important motifs for each stress condition

For each profiled stress condition and time point, the model identifies the three most important TFBMs (Top1, Top2, Top3). The conditions comprise heat (Heat_0.25 through Heat_3.9, eight time points), the pathogen treatments DC3000, HrcC, Psph, and avrRpm1 (2, 6, and 24 h) and P. infestans (6, 12, and 24 h), the elicitors Flg22, GST.NPP1, and HrpZ (1 and 4 h), cold at 4°C (3, 6, 12, and 24 h), drought (3 and 6 h), genotoxic stress (6 and 12 h), osmotic stress (1, 3, 6, 12, and 24 h), salt (3, 6, 12, and 24 h), UV-B (0.25, 1, 3, 6, 12, and 24 h), and wounding (0.25, 0.5, 1, 3, and 12 h). The motifs appearing most frequently among the top three are C2H2_tnt.At5g66730_col_m1.pwm and the WRKY motifs WRKY8, WRKY11, WRKY29, and WRKY55 (chiefly for the pathogen, elicitor, and UV-B conditions), followed by MYB-related motifs (TRP1, RVE1, At4g01280, At5g08520), BZR_tnt.At4g18890_col_a_m1.pwm, S1Falike_tnt.AT3G09735_col_a_m1.pwm, GATA motifs (C2C2gata_tnt.GATA15_col_b_m1.pwm, GATA_GATA3.pwm), bZIP motifs (bZIP16, bZIP48, bZIP68), bHLH motifs (MYC4, BIM2, BHLH121), C2H2 motifs (JKD, At5g04390, C2H2-ZF_AT3G53600), AP2/EREBP motifs (At1g22810, ERF3, ERF11, SHN3, CRF3), MADS-box_43709.pwm, HB_tnt.HDG7_col_a_m1.pwm, G2like_tnt.KAN2_col_a_m1.pwm, Trihelix_tnt.AT5G05550_col_a_m1.pwm, C2C2dof_tnt.OBP3_col_a_m1.pwm, MYB_tnt.MYB73_col_a_m1.pwm, Myb-SANT_MYB15.pwm, and NAC_tnt.ANAC042_col_a_m1.pwm.
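As one illustration of how such per-condition rankings can be computed, the sketch below scores each first-layer kernel by occlusion: zero out one kernel's feature map and measure the drop in the prediction for each stress output. This is a generic attribution recipe on a toy stand-in model; the architecture, sizes, and scoring rule here are assumptions for illustration, not the interpretation method actually used in Chapter 4.

```python
import torch
import torch.nn as nn

# Toy stand-in: one conv layer over one-hot DNA, global max pooling, and a
# linear head with one output per stress condition. All sizes illustrative.
class TinyDNAModel(nn.Module):
    def __init__(self, n_kernels=32, width=36, n_conditions=37):
        super().__init__()
        self.conv = nn.Conv1d(4, n_kernels, kernel_size=width)  # valid convolution
        self.head = nn.Linear(n_kernels, n_conditions)

    def forward(self, x, drop_kernel=None):
        h = torch.relu(self.conv(x))            # (batch, n_kernels, positions)
        if drop_kernel is not None:
            h = h.clone()
            h[:, drop_kernel, :] = 0.0          # occlude one kernel's feature map
        return self.head(h.max(dim=2).values)   # global max pool, per-condition scores

model = TinyDNAModel().eval()
x = torch.randn(8, 4, 1000)                     # batch of (one-hot-like) sequences

with torch.no_grad():
    base = model(x)
    # Importance of kernel k for condition c = mean drop in the prediction
    # for c when kernel k is occluded.
    importance = torch.stack(
        [(base - model(x, drop_kernel=k)).mean(dim=0) for k in range(32)]
    )                                           # (n_kernels, n_conditions)

top3 = importance.topk(3, dim=0).indices        # top-3 kernels per condition
```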