STATISTICAL AND COMPUTATIONAL METHODS FOR BIOLOGICAL DATA

By

Yuning Hao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics — Doctor of Philosophy
Computational Mathematics, Science and Engineering — Dual Major

2019

ABSTRACT

STATISTICAL AND COMPUTATIONAL METHODS FOR BIOLOGICAL DATA

By

Yuning Hao

This dissertation develops machine learning and statistical methods for biological data. In immunotherapy, gene-expression deconvolution is used to quantify different types of cells in a mixed population. It provides a highly promising way to rapidly characterize the tumor-infiltrating immune landscape and identify cold cancers. A major challenge, however, is that gene-expression data are frequently contaminated by many outliers that decrease estimation accuracy. It is therefore imperative to develop a robust deconvolution method that automatically decontaminates the data by reliably detecting and removing outliers. We developed an algorithm, the adaptive Least Trimmed Square (aLTS), that identifies outliers in regression models, allowing us to effectively detect and remove them and to obtain robust estimates of the coefficients. We also include theoretical results that guarantee its convergence and parameter recovery.

A second topic is the association of phenotype responses with intricate patterns in the transcription factor binding sites of DNA sequences. To address this question, we developed a deep learning framework. It uses convolution and pooling layers to capture regulatory motifs, and position embedding together with multi-head self-attention layers to model long-range dependencies among motifs. We further improved the model's overall performance by integrating transfer learning and multi-task learning. Finally, to identify confirmed and novel transcription factor binding motifs (TFBMs), along with their internal relationships, we provide interpretations of our DNA quantification model.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor and committee chair, Professor Yuying Xie. He encouraged me and led me into the area of statistical machine learning. This dissertation would not have been possible without his excellent guidance and continuous support during my PhD research. I would also like to thank my committee members, Professor Yuehua Cui, Professor Grace Hong, and Professor Ming Yan, for their insightful comments and feedback on my research. Thanks to my dear friends, Hongnan Wang, Ningyu Sha, Jieqian He, Binbin Huang, Qi Liang, and Minzhi Liu, for their constant help and company; I greatly value their friendship, and they are treasures I will keep for my whole life. Most importantly, I would like to thank my parents, Jianhua Hao and Peilan Ning, for loving me so much and supporting me at every stage of my life. Many thanks to my boyfriend Yang and my little friend Lucky for always standing by my side and being my constant source of strength.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Overview
Chapter 2  Adaptive Least Trimmed Square
  2.1 Introduction
  2.2 Model formulation
  2.3 Convergence property and recovery guarantee
  2.4 Parameter tuning for ALTS
  2.5 Simulation study
  2.6 Discussion
Chapter 3  Fast and Robust Deconvolution of Tumor Infiltrating Lymphocyte from Expression Profiles
  3.1 Introduction
  3.2 Materials and methods
  3.3 Evaluation metrics
  3.4 Simulation study
    3.4.1 In silico simulation with varied error types
    3.4.2 In silico simulation based on leukocyte gene signature matrix file
    3.4.3 Synthetic dataset
    3.4.4 Synthetic dataset with added unknown content
  3.5 Real-data application
    3.5.1 Deconvolution performance on immune-cell-rich datasets
    3.5.2 Deconvolution performance on RNA-seq datasets
  3.6 Survival analysis
    3.6.1 Lung Squamous Cell Carcinoma and Ovarian Serous Cystadenocarcinoma datasets
  3.7 Discussion
Chapter 4  Deep learning framework on DNA quantification
  4.1 Materials and Methods
    4.1.1 Features and data
    4.1.2 The DeepCAT model design
    4.1.3 Training of the model
    4.1.4 Transfer learning
    4.1.5 Clustered multi-task learning
    4.1.6 Interpreting the model
  4.2 Results
    4.2.1 Random initialization for convolution kernel weight
    4.2.2 Transfer learning with human data
    4.2.3 Transfer learning with known motif database
    4.2.4 Performance of clustered multi-task learning
    4.2.5 Transcription factor binding motifs and their interactions
Chapter 5  Conclusion and Future work
APPENDICES
  APPENDIX A  Supplementary text for Chapter 2
  APPENDIX B  Supplementary text for Chapter 3
  APPENDIX C  Supplementary text for Chapter 4
BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Tuned k for ALTS with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

Table 2.2: Parameters of ALTS. To show that ALTS is not sensitive to different values of α1 and α2 within an appropriate range given a tuned value of k, we simulated a dataset with sample size n = 500, p = 20 predictors, normally distributed errors, and 20% outliers using the setting of Section 2.5. We then ran ALTS as follows and recorded the number of detected outliers and the true and false positive rates: (1) take α2 = 1.1, change α1 from 0.1 to 0.5 by 0.05, and tune k using BIC*; (2) take α1 = 0.1, change α2 from 1.1 to 2 by 0.1, and tune k using BIC*; (3) take α1 = 0.1, α2 = 1.5, and change k from 1 to 10 by 0.1. The accuracy of the result stays stable with a well-tuned k.

Table 3.1: Performance of the FARDEEP, CIBERSORT, NNLS, PERT, and DCQ methods under different simulation settings with outliers.

Table 3.2: The outlier genes detected by FARDEEP and their average expression values in 7 ovarian cancer cell lines for the TCGA OV dataset. Listed are all genes with a removal frequency above 50% among the 514 samples.

Table B.1: Results from CIBERSORT and FARDEEP on two mixtures (MIX1 and MIX2) with HCT116 and MDA-MB-435, respectively. FARDEEP (relative prop) was obtained by normalizing the default outputs to sum to 1.

Table B.2: Tuned k for FARDEEP with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

Table B.3: Correlated errors simulated with different proportions of outliers for FARDEEP, with true and false positive rates computed to evaluate the tuning result.

LIST OF FIGURES

Figure 2.1: Comparing the performance of ALTS and LTS. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) with both normally distributed and heavy-tailed random errors. The two approaches are compared by the sum of squared errors of their estimated coefficients.

Figure 3.1: SSE of coefficients for different approaches. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure 3.2: Estimation accuracy of different deconvolution approaches (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β. (A) Random error with standard normal distribution; (B) random error with t-distribution.
Figure 3.3: Applying different deconvolution approaches to the gene expression data of IM-9, Jurkat, Raji, and THP-1 and to mixtures of these four immune cell lines with known proportions (MixA, MixB, MixC, MixD). All mixtures were performed and measured in triplicate. (A) SSE of coefficients for FARDEEP, CIBERSORT, NNLS, PERT, and DCQ. (B) Abundance of cell lines estimated by the different deconvolution approaches vs. the abundance of cell lines truly mixed; the R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. Correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.4: Performance comparison between CIBERSORT and FARDEEP on mixtures with unknown content. (A) SSE for various noise levels and tumor-content abundances on 4 different mixtures. (B) Abundance of cell lines estimated by CIBERSORT and FARDEEP vs. the abundance of cell lines truly mixed; the R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. Correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.5: Performance assessment on Microarray data with SSE, R², and R. The follicular lymphoma dataset is evaluated for (A) overall performance and (B) individual cell subsets. (C) Normal tonsil dataset with purified B cells and T cells. Blood samples from pediatric renal transplant patients are evaluated for (D) overall performance and (E) individual cell subsets. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.6: Gene-expression deconvolution performance of FARDEEP and CIBERSORT on RNA-seq data. (A) Overall and (B) individual cell subset results for 8 PBMC samples collected from two vaccinated donors at different time points. (C) Overall and (D) individual cell subset results for lymph node bulk samples. The correlation coefficient R and the corresponding p-value are missing for CIBERSORT on NK cells because the CIBERSORT estimates are all zero. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.7: Kaplan-Meier survival curves based on ESTIMATE-, FARDEEP-, and CIBERSORT-assisted TIL profiling. The log-rank test was applied to data classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subsets (CIBERSORT and FARDEEP) were above or below the median. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.8: Kaplan-Meier survival curves based on CIBERSORT (absolute mode)-assisted TIL profiling. Patients were classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subsets were above or below the median; the log-rank test was applied to obtain the p-value. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.
Figure 3.9: FARDEEP scores for the TCGA OV and LUSC datasets, calculated as the summation of the 22 TIL subset scores, show highly negative correlations with the consensus measurement of purity estimations (CPE).

Figure 4.1: Model architecture. The convolution layer acts as a motif scanner and produces an output of dimension (sequence length − kernel size + 1) × (number of kernels). The max-pooling layer reduces the output dimension by extracting the maximum value within each pooling kernel. The special token and position encoding matrix are then added to the output. Two Transformer encoder layers consider the interactions between the tokens of the output sequence from the previous layers and re-interpret it. Finally, a sigmoid nonlinear transformation produces the prediction scores.

Figure 4.2: Performance comparison among DeepCAT, DanQ, SVM, and Random Forest through (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses.

Figure 4.3: The effect of applying pre-trained weights from human data to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The line y = x represents equality between randomly initialized DeepCAT and DeepCAT with transfer learning based on human data.

Figure 4.4: The effect of applying known TFBMs to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The line y = x represents equality between randomly initialized DeepCAT and DeepCAT with transfer learning based on the known TFBM database.

Figure 4.5: Hierarchical clustering with Jaccard's distance for 57 stress responses.

Figure 4.6: Performance comparison among DanQ, SVM, Random Forest, DeepCAT, DeepCAT with known TFBMs, and multi-task DeepCAT with known TFBMs through (A) ROC-AUC scores and (B) PR-AUC scores for 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses.

Figure 4.7: Three examples of TFBMs aligned by the TOMTOM algorithm. The upper figures are the known TFBMs and the lower ones are generated from convolution kernel weights trained by DeepCAT. The p-values are 7.46e−11, 1.22e−13, and 3.27e−11, respectively.

Figure 4.8: Three examples of identified novel TFBMs. The upper figures come from the initialized random motifs and the lower ones are generated from convolution kernel weights trained by DeepCAT.

Figure B.1: Effect of Z-score normalization. The red dashed line indicates the true absolute abundance of each cell. SVR is common support vector regression without normalization, and SVR.znormalization is the method used in CIBERSORT.
Figure B.2: The y-axis shows the sorted expression values of the HCT116 and MDA-MB-435 cell lines for 584 probesets. Probes detected as outliers by FARDEEP are marked in red.

Figure B.3: SSE of coefficients for different approaches based on simulations with X-related outliers. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure B.4: Estimation accuracy of different deconvolution approaches for simulations with X-related outliers (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β. (A) Random error with standard normal distribution; (B) random error with t-distribution.

Figure B.5: SSE of coefficients for different approaches based on simulations with correlated random errors. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP.

Figure B.6: Estimation accuracy of different deconvolution approaches for simulations with correlated random errors (values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimators fit the line β̂ = β.

Figure C.1: Sequence logo figures validating the transformation between convolution kernels and motifs. (A) Original TFBMs; (B) converted TFBMs.

Chapter 1

Overview

The analysis of biological data relies heavily on machine learning and statistical methods. In immunotherapy, gene-expression deconvolution is extensively used to quantify the different types of cells in a mixed population. Rapidly emerging evidence suggests that the tumor immune microenvironment not only predisposes cancer patients to diverse treatment outcomes but also represents a promising source of biomarkers for better patient stratification. Unlike the immunohistochemistry-based scoring practice, which focuses on a few selected marker proteins, immune deconvolution pipelines offer a previously untapped means to comprehensively reveal the tumor-infiltrating immune landscape. However, outliers frequently contaminate gene expression data and decrease the accuracy of the estimates. These outliers are genes with abnormal expression values, which may be caused by measurement error, environmental effects, expression from non-immune cells, or natural fluctuations in certain types of immune cells. Thus, to automatically detect and remove such outliers, we developed a robust deconvolution method, the adaptive Least Trimmed Square (aLTS), for data decontamination.
This addresses a key shortcoming of the classical Least Trimmed Square, which relies on a roughly pre-estimated corruption ratio for outlier elimination: although that choice yields a high breakdown point for the regression model, it may cause a significant loss of data. Our theoretical results establish the convergence property and a parameter recovery guarantee.

Applying aLTS to gene-expression deconvolution involves three modifications. First, because cell abundances are always non-negative, we replace ordinary least squares (OLS) with non-negative least squares (NNLS). Second, we solve the NNLS problem with the Lawson-Hanson algorithm (Lawson and Hanson, 1995). Third, we alter the tuning (penalty) parameters to make the method appropriate for log-normally distributed errors. We named the resulting machine learning tool for gene-expression deconvolution FARDEEP (Fast And Robust DEconvolution of Expression Profiles); it itemizes the immune cell subsets present in whole tumor tissue samples. In numerical simulations and real datasets, FARDEEP provides better approximations of the coefficients than current methods and is less vulnerable to outliers. In addition to relative percentages, the estimates produced by FARDEEP reflect the absolute quantity of each immune cell subset. FARDEEP thus serves as a novel robust algorithm that supplements the current toolkit for characterizing the tissue-infiltrating immune cell landscape.

Another area we tackled is the identification of complex patterns in the transcription factor binding sites of DNA sequences, together with the investigation of their association with phenotype responses. Here, the emphasis of our study is to predict gene expression levels in plants under different stress environments using information extracted from transcription factor binding motifs (TFBMs). In contrast to traditional machine learning approaches, which require manual feature engineering, we developed a deep learning model, DeepCAT, that generates predictions of gene expression directly from raw DNA sequence information. Moreover, in addition to motif scanning with convolution and pooling layers, DeepCAT takes the long-range interactions between motifs into account using multi-head attention layers. In this way, our model can be widely applied to learn phenotype-related motifs, along with their interactions, regardless of sequence length.

Chapter 2

Adaptive Least Trimmed Square

2.1 Introduction

Robust regression and outlier detection methods aim to obtain reliable information from contaminated data, since outliers are pervasive in most datasets: Hampel et al. (2005) suggest that a routine dataset usually contains 1% to 10% outliers. Outliers are abnormal points that come from other populations whose distributions differ from that of the original dataset, and they can seriously influence the results of a study. For example, ordinary least squares (OLS), the most common statistical modeling approach, is very sensitive to outliers: it can be broken down by a very small group of abnormal points.
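To make this sensitivity concrete, the following toy sketch (ours, for illustration only; all variable names are hypothetical) fits OLS to a clean sample and to the same sample with a single corrupted response:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)            # true slope is 2

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

y_bad = y.copy()
y_bad[0] = 1e4                               # corrupt a single response
beta_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)

print(beta_clean)  # intercept near 0, slope near 2
print(beta_bad)    # both estimates badly distorted by one outlier
```

A single corrupted observation out of one hundred is enough to move the fitted line far from the truth, which motivates trimming-based estimators such as the one discussed next.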
To address this problem, Rousseeuw (1984) and Rousseeuw and Leroy (1987) proposed the Least Trimmed Square (LTS) framework, which finds the h observations with the smallest residuals and estimates the coefficients on that subset of the data,

  \hat{\beta} = \arg\min_{\beta} \sum_{i \in S} (y_i - x_i \beta)^2,

where S indexes the h observations with the smallest squared residuals. However, LTS is known to be an NP-hard problem (Bernholt, 2006), and many studies have reformulated it and proposed effective solution methods (Rousseeuw and Driessen, 2006; Nguyen and Welsch, 2010; Alfons et al., 2013; Shen et al., 2013; Bhatia et al., 2015; Shen and Sanghavi, 2019). Nevertheless, most of these approaches require a fixed ratio of outliers that must be roughly estimated at the outset. This can give a suboptimal fit or an inaccurate estimate of the outliers: using the information in only a fixed proportion of the data reduces the power of the estimator, because the true amount of contamination cannot be presumed and may be either small or large. On the other hand, several studies used soft thresholding to obtain parameter recoverability from densely corrupted observations (Wright and Ma, 2009; She and Owen, 2011; Nguyen and Tran, 2011). However, some of these methods lack theoretical guarantees, and others impose severe restrictions on the data.

To improve on the above approaches, we propose in this study an Adaptive Least Trimmed Square (ALTS) that automatically adjusts the estimated number of outliers at each iteration based on the information in the residuals. It both identifies outliers and produces robust estimates of the coefficients, and we provide theoretical guarantees for its convergence and parameter recovery properties.

2.2 Model formulation

We consider the linear regression model

  y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),

where X ∈ R^{n×p} is the fixed predictor matrix, y ∈ R^n is the observed response vector, β ∈ R^p is the coefficient vector, and ε ∈ R^n is a vector of random errors with zero mean and variance σ²I. To incorporate outliers, we propose the following model:

  y = X\beta + \gamma + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),  (2.1)

where γ = (γ_1, ..., γ_n)′ is a sparse vector in R^n, with γ_i ≠ 0 indicating that the ith observation is an outlier. Under formulation (2.1), we want to estimate β with all outliers removed, so that

  \hat{\beta} = (X_{S^*}^\top X_{S^*})^{-1} X_{S^*}^\top y_{S^*},

the OLS solution computed on the subset S*, where S* is the set of good points, i.e., those with zero entries in γ. Let H = X (X_{S^*}^\top X_{S^*})^{-1} X_{S^*}^\top be the projection matrix; the residuals r = (r_1, ..., r_n) can then be written as

  r = y - X\hat{\beta} = \gamma + \varepsilon - H\varepsilon_{S^*}.  (2.2)

From (2.2), a residual r_i whose corresponding γ_i ≠ 0 deviates from zero, which suggests that the set of outliers can be identified by thresholding:

  E = \{ i : |r_i| > k \cdot r_{\mathrm{med}} \},  (2.3)

where E is the set of detected outliers, k is a tuning parameter controlling the sensitivity of the model, and r_med is the median of {|r_i|}, i = 1, ..., n.
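As an illustration, the thresholding rule (2.3) takes only a few lines of numpy; this is a hedged sketch with hypothetical names, not a packaged implementation:

```python
import numpy as np

def detect_outliers(X, y, k):
    """Return indices flagged by rule (2.3): |r_i| > k * median(|r|)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # plain OLS fit
    r = np.abs(y - X @ beta)                       # absolute residuals
    return np.where(r > k * np.median(r))[0]
```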
Let N* denote the number of true outliers in the data, i.e., the cardinality of {i : γ_i ≠ 0}. First, we can fit least squares and apply rule (2.3) to obtain a rough estimate of the set of outliers; denote its cardinality by N̄. Since the model at this stage is still contaminated by outliers, N̄ overestimates N*, and we use it to form an underestimate N = ⌈α1 N̄⌉ with α1 ∈ (0, 1). With N in hand, we can then update the least squares fit after removing the N samples with the largest absolute residuals, obtaining an improved estimate of E and the corresponding N̄. Repeating this procedure improves the model, but we must also inflate the underestimate N by a factor α2 > 1 at each iteration to force N and N̄ to converge.

In sum, we initialize the algorithm with the OLS solution,

  \hat{\beta}^{(0)} = (X^\top X)^{-1} X^\top y, \qquad r^{(0)} = y - X\hat{\beta}^{(0)}.

For the jth iteration, we update N̄^(j) by

  \bar{N}^{(j)} = \bigl| \{ i : |r_i^{(j)}| > k \cdot r_{\mathrm{med}}^{(j)} \} \bigr|, \quad j = 0;
  \bar{N}^{(j)} = \min\bigl( \bigl| \{ i : |r_i^{(j)}| > k \cdot r_{\mathrm{med}}^{(j)} \} \bigr|,\; \bar{N}^{(j-1)} \bigr), \quad j \ge 1,  (2.4)

where the min(·,·) operator guarantees that N̄^(j), an overestimate of N*, is non-increasing. Similarly, we update N^(j) through

  N^{(j)} = \lceil \alpha_1 \bar{N}^{(j)} \rceil, \quad j = 0;
  N^{(j)} = \min\bigl( \lceil \alpha_2 N^{(j-1)} \rceil,\; \bar{N}^{(j)} \bigr), \quad j \ge 1,  (2.5)

where ⌈x⌉ denotes the ceiling of x ∈ R, α1 ∈ (0, 1) is used to obtain a lower bound for N* in the first step, α2 > 1 guarantees the monotonicity of N^(j), and the min(·,·) operator guarantees that N^(j) stays below N̄^(j). We then update β̂ and r after removing the N^(j) likely outliers:

  \hat{\beta}^{(j+1)} = (X_{S_j}^\top X_{S_j})^{-1} X_{S_j}^\top y_{S_j}, \qquad r^{(j+1)} = y - X\hat{\beta}^{(j+1)},

where X_{S_j} ∈ R^{(n−N^(j))×p} and y_{S_j} ∈ R^{n−N^(j)} are the submatrix and subvector of X and y with row indices in S_j = {π(i) : i ≤ n − N^(j)}; here π is a permutation on [n] such that |r_{π(1)}^(j)| ≤ |r_{π(2)}^(j)| ≤ ··· ≤ |r_{π(n)}^(j)|. We repeat this procedure until N and N̄ converge, and the elements of the set Ê at that point are the final estimated outliers. Hence, we hereby report a novel approach, coined Adaptive Least Trimmed Square (ALTS), that automatically detects and removes contaminating outliers.
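The complete iteration can be sketched as follows; this is a minimal numpy rendition of (2.4)-(2.5) under the default α1 = 0.1 and α2 = 1.5, with hypothetical names, and is not the reference implementation:

```python
import numpy as np
from math import ceil

def alts(X, y, k, alpha1=0.1, alpha2=1.5, max_iter=100):
    """Adaptive Least Trimmed Square: a sketch of iterations (2.4)-(2.5)."""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS initialization
    r = np.abs(y - X @ beta)
    N_bar = int(np.sum(r > k * np.median(r)))           # overestimate of #outliers
    N = ceil(alpha1 * N_bar)                            # deliberate underestimate
    for _ in range(max_iter):
        keep = np.argsort(r)[: n - N]                   # drop N largest residuals
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        r = np.abs(y - X @ beta)
        N_bar = min(int(np.sum(r > k * np.median(r))), N_bar)   # update (2.4)
        N_next = min(ceil(alpha2 * N), N_bar)                   # update (2.5)
        if N_next == N:                                 # N has met N_bar
            break
        N = N_next
    outliers = np.where(r > k * np.median(r))[0]
    return beta, outliers
```

Because α2 > 1 strictly inflates N while (2.4) keeps N̄ non-increasing, the stopping test N_next == N fires exactly when the two estimates meet.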
In the next section, we provide theoretical results on the convergence property and the parameter recovery guarantee of ALTS.

2.3 Convergence property and recovery guarantee

The proof of convergence relies on two properties of the algorithm: at each iteration, the OLS solution is optimal on the current active subset, and the points selected for the next active subset are those with the smallest residuals. We first show that the objective function decreases at each iteration.

Theorem 2.3.1. The Adaptive Least Trimmed Square (ALTS) iteration sequence satisfies L(β^(j+1)) ≤ L(β^(j)) for any j ≥ 0, with objective function

  L(\beta) = \sum_{i=1}^{n} (y_i - x_i \beta)^2 \, \mathbf{1}\{\gamma_i = 0\}.

In addition, the algorithm is fast and guaranteed to converge within finitely many steps, as summarized in the following theorem.

Theorem 2.3.2. The Adaptive Least Trimmed Square (ALTS) stops in no more than j* steps, where

  j^* = \left\lfloor \frac{-\log \alpha_1}{\log \alpha_2} \right\rfloor.

Here ⌊x⌋ denotes the largest integer less than or equal to x.

To further show that ALTS has a linear convergence rate together with a parameter recovery guarantee, we first define the subset strong convexity and smoothness constants of X, following Bhatia et al. (2015).

Definition 2.3.1. For a matrix X ∈ R^{n×p}, its subset strong convexity constant c⁻_τ and subset smoothness constant c⁺_τ at level τ satisfy

  \min_{S \in \mathcal{S}_\tau} \lambda_{\min}(X_S^\top X_S) \ge c^-_\tau, \qquad \max_{S \in \mathcal{S}_\tau} \lambda_{\max}(X_S^\top X_S) \le c^+_\tau,

where S_τ is the collection of all subsets of size ⌊nτ⌋ and 0 ≤ τ ≤ 1.

Theorem 2.3.3. Suppose the corruption rate at the jth iteration of ALTS is ρ_j, the outlier vector satisfies ‖γ‖₀ ≤ nρ*, and ρ₀ > ρ*. Let c⁻_τ and c⁺_τ be the subset strong convexity and smoothness constants of X at level τ, and assume that

  \eta = 4 \sqrt{ c^+_{\rho_0} / c^-_{1-\rho_0} } < 1.

Then if log_{α2}(1/α1) = O( log_η ( ϵ√n / ‖γ_{S₀}‖₂ ) ), we have

  \| \beta^* - \hat{\beta}_{\mathrm{ALTS}} \|_2 \le \epsilon + \frac{C}{\sqrt{n}} \|\varepsilon\|_2,

for any tolerance ϵ > 0 and some constant C.

Corollary 2.3.1. Suppose the assumptions of Theorem 2.3.3 hold and the random noise is Gaussian, i.e., ε ~ N(0, σ²I). Then when n > log(1/δ) for some δ > 0, with probability at least 1 − δ,

  \| \beta^* - \hat{\beta}_{\mathrm{ALTS}} \|_2 \le \epsilon + C\sigma,

for ϵ > 0 and some constant C.

Proofs of the above theorems and corollary, which follow Bhatia et al. (2015), are provided in Appendix A. These theoretical properties show that three parameters influence the performance of ALTS; we discuss their tuning in the next section.

2.4 Parameter tuning for ALTS

There are three tuning parameters in ALTS: k, α1, and α2. As introduced in the last section, relatively small values of α1 and α2 are preferred for accurate estimation. In practice, ALTS is not very sensitive to α1 and α2 when they are taken from an appropriate range, so we set them to 0.1 and 1.5 by default. However, k controls the number of detected outliers and is critical to the performance of ALTS; we therefore tune k on a case-by-case basis for each dataset to preserve meaningful information. The effects of the different tuning parameters are shown in Table 2.2. Because a held-out test fold may itself contain outliers that distort the tuning result, cross-validation is not advised. Instead, we apply a modified Bayesian Information Criterion (BIC), following the setting of She and Owen (2011):

  \mathrm{BIC}^*(k) = m \log\!\left( \frac{ \sum_{i=1}^{n} \mathbf{1}\{i \notin \hat{E}\} (y_i - \hat{y}_i)^2 }{ m } \right) + b(\log m + 1),  (2.6)

where Ê is the set of estimated outliers, b = N̂ + p + 1 is the number of parameters with N̂ = |Ê| the number of detected outliers, and m = n − N̂. We then choose the value of k associated with the smallest BIC*.

2.5 Simulation study

To test the performance of ALTS, we simulated several datasets with different proportions of outliers, following the settings of She and Owen (2011) and Alfons et al. (2013). The observations were generated from the linear regression model (2.1). The predictor matrix is X = (x_1, ..., x_n)′ = UΣ^{1/2}, where U_ij ~ U(0, 20) and Σ_ij = e^{1{i≠j}} with e = 0.5. Considering outlier proportions f ∈ {5%, 10%, 20%, 30%}, sample size n = 500, and p = 20 predictors, we randomly added outliers to the simulated data as follows:

i) Vertical outliers: we generated an n-dimensional zero vector τ and randomly selected nf of its elements to be outliers drawn from a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of 30.

ii) Leverage points: we took 20% of the contaminated observations in i) as leverage points, replacing the corresponding predictors with samples from N(2 max(X), 1).

The coefficients β_j were sampled from U(0, 1) for j = 1, ..., p, and we simulated both normally distributed and heavy-tailed random noise ε. Based on this framework, the dependent variable was obtained as

  y = X\beta + \tau + \varepsilon.
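This data-generating process can be reproduced with a short script. The sketch below uses a Cholesky factor as one choice of the matrix square root Σ^{1/2} and scipy's non-central t-distribution; it is an assumption-laden illustration, not the exact code behind the reported results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, f, e = 500, 20, 0.20, 0.5

Sigma = np.full((p, p), e)
np.fill_diagonal(Sigma, 1.0)                    # Sigma_ij = e^{1{i != j}}, e = 0.5
U = rng.uniform(0, 20, (n, p))
X = U @ np.linalg.cholesky(Sigma).T             # one choice of square root

beta = rng.uniform(0, 1, p)
tau = np.zeros(n)
idx = rng.choice(n, int(n * f), replace=False)
tau[idx] = stats.nct.rvs(1, 30, size=idx.size, random_state=2)  # vertical outliers

x_max = X.max()                                 # computed before contamination
lev = rng.choice(idx, int(0.2 * idx.size), replace=False)       # leverage points
X[lev] = rng.normal(2 * x_max, 1.0, (lev.size, p))

eps = rng.normal(0, 1, n)                       # or a heavy-tailed alternative
y = X @ beta + tau + eps
```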
We simulated each setting 50 times and compared our estimation results with the classical Least Trimmed Square. Figure 2.1 shows that ALTS outperforms LTS in most cases, since the corruption rate is adjusted automatically by the algorithm for each dataset. For the ALTS results, the parameter k was tuned by the modified BIC in (2.6); we find that k decreases as the amount of outliers grows, while the true positive rates always stay around 1, indicating that the tuning of k is highly effective (Table 2.1).

Table 2.1: Tuned k for ALTS with the adjusted BIC. We simulated normally distributed and heavy-tailed errors for different proportions of outliers and computed the true and false positive rates to evaluate the tuning result.

  Percentage of outliers        5%      10%     20%     30%
  Normal
    True positive rate          1       1       1       1
    False positive rate         0.005   0.009   0.02    0.05
    Parameter (mean of k)       3.63    2.43    1.46    1.19
  Heavy-tailed
    True positive rate          1       1       1       1
    False positive rate         0.05    0.05    0.08    0.11
    Parameter (mean of k)       3.03    2.06    1.35    1.14

Figure 2.1: Comparing the performance of ALTS and LTS. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) with both normally distributed and heavy-tailed random errors. The two approaches are compared by the sum of squared errors of their estimated coefficients.

2.6 Discussion

Least Trimmed Square is a highly robust regression method, and Visek (2006a,b) showed that its solution is guaranteed to be consistent. However, it is computationally very expensive, especially for large datasets, and in practice polynomial-time algorithms are preferred. To the best of our knowledge, most efficient algorithms based on the idea of LTS are hard to prove statistically consistent; the same holds for ALTS, although we established its parameter recovery guarantee in the previous section. Additionally, although a high breakdown point improves the regressor's tolerance to bad observations, it may cause a loss of information when the data are not heavily contaminated. A fixed outlier ratio is unhelpful for outlier detection across different datasets: a roughly estimated ratio may inflate either the false positive rate or the false negative rate. For example, in gene-expression deconvolution, since the signature matrices are measured in a pure environment, normal tissue usually contains far fewer outliers than tumor tissue, and biologically meaningful genes would be removed if the outlier ratio were set the same for both cases. We therefore proposed ALTS, which is computationally efficient and automatically detects and removes outliers before estimating the coefficients. It can be applied in many areas with different types of data; we show a further application to cancer data in the next chapter.
Table 2.2: Parameters of ALTS. To show that ALTS is not sensitive to different values of α1 and α2 within an appropriate range given a tuned value of k, we simulated a dataset with sample size n = 500, p = 20 predictors, normally distributed errors, and 20% outliers using the setting of Section 2.5. We then ran ALTS as follows and recorded the number of detected outliers and the true and false positive rates: (1) take α2 = 1.1, change α1 from 0.1 to 0.5 by 0.05, and tune k using BIC*; (2) take α1 = 0.1, change α2 from 1.1 to 2 by 0.1, and tune k using BIC*; (3) take α1 = 0.1, α2 = 1.5, and change k from 1 to 10 by 0.1 (the tuned value by BIC* is k = 1.8). The accuracy of the result stays stable with a well-tuned k.

(1) α1 from 0.1 to 0.5 by 0.05 (α2 = 1.1):
  α1                    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50
  Number of outliers    101     102     100     100     100     101     101     103     102
  True positive rate    1       1       1       1       1       1       1       1       1
  False positive rate   0.0025  0.0050  0.0000  0.0000  0.0000  0.0025  0.0025  0.0075  0.0050

(2) α2 from 1.1 to 2.0 by 0.1 (α1 = 0.1):
  α2                    1.1     1.2     1.3     1.4     1.5     1.6     1.7     1.8     1.9     2.0
  Number of outliers    107     101     101     101     101     107     101     101     101     107
  True positive rate    1       1       1       1       1       1       1       1       1       1
  False positive rate   0.0175  0.0025  0.0025  0.0025  0.0025  0.0175  0.0025  0.0025  0.0025  0.0175

(3) k from 1 to 10 by 0.1 (α1 = 0.1, α2 = 1.5); the columns up to the tuned value k = 1.8 are shown:
  k                     1.0     1.1     1.2     1.3     1.4     1.5     1.6     1.7     1.8
  Number of outliers    249     220     199     176     156     136     115     107     101
  True positive rate    1       1       1       1       1       1       1       1       1
  False positive rate   0.3725  0.3000  0.2475  0.1900  0.1400  0.0900  0.0375  0.0175  0.0025

For k > 1.8 the false positive rate is identically 0, while the number of detected outliers, and hence the true positive rate, declines gradually (from 95 detections at k = 1.9 down to a single detection near k = 10).

Chapter 3

Fast and Robust Deconvolution of Tumor Infiltrating Lymphocyte from Expression Profiles

3.1 Introduction

Immune checkpoint blockade has revolutionized the rational design of neoadjuvant cancer therapies. Compelling evidence suggests that a favorable tumor immune microenvironment underpins better clinical responses to radiotherapy, chemotherapy, and immunotherapy (Deng et al., 2014; Nagarsheth et al., 2017; Corrales et al., 2016a). Immunohistochemistry (IHC)-based immunoscores, which quantify the number of CD8+ cytotoxic T lymphocytes and CD45RO+ memory T cells, show better prognostic potential than conventional pathological methods in colon cancer patients (Galon et al., 2006; Mlecnik et al., 2016). Hence, harnessing the composition of intra-tumoral immune cell infiltration is a highly promising approach to stratify tumors (Balermpas et al., 2014, 2016; Nguyen et al., 2016; Pages et al., 2009; Wolf et al., 2015; Lei et al., 2016). The current IHC immunoscoring approach has two limitations. First, the interpretation of immune cell subsets varies among pathologists and institutions, so the scoring practice lacks a consistent standard. Second, only a limited number of biomarkers can be assessed simultaneously, which prevents a comprehensive annotation of the immune contexture in the tumor microenvironment (TME). Hence, robust methods for genome-data-informed cell type quantitation are urgently needed.

Immunogenomics presents an unprecedented opportunity to resolve the intra-tumoral immune landscape. Cell type deconvolution using leukocyte signature gene expression profiling is a highly promising approach to estimate the global immune cell composition from whole tumor gene expression data (Newman et al., 2015; Aran et al., 2017; Finotello et al., 2017; Tosolini et al., 2016, 2017; Vallania et al., 2018).
However, a significant technical obstacle is that the efficacy and accuracy of gene expression deconvolution are limited by the large number of outliers frequently observed in tumor gene expression datasets (Jiang et al., 2004). The first step towards enhancing gene deconvolution algorithms overall is to improve methods for outlier identification and processing. These outliers include genes with abnormal expression values, which may be caused by measurement error, environmental effects, expression from non-immune cells, or natural fluctuations in certain types of immune cells. Notably, the current immune deconvolution gene signature matrix relies on profiling genes that are differentially expressed among the immune subsets; frequent contamination by transcript reads from cancer cells may therefore significantly bias the algorithms.

In this study, we report a novel FAst and Robust DEconvolution of Expression Profiles (FARDEEP) method that significantly improves the estimation of the coefficients. Let y_i be the observed expression value of the ith gene; x_i, a p-dimensional vector, the expected expression of the ith gene in each of the p cell types; and X = [x_1, ..., x_n]′ the signature matrix. The gene-expression deconvolution problem can then be formulated as

  y_i = x_i' \beta + \varepsilon_i,  (3.1)

where β ∈ R^p is an unknown parameter vector corresponding to the compositions of the p cell types, and ε_i is a noise term with mean 0.
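For reference, the non-negative baseline fit for (3.1) is available directly in scipy; a minimal sketch with hypothetical variable names:

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve_nnls(X, y):
    """Fit y ~ X @ beta with beta >= 0 via the Lawson-Hanson algorithm."""
    beta, _residual_norm = nnls(X, y)
    return beta

# X: n_genes x p signature matrix; y: mixture expression vector
# beta_hat = deconvolve_nnls(X, y)
```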
Based on 17 the CIBERSORT framework, several extensions have been proposed to overcome limitations such as platform inconsistency between signature and mixture matrices and low estimation accuracy for γδ T cell (Tosolini et al., 2017; Vallania et al., 2018; Tosolini et al., 2016). However, the quantitative information of cell proportions of these two approaches is built on CIBERSORT whose performance may be challenged by frequent outliers in whole tumor tissue transcriptomes. To reduce the dependence on the signature matrix, xCell utilizes the concept of single-sample gene set enrichment analysis (ssGSEA) to calculate an immune cell score which could predict the enrichment of immune cells (Aran et al., 2017). Despite its robustness, xCell relies much on the ranking of gene expression value which leads to subop- timal solution for the estimation accuracy. Overall, a robust method that determines both the distribution and absolute volume of tumor-infiltrating lymphocytes (TILs) will further improve the current gene deconvolution pipeline. To handle the heavily contaminated gene expression data and provide absolute cell abun- dance estimation, we developed a robust method based on the Least Trimmed Square (LTS) framework (Rousseeuw, 1984; Rousseeuw and Leroy, 1987). LTS finds h observations with smallest residuals, and the estimator ˆβ is the least squares fit over these h observations. LTS is an NP-hard problem, and Rousseeuw and Driessen (Rousseeuw and Driessen, 2006) proposed a stochastic FAST-LTS algorithm. Nevertheless, it may give a suboptimal fitting result and get much slower when the sample size and dimension of variables become larger and higher since its accuracy relies on the initial random h-subsets and the number of initial subsets. When n is the sample size and p is the number of coefficients, h is suggested to be the smallest integer that is not less than (n + p + 1)/2 to remove as many outliers as possible while keeping an unbias result. Using the information of only half of the data reduces the power of the estimator because the amount of outliers in the real case cannot be presumed 18 and can be small. Xu et al. (Xu et al., 2017) proposed an adaptive least trimmed square which is not limited to the randomness of initial subset but only applied the binary dataset. In this study, we extend the adaptive least trimmed square to introduce a model-free method, which could find the number of outliers automatically based on LTS. FARDEEP provides a flexible framework which is suitable for both Microarray and RNA-seq data using LM22 and Immunostate signature matrices respectively. As evidence of high fidelity and robustness, FARDEEP exhibits superior performance in simulated and real-world datasets. 3.2 Materials and methods Because the abundance of cell types are always non-negative, we replaced the OLS regression in the ALTS procedure with non-negative least square regression (NNLS). By applying the modified ALTS to the deconvolution model, the objective function at each iteration becomes, ˆβ = argmin β (cid:107)yS − XSβ(cid:107)2 2 , subject to β ≥ 0 (3.2) where S is the subset of genes with smallest residuals, and (3.2) can be solved using Lawson- Hanson algorithm (Lawson and Hanson, 1995). The ˆβ from FARDEEP, denoted as TIL subset score, is the direct estimate of the linear model without any normalization and hence reflects the absolute abundance of TILs. 
In addition, we can derive the relative TIL abundance from the TIL subset scores through

  \tilde{\beta}_j = \frac{ \hat{\beta}_j }{ \sum_{k=1}^{p} \hat{\beta}_k },  (3.3)

where β̂_j is the jth TIL subset score. In practice, the TIL subset score provides important information about a patient's tumor-infiltrating immune landscape, and we include a discussion in Appendix B. For the tuning parameters, we assume that the errors in gene expression datasets follow a log-normal rather than a normal distribution, as suggested by Beal (2017). The modified BIC formula from (2.6) then becomes

  \mathrm{BIC}^*(k) = m \log\!\left( \frac{ \sum_{i=1}^{n} \mathbf{1}\{i \notin \hat{E}\} \log^2 (y_i - \hat{y}_i)^2 }{ m } \right) + b(\log m + 1).  (3.4)

With all the features above, we developed FARDEEP as a robust tool for cellular deconvolution.

3.3 Evaluation metrics

To test the performance of FARDEEP, we compared our approach with existing methods using numerical simulations and real datasets; the outlier genes detected by FARDEEP in the real datasets are listed in the supplementary table. We use the LM22 signature matrix, containing 22 types of hematopoietic cells, for Microarray data, and the quanTIseq signature matrix, containing 10 immune cell types, for RNA-seq data. To compare the performance of the different methods, we report the sum of squared errors (SSE), the coefficient of determination, denoted R-squared (R²), and the Pearson correlation (R), defined as follows:

  \mathrm{SSE} = \sum_{j=1}^{p} (\beta^*_j - \hat{\beta}_j)^2,

  R^2 = 1 - \sum_{j=1}^{p} (\beta^*_j - \hat{\beta}_j)^2 \Big/ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)^2,

  R = \frac{ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)(\hat{\beta}_j - \bar{\hat{\beta}}) }{ \sqrt{ \sum_{j=1}^{p} (\beta^*_j - \bar{\beta}^*)^2 \sum_{j=1}^{p} (\hat{\beta}_j - \bar{\hat{\beta}})^2 } },

where β̄* = (1/p) Σ_j β*_j and β̄̂ = (1/p) Σ_j β̂_j are the respective means, β* is the ground truth, and β̂ is the estimate.
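These three metrics are straightforward to compute; a small helper, in the same hedged spirit as the earlier sketches:

```python
import numpy as np

def evaluate(beta_true, beta_hat):
    """Return SSE, R-squared against the line beta_hat = beta_true, and Pearson R."""
    sse = float(np.sum((beta_true - beta_hat) ** 2))
    r2 = 1.0 - sse / float(np.sum((beta_true - beta_true.mean()) ** 2))
    r = float(np.corrcoef(beta_true, beta_hat)[0, 1])
    return sse, r2, r
```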
3.4 Simulation study

3.4.1 In silico simulation with varied error types

To test the robustness of FARDEEP under different error conditions, we simulated datasets with normally distributed and heavy-tailed errors in the same way as in Section 2.5. Here we set U_ij ~ U(0, 20), outlier proportions f ∈ {5%, 10%, 20%, 30%}, sample size n = 500, and p = 20 predictors, and added random errors and outliers to the simulated data as follows:

- Random errors: we generated the random error vector from i) the standard normal distribution and ii) a t-distribution with 3 degrees of freedom.
- Vertical outliers: we generated an n-dimensional zero vector τ and randomly selected nf of its elements to be outliers drawn from a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of 30.
- Leverage points: we took 20% of the contaminated observations as leverage points, replacing the corresponding predictors with samples from N(2 max(X), 1).

The non-negative coefficients β_j were sampled from U(0, 1), j = 1, ..., p. We replicated each setting 50 times. As shown in Figures 3.1 and 3.2, FARDEEP outperforms the other methods, as evidenced by the SSE, R², and R values. In Appendix B, we also include another outlier construction scheme with X-related outliers, as well as a simulation setting with correlated responses. In both scenarios, FARDEEP dominates the other methods in terms of SSE, R², and R.

3.4.2 In silico simulation based on leukocyte gene signature matrix file

Following a similar procedure to Newman et al. (2015), we randomly generated the abundance of each cell type from the interval [0, 1]; notably, the cell abundances need not sum to 1. The measurement errors were sampled from 2^{N(0, (0.1 log₂ s)²)}, where s is the standard deviation of the original mixtures. To incorporate outliers, we randomly selected i/50 of the data, with i = 1, 2, ..., 25, and replaced them with values drawn from 2^{N(10, (0.3 log₂ s)²)}. We repeated the procedure nine times and estimated the cell type abundances using FARDEEP, CIBERSORT (without converting to percentages), NNLS, PERT, and DCQ. As shown in Table 3.1, the SSE for FARDEEP ranges from 1.51×10⁻⁷ to 1.47×10⁻⁴, and its R² and R remain 1 regardless of the number of outliers, while the other methods show significantly larger SSE and smaller R² and R.

3.4.3 Synthetic dataset

We used the cell line dataset GSE11103, generated by Abbas et al. (2009), which contains gene expression profiles of four immune cell lines (Jurkat, IM-9, Raji, and THP-1) and four mixtures (MixA, MixB, MixC, and MixD) with various ratios of cells. Before analysis, we quantile normalized the mixture data over 54,675 probesets and downloaded the immune gene signature matrix with 584 probesets from the CIBERSORT website. We then applied five deconvolution methods (FARDEEP, CIBERSORT without converting to percentages, DCQ, NNLS, and PERT) and calculated the sum of squared errors of the estimated abundances of the four immune cell lines. FARDEEP gives the smallest SSE and the largest R², indicating the most accurate result (Figure 3.3).

3.4.4 Synthetic dataset with added unknown content

Since both CIBERSORT and FARDEEP are robust deconvolution methods and show advantages on the datasets above, we next sought to compare their performance on mixtures with unknown content. We followed the simulation setting proposed by Newman et al. (2015) and downloaded the signature gene file from the CIBERSORT website. The mixture file was constructed from the four immune cell line datasets mentioned in the previous section and a colon cancer cell line, HCT116 (average of GSM269529 and GSM269530 in GSE10650). Cancer cells were mingled into the immune cells at ratios {0%, 30%, 60%, 90%}. Noise was added by sampling from the distribution 2^{N(0, (f log₂ s)²)}, with f ∈ {0%, 30%, 60%, 90%} and s the standard deviation of the original mixtures. Applying FARDEEP and CIBERSORT (without converting to percentages) to the 64 mixtures, we found that FARDEEP retains an accurate estimation, while the tumor contents skew the results of CIBERSORT with larger deviations from the ground truth (Figure 3.4).
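To make the mixture-construction recipe of Sections 3.4.2 and 3.4.4 concrete, here is a hedged numpy sketch; the placeholder signature matrix S and all names are hypothetical stand-ins, and the additive form of the 2^N noise term is our reading of the recipe:

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.uniform(0, 100, (584, 22))            # placeholder for a real signature matrix
n_genes, p = S.shape

frac = rng.uniform(0, 1, p)                    # cell abundances; need not sum to 1
mix = S @ frac
s = mix.std()                                  # std of the original mixture

f = 0.3                                        # noise level, as in {0%, 30%, 60%, 90%}
mix_noisy = mix + 2.0 ** rng.normal(0, f * np.log2(s), n_genes)

out = rng.choice(n_genes, n_genes // 50, replace=False)   # ~2% outlier genes
mix_noisy[out] = 2.0 ** rng.normal(10, 0.3 * np.log2(s), out.size)
```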
3.5 Real-data application

3.5.1 Deconvolution performance on immune-cell-rich datasets

To evaluate the performance of FARDEEP in immune-cell-rich settings that are less affected by outliers, we downloaded and analyzed two gene expression datasets (GSE65135 and GSE20300) (Newman et al., 2015; Shen-Orr et al., 2010) generated on the Affymetrix Microarray, the same platform used to generate the signature matrix LM22. The GSE65135 dataset consists of (i) surgical lymph node biopsies from 14 follicular lymphoma patients and (ii) purified B and T cells from the tonsils of 5 healthy controls, and the GSE20300 dataset includes 24 blood samples from pediatric renal transplant patients. The flow cytometry or Coulter counter data in these studies, which are presented on relative scales, are treated as the ground truth; we therefore normalized the estimated parameters of each method to sum to 1 before comparison. As shown in Figures 3.5A - 3.5B for case (i) of GSE65135 and Figures 3.5D - 3.5E for GSE20300, FARDEEP outperformed CIBERSORT in terms of R², R, and SSE, consistent with our findings on simulated datasets. For case (ii) of GSE65135, we estimated the immune cell composition of purified B and T cells with purity levels exceeding 95% and 98%, respectively. For purified B cells, CIBERSORT tends to return non-zero estimates for T cells and a large proportion of other cell types, while FARDEEP gave almost all zero estimates for T cells and on average reduced the estimation error by 61%. Similarly, for purified T cells, although CIBERSORT performed better than on purified B cells, FARDEEP still significantly improved the estimation accuracy, reducing the estimation error by 48% on average (Figure 3.5C). Furthermore, as shown in the supplementary table, FARDEEP detected the genes CD79A and BCL2A1 as outliers in most samples of case (i) of GSE65135; these two genes are known to be highly expressed in follicular lymphoma (B-cell lymphoma) cells (Eray et al., 2003). Overall, even in specimens that are rich in immune cells and free of contamination by non-hematopoietic malignancy, FARDEEP still outperforms CIBERSORT in immune cell deconvolution.

3.5.2 Deconvolution performance on RNA-seq datasets

In addition to effectively handling Microarray data, FARDEEP can also deconvolve TILs from RNA-seq data when we replace the signature matrix LM22 with quanTIseq, a signature matrix generated from RNA-seq data containing ten immune cell types (Finotello et al., 2017). We applied CIBERSORT and FARDEEP with the quanTIseq signature matrix to peripheral blood mononuclear cell (PBMC) mixtures (GSE64655) generated by Hoek et al. (2015), and to lymph node bulk samples of 4 melanoma patients from GSE93722 (Racle et al., 2017). The flow cytometry data in these studies are on a relative scale and are treated as the ground truth, so we normalized the estimated parameters of each method to a relative scale using (3.3) before comparison. RNA-seq data are usually less noisy than Microarray data, and PBMC datasets are usually clean with little unknown content; we therefore expected FARDEEP and CIBERSORT to return comparable results, which is the case in Figures 3.6A and 3.6B. However, when dealing with noisier data containing more outliers, such as the lymph node bulk samples, FARDEEP gains a larger advantage over CIBERSORT, as shown in Figures 3.6C and 3.6D.

3.6 Survival analysis

3.6.1 Lung Squamous Cell Carcinoma and Ovarian Serous Cystadenocarcinoma datasets

The TME of a solid carcinoma differs from a lymph node biopsy or peripheral blood, and the highly variable gene expression of cancer cells challenges the accuracy of immune cell deconvolution. It is well established that the immune infiltration profile serves as a promising prognosticator (Galon et al., 2006; Mlecnik et al., 2016). Hence, we next utilized survival and gene expression data for ovarian cancer (OV) and lung squamous cell carcinoma (LUSC) from The Cancer Genome Atlas (TCGA) database to assess the prognostic relevance of the different deconvolution methods. These two datasets were chosen because only LM22, not the RNA-seq-based signature matrix quanTIseq, includes γδ T cells, and OV and LUSC are the only two TCGA cancer types with Affymetrix microarray data. Using the gene expression data (n = 514 for OV and n = 133 for LUSC), we estimated the immunoscore using ESTIMATE, proposed by Yoshihara et al. (2013); the TIL proportions using CIBERSORT; and the TIL subset scores using CIBERSORT (without converting to percentages) and FARDEEP. Cold tumors typically harbor lower numbers of CD8+ T cells, γδ T cells, M1-like macrophages, and NK cells (Lei et al., 2016; Qiao et al., 2017; Binnewies et al., 2018; Corrales et al., 2016b). Thus, we calculated an anti-tumor immune subsets score as the summation of the CD8+ T cell, γδ T cell, M1-macrophage, and NK cell scores. We then partitioned the patients into two groups of equal size using the median of either the immunoscore (ESTIMATE) or the anti-tumor immune subsets score (CIBERSORT and FARDEEP), and compared the survival curves between the two groups.
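The median split and log-rank comparison can be expressed compactly; this sketch assumes the lifelines package is available and uses hypothetical input arrays:

```python
import numpy as np
from lifelines.statistics import logrank_test

def median_split_pvalue(score, time, event):
    """Split patients at the median anti-tumor score and run a log-rank test.

    score: per-patient sum of CD8 T, gamma-delta T, M1 macrophage, and NK scores
    time, event: follow-up times and death indicators (1 = deceased)
    """
    high = score > np.median(score)
    result = logrank_test(time[high], time[~high],
                          event_observed_A=event[high],
                          event_observed_B=event[~high])
    return result.p_value
```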
Yoshihara et al. (2013), the TIL proportion using CIBERSORT, and TIL subset scores using CIBERSORT (without converting to percentage) and FARDEEP. Cold tumors typically harbor lower numbers of CD8+ T cells, γδ T cells, M1-like macrophages, and NK cells (Lei et al., 2016; Qiao et al., 2017; Binnewies et al., 2018; Corrales et al., 2016b). Thus, we calculated an anti-tumor immune subset score as the sum of the CD8+ T cell, γδ T cell, M1 macrophage, and NK cell scores. Then, we partitioned the patients into two equal-sized groups using the median of either the immunoscore (ESTIMATE) or the anti-tumor immune subset score (CIBERSORT and FARDEEP) and compared the survival curves between the two groups. As shown in Figure 3.7, FARDEEP most effectively separates patients into high- and low-risk groups with the smallest p-values (p = 0.0065 and 0.059 for OV and LUSC, respectively). Recently, the CIBERSORT website began supporting a beta version of an absolute mode for cell deconvolution. We also included CIBERSORT absolute mode in this survival analysis and found that it returned a better result (p = 0.037) than the relative mode on the OV dataset; FARDEEP shows a stronger performance with a smaller p-value under this setting (Figure 3.8). These results demonstrate that the TIL subset scores can provide additional clinically relevant information beyond the relative abundance.

In addition, we expected the sum of these TIL subset scores to correlate negatively with tumor purity. To test this hypothesis, we calculated the sum of the 22 TIL subset scores for both the OV and LUSC datasets and correlated it with the tumor purity estimated from the consensus measurement of purity estimations (CPE) (Aran et al., 2015). Even without taking stromal cells into account, the sum of the TIL subset scores is negatively correlated with tumor purity, as shown in Figure 3.9.

Next, we sought to investigate whether outlier removal reduces contamination by transcripts from cancer cells. We first identified the top outlier-genes that were consistently removed by FARDEEP in the OV dataset and obtained the average expression values of those outlier-genes from OV cell lines in GSE32474 (Wang et al., 2006). As shown in Table 3.2, most of these outlier-genes have high expression in cancer cell lines. For example, the CXCL10 gene encodes an important chemokine that recruits CD8+ T cells and is also highly expressed in ovarian cancer cells. Thus, although some genes in LM22 may play a role in immune cells, they may also be highly expressed and variable among cancer cells. Such cross-contamination likely skews immune deconvolution analysis. As shown in Table 3.2, FARDEEP successfully detected and removed those genes, leading to a more robust and accurate deconvolution analysis.
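The median stratification and log-rank comparison used in this survival analysis can be sketched as follows, assuming per-patient subset scores, survival times, and event indicators are available. The sketch relies on the third-party lifelines package, and all variable names and the toy data are illustrative, not our actual TCGA pipeline.

```python
import numpy as np
from lifelines.statistics import logrank_test

def median_split_logrank(score, time, event):
    """Split patients at the median of an anti-tumor immune subset
    score (e.g., CD8+ T + gamma-delta T + M1 macrophage + NK) and
    compare the two survival curves with a log-rank test."""
    high = score > np.median(score)
    res = logrank_test(time[high], time[~high],
                       event_observed_A=event[high],
                       event_observed_B=event[~high])
    return res.p_value

# Toy data: higher immune score -> longer survival on average.
rng = np.random.default_rng(1)
score = rng.gamma(2.0, 1.0, 200)
time = rng.exponential(30 * (1 + score))
event = rng.random(200) < 0.7          # ~70% of events observed
print(median_split_logrank(score, time, event))
```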
3.7 Discussion

The cancer immune microenvironment has emerged as a critical prognostic dimension that modulates patient responses to neoadjuvant therapy. However, the current clinical TNM staging system does not have a consistent method to stratify cancers based on their immunogenicity. Recent studies show that the RNA-seq datasets of whole tumors contain valuable prognostic information for assessing cancer-immunity interactions (Newman et al., 2015; Robinson et al., 2017). But the current methods for extracting immune signatures are susceptible to the frequent outliers in the datasets, leading to less effective identification of cold tumors.

Based on support vector regression, CIBERSORT is one of the most popular robust deconvolution methods. However, the model does not include an intercept to capture possible contributions from other cell types, and it performs a Z-score normalization on the data before fitting the regression model, which introduces bias into the output. A discussion of the effect of Z-score normalization in CIBERSORT is included in Appendix B. In this study, we developed a new machine learning tool, FARDEEP, to streamline the removal of outliers and increase the robustness of gene-expression profile deconvolution. Robustness is an indispensable feature for a deconvolution method because gene expression data are frequently contaminated by a large number of outliers. FARDEEP solves the deconvolution problem robustly by evaluating all outliers across the dataset and then fitting the true immune gene signature using non-negative regression. This feature is especially useful for analyzing tumors with significant non-hematopoietic tumor components. Interestingly, although FARDEEP and the current robust methods can both handle immune-cell-rich specimens such as lymph nodes and PBMCs, FARDEEP exhibits improved prognostic potential when dealing with more complex datasets containing significant carcinoma cell content.

Although FARDEEP provides a robust computational algorithm to better solve the gene-expression deconvolution problem on noisy datasets, its performance and application rely on the choice of the signature matrix. To avoid estimation bias, it is important to choose a signature matrix derived from the same platform as the mixture matrix: for gene expression data measured by Affymetrix HGU133A we should use LM22, whereas for RNA-seq data the signature matrix quanTIseq is preferred. Overall, we show that FARDEEP is a powerful and rapid machine learning tool that outperforms existing robust methods for gene-expression deconvolution on datasets with significant heavy-tailed noise. FARDEEP provides a new technology to interrogate cancer immunogenomics and more accurately map the immune landscape of solid tumors.

Figure 3.1: SSE of coefficients for different approaches. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of the coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure 3.2: Comparison of the estimation accuracy of different deconvolution approaches (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} percentages of outliers, we computed R² to evaluate how well the estimators fit the line $\hat{\beta} = \beta$. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure 3.3: Applying different deconvolution approaches to the gene expression data of IM-9, Jurkat, Raji, and THP-1 and to mixtures of these four immune cell lines with known proportions (MixA, MixB, MixC, MixD). All mixtures were performed and measured in triplicate. (A) SSE of coefficients for FARDEEP, CIBERSORT, NNLS, PERT, and DCQ. (B) Abundance of cell lines estimated by the different deconvolution approaches vs. abundance of cell lines truly mixed. The R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT.
The correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown in the figures. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.4: Performance comparison between CIBERSORT and FARDEEP on mixtures with unknown content. (A) SSE for various noise levels and abundances of tumor content on 4 different mixtures. (B) Abundance of cell lines estimated by CIBERSORT and FARDEEP vs. abundance of cell lines truly mixed. The R² and R values are reported at the top of the figures. (C) Deconvolution of individual cell subsets by FARDEEP and CIBERSORT. The correlation coefficients R, the corresponding p-values against the null hypothesis R = 0, and trend lines with 95% confidence intervals are shown in the figures. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.5: Performance assessment on Microarray data with SSE, R², and R. The follicular lymphoma dataset is evaluated for (A) overall performance and (B) individual cell subsets. (C) Normal tonsil dataset with purified B cells and T cells. The blood samples from pediatric renal transplant patients are evaluated for (D) overall performance and (E) individual cell subsets. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.6: Gene-expression deconvolution performance of FARDEEP and CIBERSORT on RNA-seq data. (A) Overall and (B) individual cell subset results for 8 PBMC samples collected from two vaccinated donors at different time points. (C) Overall and (D) individual cell subset results for lymph node bulk samples. The correlation coefficient R and the corresponding p-value are missing for CIBERSORT on NK cells because the CIBERSORT estimates are all zero. The black dashed line represents the perfect relationship between the estimated and true cell abundances, with slope 1 and intercept 0.

Figure 3.7: Kaplan-Meier survival curves based on ESTIMATE-, FARDEEP-, and CIBERSORT-assisted TIL profiling. The log-rank test was applied to data classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subset score (CIBERSORT and FARDEEP) was above or below the median. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.8: Kaplan-Meier survival curves based on CIBERSORT (absolute mode)-assisted TIL profiling. Patients were classified into two groups according to whether the immunoscore (ESTIMATE) or the collective anti-tumor immune subset score was above or below the median. The log-rank test was applied to obtain the p-value. (A) 514 patients with ovarian cancer. (B) 133 patients with lung squamous cell carcinoma.

Figure 3.9: FARDEEP scores for the TCGA OV and LUSC datasets, calculated as the sum of the 22 TIL subset scores, show strong negative correlations with the consensus measurement of purity estimations (CPE).
[Figure 3.8 panels: OV, p = 0.037; LUSC, p = 0.78 (CIBERSORT absolute mode). Figure 3.9 panels: OV, R = -0.62; LUSC, R = -0.71 (FARDEEP score vs. CPE).]

Table 3.1: Performance of FARDEEP, CIBERSORT, NNLS, PERT, and DCQ under different simulation settings with outliers, reported as the mean SSE with the standard deviation of SSE in parentheses. FARDEEP's R² and R both range from 1.00 to 1.00, while the other methods' mean SSEs are larger by several orders of magnitude (CIBERSORT roughly 5 to 7, DCQ roughly 3 to 24, PERT and NNLS up to 1.64e2) and their R² ranges extend far below zero (e.g., NNLS: -7.10e2 to 1.00; DCQ: -9.88e1 to 0.45).

Outliers   FARDEEP mean SSE (SD)
2%         1.68e-7 (1.21e-7)
4%         2.20e-7 (1.09e-7)
6%         2.35e-7 (2.17e-7)
8%         2.31e-7 (1.22e-7)
10%        2.18e-7 (2.19e-7)
12%        1.51e-7 (4.86e-8)
14%        2.38e-6 (1.01e-7)
16%        3.70e-7 (3.00e-7)
18%        2.83e-7 (9.38e-8)
20%        4.93e-7 (4.08e-7)
22%        5.99e-7 (3.51e-7)
24%        4.00e-6 (7.16e-6)
26%        1.02e-6 (1.23e-6)
28%        8.14e-7 (9.14e-7)
30%        1.54e-6 (2.04e-6)
32%        5.12e-6 (7.68e-6)
34%        3.03e-6 (3.56e-6)
36%        3.36e-5 (6.32e-5)
38%        4.53e-6 (5.91e-6)
40%        1.01e-5 (1.19e-5)
42%        1.06e-4 (2.96e-4)
44%        3.04e-5 (4.10e-5)
46%        6.44e-6 (7.10e-6)
48%        1.47e-4 (1.78e-4)
50%        5.28e-5 (5.26e-5)

Table 3.2: The outlier genes detected by FARDEEP and their average expression values in 7 ovarian cancer cell lines for the TCGA OV dataset. Listed are all genes with a removal frequency larger than 50% among the 514 samples.
Gene       Frequency  NCI-ADR-RES  OVCAR-3  OVCAR-4  OVCAR-5  OVCAR-8  SK-OV-3  IGROV1
CHI3L1     96%        9.76         37.4     1149.51  4.32     4.28     7.78     28.91
APOBEC3G   84%        654          24.5     32.12    242.22   612.56   98.77    90.38
CXCL10     84%        25.05        66.55    15.25    26.47    46.24    47.85    45.19
IFI44L     79%        37.15        182.44   15.44    58.11    61.26    66.49    4.63
SLC12A8    79%        291.4        426.97   342.54   81.4     432.2    77.93    89.91
RGS1       75%        3.63         3.47     2.51     3.28     10.2     7.74     6.82
C1orf54    71%        80.14        33.86    117      19.46    98.08    46.11    98.27
FZD3       69%        65.76        394.81   216.9    74.34    108.69   477.65   878.79
PNOC       68%        31.96        4.63     28.14    25.08    16.63    40.56    314.87
STXBP6     66%        537.71       2231.15  464.33   217.93   240.62   4.99     6.41
TRIB2      66%        140.69       568.85   108.99   1910.09  113.24   439.38   216.98
GSTT1      63%        749.88       1787.77  19.09    994.12   604.82   518.46   1176.07
MS4A6A     61%        29.42        31.78    16.15    21.76    40.23    23.15    31.04
SPOCK2     59%        117.32       326.4    94.08    213.96   76.59    88.95    145.94
TCF7       57%        63.48        79.55    92.49    238.59   93.24    119.39   80.23
PPFIBP1    56%        391.73       1279.24  819.14   828.22   277.91   415.94   615.33
RSAD2      56%        17.54        46.41    33.83    86.84    36.56    30.13    18.02
PLEKHF1    54%        120.38       2345.08  522.76   167.33   284.82   687.72   322.99
FCGR2B     53%        7.64         30.34    30.8     8.08     14.7     15.7     32.79
HLA-DQA1   52%        15.65        27.75    32.08    6.74     29.95    19.1     36.81
LAMP3      52%        250.12       103.46   1963.43  165.71   1186.43  179.73   366.39
NCF2       51%        183.96       9.12     87.42    317.48   423.67   21.72    237.07

Chapter 4

Deep learning framework on DNA quantification

The ability to modulate gene expression levels enables plants to respond to stressful environments, and such changes in gene expression are among the most prominent forms of regulation in plants. One key mechanism is the binding of transcription factors to specific DNA sequences, such as regulatory elements situated near (cis) or far (trans) from the gene. To further develop our ability to model the adaptive capabilities of various species in different environments across time, we pursued a competitive molecular biology challenge: predicting gene expression while taking diverse environmental conditions into account. Our ability to predict gene expression responses has grown out of efforts to identify both transcription and regulatory factors (Uygun et al., 2017; Wilkins et al., 2016), but how these responses are regulated warrants further research.

One of the first deep learning models in this area was a multi-layer fully connected neural network designed to predict gene expression profiles. Its authors measured expression on a set of 1,000 landmark genes, which captured approximately 80% of the total expression variation in humans (Subramanian et al., 2017; Chen et al., 2016), and used the variation in these features to infer the expression of the remaining non-landmark genes. To predict non-landmark gene expression patterns, Ghasedi Dizaji et al. (2018) and Wang et al. (2018) developed generative adversarial networks whose discriminators operate on landmark gene expression patterns. Deep learning has also been used to predict gene expression patterns from omic information, including histone modification patterns, nucleosome occupancy, chromosome conformation, and transcription factor binding sites (Singh et al., 2016; Read et al., 2019). However, these approaches have two significant drawbacks. First, the generation of the input data is expensive, usually costing more than directly measuring the expression levels themselves; this limits the applicability of these methodologies to model organisms despite the publicly available data.
Second, bias emerges from the choice and handling of predictive factors; more specifically, models trained on landmark gene expression patterns perform poorly on genes with unusual expression patterns. Fortunately, deep learning architectures allow us to predict gene expression directly from raw DNA sequence information, which addresses both limitations. For instance, Washburn et al. (2019) used as input the 1.5 kilobases (kb) of DNA sequence surrounding the transcription start and stop sites of a gene to predict the presence or absence of gene expression and to determine which of two orthologous genes would show a higher expression level. Their investigation relied on a convolutional neural network (CNN), a type of deep learning algorithm that learns kernels, or pattern finders, to detect local patterns and optimize model performance. Applied to DNA sequence, these kernels function as motif finders. To better understand how a model recognizes and learns motifs, probing the trained kernels is necessary for identifying essential known and novel regulatory elements.

Studies over the past decade have clarified that the mere presence or absence of a regulatory motif does not suffice to explain gene regulation. Instead, a so-called regulatory grammar, comprising the number, location, and orientation of regulatory elements along with their co-localization, is at work (Weingarten-Gabbay and Segal, 2014). To model this complex regulatory grammar, several deep learning strategies originally created for natural language processing have been proposed. The model proposed by Quang and Xie (2016) combines a convolutional layer with a bi-directional long short-term memory network (BiLSTM), which enables the model to capture long-range dependencies while maintaining a spatial memory. The resulting model, DanQ, learned complex sequence grammar and predicted sequence features such as chromatin profiles, transcription factor binding, DNaseI sensitivity, and histone marks. However, in RNN-based methods the long-range information must pass through all cells sequentially, and it is easily corrupted when repeatedly multiplied by small values; this is the vanishing gradient problem. Although the LSTM was proposed to address this issue, its gradients still vanish on very long sequences such as DNA sequence data. To address these limitations, Vaswani et al. (2017) developed the Transformer, which abandons the recurrent structure entirely and uses self-attention to identify complex long-range interactions; because self-attention does not process the sequence sequentially, it handles long sequences well.
Motivated by these developments, we propose DeepCAT, a deep learning-based convolutional and self-attention neural network framework. We demonstrate the utility of DeepCAT by predicting gene expression responses to 57 environmental stress conditions from raw sequence data for Arabidopsis thaliana, and we show that DeepCAT can be used to efficiently identify novel TFBMs and their higher-order interactions.

4.1 Materials and Methods

4.1.1 Features and data

The gene expression data of Arabidopsis thaliana were downloaded from the AtGenExpress database and processed as in Uygun et al. (2017). The 57 responses indicate whether an Arabidopsis gene was up-regulated or not in shoot tissue under each of 36 abiotic (e.g., cold, heat, osmotic) and 21 biotic (e.g., Pseudomonas syringae, bacterial flagellin) stress conditions. More specifically, the expression data were processed with Limma (Ritchie et al., 2015) in the R environment, and genes with log2 fold changes greater than 1 were considered up-regulated and labeled as 1.

DNA sequence information was pulled from TAIR10 and included 1 kilobase (kb) upstream and 500 base pairs (bp) downstream of the transcription start site, and 500 bp upstream and 1 kb downstream of the transcription stop site. This sequence information was one-hot encoded into a $3002 \times 4$ binary matrix, with columns corresponding to A, C, G, and T and rows corresponding to each position in the sequence. Genes were randomly assigned to the training (70%), validation (10%), or testing (20%) set. The validation set was used to determine the optimal number of training iterations; model performance on the testing set at that iteration was then evaluated with two metrics, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and the Precision-Recall AUC (PR-AUC).

4.1.2 The DeepCAT model design

DeepCAT is composed of a convolution layer, a pooling layer, Transformer encoder layers (Vaswani et al., 2017), and a fully connected layer, as shown in Figure 4.1. The convolution and pooling layers extract features from the sequence, and the Transformer encoder layers learn the interactions between different features of the sequence. Finally, the fully connected layer aggregates all information and computes the probability for each of the 57 stress factors through a sigmoid activation function.

Formally, suppose the kernel sizes of the convolution and pooling layers are $k_c$ and $k_p$, respectively; then the first two layers of the model can be formulated as

$$Y^{(0)}_{mk} = \mathrm{ReLU}\Big(\sum_{i=1}^{k_c}\sum_{j=1}^{4} W^{(k)}_{ij} X_{m+i-1,j} + b_k\Big),$$
$$Y^{(1)}_{nk} = \max\big\{Y^{(0)}_{(n-1)k_p+1,k},\; Y^{(0)}_{(n-1)k_p+2,k},\; \ldots,\; Y^{(0)}_{(n-1)k_p+k_p,k}\big\},$$

where $X$ is the input $3002 \times 4$ sequence matrix, $Y^{(0)}$ and $Y^{(1)}$ are the outputs of the convolution and pooling layers, $W$ and $b$ are the kernel weight matrix and bias, $k$ is the kernel index, and the activation function is $\mathrm{ReLU}(x) = \max(0, x)$. The output can then be considered as a new sequence with $\frac{3001 - k_c}{k_p}$ tokens, where the length of each token is $k$.
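The architecture outlined so far can be condensed into a short PyTorch sketch. This is an illustrative re-implementation rather than the trained DeepCAT code: it uses torch.nn.TransformerEncoder in place of a hand-written attention stack, the [CLS] construction and positional encoding anticipate the definitions given below, and the toy batch is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_pe(length, dim):
    """Sine/cosine positional encoding as in Vaswani et al. (2017)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / dim)   # (length, dim/2)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class DeepCATSketch(nn.Module):
    """Minimal convolution + self-attention model for one-hot DNA
    input of shape (batch, 3002, 4); k + 1 = 320 is divisible by
    the four attention heads."""
    def __init__(self, k=319, kc=26, kp=13, n_tasks=57, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(4, k, kernel_size=kc)        # motif scanners
        self.pool = nn.MaxPool1d(kernel_size=kp)
        layer = nn.TransformerEncoderLayer(d_model=k + 1, nhead=n_heads,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        cls_tok = torch.zeros(1, 1, k + 1)
        cls_tok[0, 0, -1] = 1.0      # all-zero [CLS] except the padded slot
        self.register_buffer("cls_tok", cls_tok)
        self.head = nn.Linear(k + 1, n_tasks)

    def forward(self, x):                                  # (batch, 3002, 4)
        h = self.conv(x.transpose(1, 2)).relu()            # (batch, k, 2977)
        h = self.pool(h).transpose(1, 2)                   # (batch, 229, k)
        h = F.pad(h, (0, 1))                               # widen tokens to k+1
        h = torch.cat([self.cls_tok.expand(h.size(0), -1, -1), h], dim=1)
        h = h + sinusoidal_pe(h.size(1), h.size(2))
        h = self.encoder(h)
        return torch.sigmoid(self.head(h[:, 0]))          # [CLS] -> 57 probs

# Toy forward pass on a random one-hot batch.
x = torch.eye(4)[torch.randint(0, 4, (2, 3002))]           # (2, 3002, 4)
print(DeepCATSketch()(x).shape)                            # torch.Size([2, 57])
```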
Inspired by BERT (Devlin et al., 2018), we add a special token ([CLS]) at the start of the sequence before feeding it into the Transformer encoder layers, and the final hidden state of this token is used as the aggregate sequence representation. The [CLS] token has $k + 1$ dimensions, with all elements zero except the $(k+1)$th; we also pad the $(k+1)$th element of all other tokens with zero. To inject position information into the vectors, a positional encoding based on sine and cosine functions is added, as described in Vaswani et al. (2017):

$$PE_{(pos,2i)} = \sin\!\big(pos/10000^{2i/(k+1)}\big), \qquad PE_{(pos,2i+1)} = \cos\!\big(pos/10000^{2i/(k+1)}\big),$$
$$Y^{(2)} = \begin{bmatrix} [\mathrm{CLS}] \\ Y^{(1)} \end{bmatrix} + PE,$$

where $pos$ is the position and $i$ is the dimension. The outputs are then fed into two deep Transformer encoders, each containing multi-head self-attention layers and feed-forward neural network layers. The multi-head self-attention layer can be computed as

$$Y^{(3)} = [H_1, \ldots, H_h]\, Z^O,$$

where $h$ is the number of heads (set to four in this study), $Z^O$ is a $(k+1) \times (k+1)$ projection matrix, and

$$H_i = \mathrm{softmax}\!\left(\frac{(Y^{(2)} Z^Q_i)(Y^{(2)} Z^K_i)^T}{\sqrt{d}}\right) Y^{(2)} Z^V_i,$$

with $Z^Q_i$, $Z^K_i$, and $Z^V_i$ the weight matrices for the query, key, and value in the Transformer, each of dimension $(k+1) \times d$ and $d = (k+1)/h$. In addition to the attention sub-layers, each Transformer encoder layer includes a fully connected feed-forward network consisting of two linear projections with weight matrices $W_1$, $W_2$ and biases $b_1$, $b_2$:

$$Y^{(4)} = \mathrm{ReLU}(Y^{(3)} W_1 + b_1) W_2 + b_2.$$

Finally, we treat the final hidden state of the special token ([CLS]) as a summary of the information from the previous layers and feed it into a fully connected layer with a sigmoid activation, which scales the results into the range zero to one for the classification task:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.$$

4.1.3 Training of the model

To train the model, we minimized the average of the multi-task binary cross-entropy loss functions by minibatches of size 50 using the Adam optimizer, with all weights initialized by the Xavier uniform initializer. Dropout with a rate of 0.1 in the attention layers was included to regularize the model. The validation ROC score was used to monitor model convergence. Our model took around 300 epochs to train fully, and the training time for each epoch was 30 seconds. Our implementation uses a PyTorch backend, and an NVIDIA K80 GPU was used for training.

Figure 4.1: Model architecture. The convolution layer acts as a motif scanner and produces an output of dimension (sequence length - kernel size + 1) × (number of kernels). The max-pooling layer reduces the output dimension by extracting the maximum values through each pooling kernel. The special token and position encoding matrix are then added to the output. The two Transformer encoder layers consider the interactions between the tokens of the output sequence from the previous layers and re-interpret it. Finally, a sigmoid nonlinear transformation is performed to get the prediction scores.

4.1.4 Transfer learning

Since the size of our dataset is not quite sufficient, we considered transferring stored knowledge of motifs into DeepCAT so that our model could be trained from a better starting point. This is introduced through the convolution layer: since, as mentioned above, the convolution layer plays the role of scanning motifs, a good initialization of the kernel weights can bring our model closer to the optimal solution. We tried two sources of data. One is the pre-trained kernel weights of DeepCAT from a human dataset (Zhou and Troyanskaya, 2015), which contains 4 million training examples.
The other is the position weight matrices of Arabidopsis thaliana from the DAP-seq (O'Malley et al., 2016) and CIS-BP (Weirauch et al., 2014) databases.

4.1.5 Clustered multi-task learning

In contrast to traditional machine learning methods, the loss function of DeepCAT in Section 4.1.2 is based on all 57 stress responses. However, while some closely related stresses are expected to benefit each other's prediction, others may limit prediction accuracy. Thus, we cluster the stress responses into three groups by k-means clustering and train the model separately for each group. As the results in the next section show, some of the responses with poor performance improved significantly.

4.1.6 Interpreting the model

To visualize the patterns learned by DeepCAT, we convert the $k$ kernel weight matrices into position weight matrices (PWMs), which can also be considered as $k$ motifs, following the approach of DeepBind (Alipanahi et al., 2015):

• After training the model, we obtain $k$ kernel weight matrices from the convolution layer and, for each sequence, the corresponding output $Y^{(0)}_k$.

• For each kernel, find the position $t$ where the motif shows the maximum response for $Y^{(0)}_k$, i.e., $t = \mathrm{argmax}_m Y_{mk}$, and extract the subsequence spanning positions $t$ to $t + k_c - 1$ from the raw sequence.

• Considering only the sequences with non-zero values at some positions of $Y^{(0)}_k$, we stack the subsequences together and count the nucleotide frequencies. The PWMs are then calculated by normalizing the nucleotide counts to one.

Once we obtain the position weight matrices, we align them to the known motif database using the TOMTOM algorithm (Gupta et al., 2007), with a p-value threshold of 0.05 to measure the similarities. We validate this kernel conversion method in Appendix C. In addition, to further measure the importance of each kernel for the different stresses, we individually replace each kernel with all-zero values and track the change in the prediction AUC scores. The kernel producing the sharpest reduction in AUC score is considered the most important one for predicting the corresponding stress response. Finally, we learn the interactions between motifs by extracting the weight matrices from the first self-attention layer, that is, $\mathrm{softmax}\big((Y^{(2)} Z^Q_i)(Y^{(2)} Z^K_i)^T / \sqrt{d}\big)$ for each head $i$.

4.2 Results

4.2.1 Random initialization of the convolution kernel weights

In this section, we compare the performance of our model with DanQ, support vector machines (SVM), and random forests. We first trained our model with parameters $k = 319$, $k_c = 26$, $k_p = 13$, and $h = 4$. Our model took about 3000 epochs to train fully with learning rate $10^{-6}$, and the training time for each epoch was around 10 seconds. For DanQ, we kept its original parameter settings. In summary, DeepCAT shows the best average ROC and PR scores (Figure 4.2).

Figure 4.2: Performance comparison among DeepCAT, DanQ, SVM, and random forest by (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses. [Legend means (ROC / PR): SVM 0.674 / 0.123; random forest 0.679 / 0.120; DanQ 0.691 / 0.116; DeepCAT 0.714 / 0.138.]

4.2.2 Transfer learning with human data

Compared with the training set of about 4 million sequences used for human data in Zhou and Troyanskaya (2015), our data size is not sufficient. Thus, we pre-trained DeepCAT on the human data with the same parameters and transferred its final convolution kernel weights into our model as the initialization.
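A minimal sketch of this kernel-weight transfer, reusing the DeepCATSketch module from the earlier sketch: `source` stands in for the model pre-trained on the human data, and only the convolution weights are copied while every other layer keeps its random initialization.

```python
import torch

def transfer_conv_weights(source, target):
    """Copy pre-trained convolution kernel weights (the motif
    scanners) into a freshly initialized model."""
    with torch.no_grad():
        target.conv.weight.copy_(source.conv.weight)
        target.conv.bias.copy_(source.conv.bias)

src = DeepCATSketch()   # stand-in for the model pre-trained on human data
tgt = DeepCATSketch()
transfer_conv_weights(src, tgt)
```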
Figure 4.3 shows that the performance of DeepCAT improved on both ROC-AUC and PR-AUC scores, indicating that overlapping motif information exists in DNA sequence data across species.

Figure 4.3: The effect of applying pre-trained weights from human data to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The straight line y = x represents equality between the results of randomly initialized DeepCAT and DeepCAT with transfer learning based on human data. [Legend means (ROC / PR): DeepCAT with pre-trained human weights 0.730 / 0.151 vs. DeepCAT 0.714 / 0.138.]

4.2.3 Transfer learning with a known motif database

We showed in the last section that pre-trained kernel weights from a large dataset can efficiently improve model performance, even across species. Thus, we expected further improvement from incorporating prior knowledge of Arabidopsis thaliana into the model. First, we compiled 865 known TFBMs of Arabidopsis thaliana from the DAP-seq (O'Malley et al., 2016) and CIS-BP (Weirauch et al., 2014) databases; we then used hierarchical clustering to pick out the 100 most distinct TFBMs, with lengths from 6 to 23, and placed them in the middle of 100 kernels with randomly generated edges. In addition, we added another 219 random kernels to capture other information. As shown in Figure 4.4, both ROC-AUC and PR-AUC improved significantly. Thus, the known TFBMs can help DeepCAT learn much more efficiently.

Figure 4.4: The effect of applying known TFBMs to DeepCAT. (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The straight line y = x represents equality between the results of randomly initialized DeepCAT and DeepCAT with transfer learning based on the known TFBM database. [Legend means (ROC / PR): DeepCAT with known TFBMs 0.740 / 0.162 vs. DeepCAT 0.714 / 0.138.]

4.2.4 Performance of clustered multi-task learning

From Figure 4.2 we can see that SVM and random forest perform better on a few groups of the responses, especially those on which the deep learning model scores poorly. The reason is that these "shallow" machine learning methods train a separate model for each response, so each loss function depends on one single response and the responses do not affect each other during training. Thus, the under-performing responses of DeepCAT may be only distantly related to the others and suffer negative effects as a result. To address this problem, we grouped the 57 responses using hierarchical clustering with Jaccard's distance (Figure 4.5). Not surprisingly, the responses related to heat conditions are grouped separately from the others, and most of them are also the under-performing responses.

Figure 4.5: Hierarchical clustering with Jaccard's distance for the 57 stress responses.

Referring to the results in Figure 4.5, we trained separate models for these two clusters. The convolution kernel weight matrices were all initialized with the known TFBMs in the same way as in Section 4.2.3.
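The response-grouping step can be sketched as follows, assuming a binary gene-by-stress label matrix; the toy labels and the average-linkage choice are illustrative, and SciPy's pdist supports the Jaccard distance directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
labels = rng.random((5000, 57)) < 0.1   # toy binary up-regulation labels

# Jaccard distance between the 57 stress responses (columns),
# then hierarchical clustering cut into two groups.
d = pdist(labels.T, metric="jaccard")
groups = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
print(groups)
```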
The results are shown in Figure 4.6: multi-task learning on top of transfer learning clearly improved the performance of DeepCAT, and most of the ROC-AUC and PR-AUC scores for the previously under-performing responses became higher than those of the individually trained SVM and random forest.

Figure 4.6: Performance comparison among DanQ, SVM, random forest, DeepCAT, DeepCAT with known TFBMs, and multi-task DeepCAT with known TFBMs by (A) ROC-AUC scores and (B) PR-AUC scores for the 57 stress conditions. The stress conditions are ordered by the DeepCAT values on the x-axis, and the average ROC and PR scores for each approach are included in parentheses. [Legend means (ROC / PR): SVM 0.674 / 0.123; random forest 0.679 / 0.120; DanQ 0.691 / 0.116; DeepCAT 0.714 / 0.138; DeepCAT known TFBMs 0.740 / 0.162; DeepCAT known TFBMs multi-task 0.750 / 0.177.]

4.2.5 Transcription factor binding motifs and their interactions

Based on the best results obtained in Section 4.2.4, we aligned the learned motifs to known TFBMs (DAP-seq and CIS-BP) using the TOMTOM algorithm (Gupta et al., 2007). In summary, 196 of the 319 motifs were significantly matched to known TFBMs with E-values less than 0.05 (Figure 4.7). Using the method introduced in Section 4.1.6, we list the most important aligned TFBMs for each of the 57 stress conditions in Appendix C. We consider the top motifs that cannot be aligned to any known TFBM to be identified novel TFBMs that are important for predicting stress responses; some example sequence logos are shown in Figure 4.8.

Figure 4.7: Three examples of TFBMs aligned by the TOMTOM algorithm (Homeodomain ATHB-14, MADS-box 43709, and Myb-SANT MYB15). The upper figures are the known TFBMs, and the lower ones are generated from the convolution kernel weights trained by DeepCAT. The p-values are 7.46e-11, 1.22e-13, and 3.27e-11, respectively.

Figure 4.8: Three examples of identified novel TFBMs. The upper figures come from the initialized random motifs, and the lower ones are generated from the convolution kernel weights trained by DeepCAT.

Chapter 5

Conclusion and Future work

The contributions of our work can be summarized in two respects. On the statistical methodology side, we proposed Adaptive Least Trimmed Square, which adjusts the proportion of outliers at each iteration and automatically identifies and removes them before the coefficients are estimated. This is a hard-threshold-based robust regression and outlier detection method, for which theoretical guarantees are also provided in this study.
From the application perspective, we developed FARDEEP, based on Adaptive Least Trimmed Square, to solve the gene-expression deconvolution problem; the algorithm addresses the challenge that tumor micro-environmental data are very noisy, so that tumor-specific genes may degrade the estimation accuracy. In addition, we proposed a deep learning structure that can identify complex patterns in transcription factor binding sites in DNA sequences and used it to predict the responses of plants in different stressful environments. The combination of convolution layers and self-attention layers helps the model identify both motifs and their long-range dependencies for sequences of any length. In summary, our work contributes to the development of statistical methodologies and machine learning algorithms for solving biological problems.

In the future, we plan to extend our outlier detection and robust regression framework to non-linear models so that it can be widely used in other practical problems. In addition, DeepCAT has so far focused on predicting whether gene expression is up-regulated or not under different stress environments. It would also be interesting to know the magnitude of the fold change in gene expression level; thus, we will further adapt DeepCAT to a regression problem by modifying some of its current layers. These remaining problems will be investigated in our future work.

APPENDICES

APPENDIX A

Supplementary text for Chapter 2

Proof of Theorem 2.3.1

The Adaptive Least Trimmed Square (ALTS) iteration sequence satisfies $L(\beta^{(j+1)}) \leq L(\beta^{(j)})$ for any $j \geq 0$, with objective function

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i\beta)^2 \, \mathbf{1}_{\{\gamma_i = 0\}}.$$

Proof. For the $j$th iteration,

$$L(\beta^{(j)}) = \sum_{i \in S_{j-1}} (y_i - x_i\beta^{(j)})^2 \geq \sum_{i \in S_{j-1}} (y_i - x_i\beta^{(j+1)})^2 \geq \sum_{i \in S_j} (y_i - x_i\beta^{(j+1)})^2 = L(\beta^{(j+1)}).$$

Proof of Theorem 2.3.2

The Adaptive Least Trimmed Square (ALTS) algorithm stops in no more than $j^*$ steps, where

$$j^* = \left\lfloor \frac{-\log \alpha_1}{\log \alpha_2} \right\rfloor.$$

Here $\lfloor x \rfloor$ is the largest integer that is less than or equal to $x$.

Proof. Assume that $j^*$ steps have been taken in ALTS; then

$$\alpha_2^{j^*} \alpha_1 \frac{n}{2} \leq \frac{n}{2}, \qquad \text{i.e.,} \qquad \alpha_2^{j^*} \alpha_1 \leq 1,$$

which leads to the result.

Proof of Theorem 2.3.3

Suppose the corruption rate at the $j$th iteration of ALTS is $\rho_j$, the outlier vector satisfies $\|\gamma\|_0 \leq n\rho^*$, and $\rho_0 > \rho^*$. Let $c^-_\tau$ and $c^+_\tau$ be the subset strong convexity and smoothness constants for $X$ at level $\tau$. We assume that

$$\eta = 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_0}}} < 1.$$

Then if $\log_{\alpha_2}\frac{1}{\alpha_1} = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2,$$

with $\epsilon > 0$ and some constant $C$.

Proof. For the $j$th iteration, we have

$$\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 - \|\gamma_{S_j} + \varepsilon_{S_j}\|_2 \leq \|X_{S_j}(\beta^* - \beta^{(j+1)}) + \gamma_{S_j} + \varepsilon_{S_j}\|_2 = \|y_{S_j} - X_{S_j}\beta^{(j+1)}\|_2$$
$$\leq \|y_{S_j} - X_{S_j}\beta^*\|_2 = \|\gamma_{S_j} + \varepsilon_{S_j}\|_2.$$

Thus,

$$\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 \leq 2\,\|\gamma_{S_j} + \varepsilon_{S_j}\|_2.$$
Since $\|X_{S_j}(\beta^* - \beta^{(j+1)})\|_2 \geq \sqrt{c^-_{1-\rho_j}}\,\|\beta^* - \beta^{(j+1)}\|_2$, we have

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \frac{2}{\sqrt{c^-_{1-\rho_j}}}\Big(\|\gamma_{S_j}\|_2 + \|\varepsilon_{S_j}\|_2\Big). \tag{A.1}$$

For the $j$th iteration, the residuals are ordered as $|r^{(j)}_{\pi(1)}| \leq |r^{(j)}_{\pi(2)}| \leq \cdots \leq |r^{(j)}_{\pi(n)}|$, where $\pi$ is a permutation of $n$ elements, and we define $\tilde{S}_j = \{\pi(i),\, i \leq n(1 - \rho_0)\}$. Then we have

$$\|y_{\tilde{S}_{j+1}} - X_{\tilde{S}_{j+1}}\beta^{(j+1)}\|_2 \leq \|y_{S^*} - X_{S^*}\beta^{(j+1)}\|_2,$$

i.e.,

$$\|X_{\tilde{S}_{j+1}}(\beta^* - \beta^{(j+1)}) + \gamma_{\tilde{S}_{j+1}} + \varepsilon_{\tilde{S}_{j+1}}\|_2 \leq \|X_{S^*}(\beta^* - \beta^{(j+1)}) + \varepsilon_{S^*}\|_2.$$

Define the sets $E_{j+1} = \tilde{S}_{j+1} \setminus S^*$ and $T_{j+1} = S^* \setminus \tilde{S}_{j+1}$; then

$$\|X_{E_{j+1}}(\beta^* - \beta^{(j+1)}) + \gamma_{E_{j+1}} + \varepsilon_{E_{j+1}}\|_2 \leq \|X_{T_{j+1}}(\beta^* - \beta^{(j+1)}) + \varepsilon_{T_{j+1}}\|_2,$$

so that

$$\|\gamma_{E_{j+1}}\|_2 \leq \|X_{E_{j+1}}(\beta^* - \beta^{(j+1)})\|_2 + \|\varepsilon_{E_{j+1}}\|_2 + \|X_{T_{j+1}}(\beta^* - \beta^{(j+1)})\|_2 + \|\varepsilon_{T_{j+1}}\|_2 \leq 2\sqrt{c^+_{\rho_0}}\,\|\beta^* - \beta^{(j+1)}\|_2 + 2\,\|\varepsilon\|_2.$$

Since $\|\gamma_{\tilde{S}_{j+1}}\|_2 = \|\gamma_{E_{j+1}}\|_2$, we apply (A.1) and get

$$\|\gamma_{\tilde{S}_{j+1}}\|_2 \leq 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}}\,\|\gamma_{S_j}\|_2 + \left(4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}} + 2\right)\|\varepsilon\|_2 \leq 4\sqrt{\frac{c^+_{\rho_0}}{c^-_{1-\rho_j}}}\,\|\gamma_{S_j}\|_2 + 3\,\|\varepsilon\|_2. \tag{A.2}$$

Then we apply (A.2) to (A.1):

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \frac{2}{\sqrt{c^-_{1-\rho_j}}}\Big(\|\gamma_{S_j}\|_2 + \|\varepsilon_{S_j}\|_2\Big) \leq \frac{2}{\sqrt{c^-_{1-\rho_0}}}\Big(\|\gamma_{\tilde{S}_j}\|_2 + \|\varepsilon\|_2\Big) \leq \frac{2\eta^j}{\sqrt{c^-_{1-\rho_0}}}\,\|\gamma_{S_0}\|_2 + \frac{3(1 - \eta^j)}{1 - \eta}\left(\frac{2}{\sqrt{c^-_{1-\rho_0}}} + 1\right)\|\varepsilon\|_2.$$

Since $\sqrt{c^-_{1-\rho_0}} \geq O(\sqrt{n})$ for large enough $n$, when $j = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \beta^{(j+1)}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2.$$

Thus, if $\log_{\alpha_2}\frac{1}{\alpha_1} = O\!\left(\log_\eta \frac{\sqrt{n}}{\|\gamma_{S_0}\|_2}\right)$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + \frac{C}{\sqrt{n}}\,\|\varepsilon\|_2.$$
Proof of Corollary 2.3.1

Suppose the assumptions are the same as in Theorem 2.3.3 and the random noises follow a Gaussian distribution, i.e., $\varepsilon \sim N(0, \sigma^2 I)$. Then when $n > \log\frac{1}{\delta}$ for some $\delta > 0$, with probability at least $1 - \delta$,

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + C\sigma,$$

with $\epsilon > 0$ and some constant $C$.

Proof. If $\varepsilon \sim N(0, \sigma^2 I)$, then $\|\varepsilon\|_2^2 / \sigma^2 \sim \chi^2_n$. Following Lemma 1 of Laurent and Massart (2000),

$$P\left(\|\varepsilon\|_2^2 - n\sigma^2 > 2\sigma^2\sqrt{n\log\tfrac{1}{\delta}} + 2\sigma^2\log\tfrac{1}{\delta}\right) \leq \delta,$$

so that

$$P\left(\|\varepsilon\|_2^2 \leq n\sigma^2 + 2\sigma^2\sqrt{n\log\tfrac{1}{\delta}} + 2\sigma^2\log\tfrac{1}{\delta}\right) \geq 1 - \delta.$$

For any $n > \log\frac{1}{\delta}$, $P\big(\|\varepsilon\|_2 < \sqrt{5n}\,\sigma\big) > 1 - \delta$. Thus, combining this result with Theorem 2.3.3, we can get

$$\|\beta^* - \hat{\beta}_{ALTS}\|_2 \leq \epsilon + C\sigma,$$

with probability at least $1 - \delta$.

APPENDIX B

Supplementary text for Chapter 3

Effect of Z-score normalization

CIBERSORT employs a Z-score normalization on both $y$ (the gene expression profile of the tumor) and $X$ (the expression values of the signature genes) to stabilize the Support Vector Regression algorithm. To assess the effect of Z-score normalization, we performed a brief mathematical derivation together with several simulations, as follows. Consider

$$y = X\beta + \epsilon, \tag{B.1}$$

where $\epsilon$ is the noise. First, CIBERSORT applies a Z-score transformation to both $y$ and $X$ to obtain

$$y^* = \frac{y - \bar{y}\mathbf{1}}{s_y}, \qquad X^* = \frac{X - \bar{X}\mathbf{1}\mathbf{1}^T}{s_X},$$

where $\bar{y} = \sum_{i=1}^n y_i / n$, $\bar{X} = \sum_{i=1}^n \sum_{j=1}^p x_{ij} / (np)$, and $s_y$ and $s_X$ are the sample standard deviations of $\{y_i\}_{i=1}^n$ and $\{x_{ij}\}$, respectively. Then, CIBERSORT applies Support Vector Regression (SVR) to $y^*$ and $X^*$ to obtain the coefficients. Note that in practice $s_y$ and $\bar{y}$ differ from $s_X$ and $\bar{X}$, respectively. Thus, the SVR coefficients based on $y^*$ and $X^*$ are much less likely to equal those based on $y$ and $X$.

To demonstrate the point, we simulated several mixtures based on the first 9 columns of the LM22 signature matrix with $\beta \in \mathbb{R}^9 = (2, 1.5, 1.2, 1, 0.8, 0.6, 0.5, 0.3, 0.2)'$ and random noise $\epsilon \sim 2^{\mathcal{N}(0,\,(0.3\log_2 s)^2)}$, where $s$ is the standard deviation of the original mixtures. As shown in Figure B.1, the estimates from SVR with Z-score normalization are skewed away from the truths, which are denoted by the red dashed lines in the figures. Thus, the Z-score normalization in CIBERSORT prevents it from recovering the absolute cell abundance. Also, although SVR without Z-score normalization performed well in this simulation, it is very slow and may not converge for a high-dimensional signature matrix, and variables with greater numeric ranges will dominate those with smaller numeric ranges, as discussed in Chang and Lin (2011) and Hsu et al. (2003). This is also the reason we include only the first 9 columns of LM22 in this simulation.

Figure B.1: Effect of Z-score normalization. The red dashed line indicates the true absolute abundance of each cell type. SVR is common Support Vector Regression without normalization, and SVR.znormalization is the method used in CIBERSORT. [Panel legends: SVR, SSE = 0.09; SVR.znormalization, SSE = 691.07.]

Importance of TIL subset scores

The TIL subset score is a good indicator of the absolute abundance of TILs and is an important scale for evaluating the immune level of cancer patients.
To demonstrate this, we simulated two mixtures without noise using the pure immune cell line (Jurkat, IM-9, Raji, and THP-1) expression profiles from GSE11103 and two different cancer cell line (colon cancer: HCT116; melanoma: MDA-MB-435) expression profiles from GSE10650 and GSE32474, respectively, as follows:

MIX1 = 0.2 cancer cell line + 0.2 Jurkat + 0.2 IM-9 + 0.2 Raji + 0.2 THP-1
MIX2 = 0.8 cancer cell line + 0.05 Jurkat + 0.05 IM-9 + 0.05 Raji + 0.05 THP-1

where the cancer cell line is either HCT116 or MDA-MB-435. We then applied CIBERSORT and FARDEEP with the signature matrix generated from Jurkat, IM-9, Raji, and THP-1, which can be downloaded from the CIBERSORT website. The outputs are shown in Table B.1. CIBERSORT gives only relative proportions of around 0.25 for both mixtures, even though mixture 2 has less immune content than mixture 1. In contrast, FARDEEP accurately estimates the exact abundance of the immune cells through the TIL subset scores, because they are direct estimates from the linear model. We can also derive the relative abundance (fraction) among immune cells from the TIL subset scores by

$$\tilde{\beta}_j = \frac{\hat{\beta}_j}{\sum_{k=1}^p \hat{\beta}_k},$$

where $\hat{\beta}_j$ is the $j$th TIL subset score. As shown in Table B.1, the relative abundance results derived from the TIL subset scores still outperform CIBERSORT because:

a. although most of the probesets in the signature matrix have high expression in the immune cell lines and low expression in the cancer cell lines (HCT116 and MDA-MB-435), several signature probesets are highly expressed in HCT116 or MDA-MB-435 (Figure B.2), which likely introduces an estimation bias. FARDEEP automatically removes those outlier probesets before estimating the abundance of the immune cells, as shown in Figure B.2, hence delivering a better estimate;

b. compared to CIBERSORT, FARDEEP includes an extra intercept in the regression model, which can capture the background effect of non-immune content such as tumor cells.

Interestingly, even with different ratios of cancer cells in MIX1 and MIX2, FARDEEP detected the same set of outliers from the two mixtures within each cancer type. There are 137 and 151 outliers for the HCT116 and MDA-MB-435 mixtures, respectively, of which 90 are shared. This result suggests that different cancer cell lines have distinct highly expressed genes, and the outlier-genes strongly depend on the sample type.

Table B.1: Results from CIBERSORT and FARDEEP on two mixtures (MIX1 and MIX2) with HCT116 and MDA-MB-435, respectively. FARDEEP (relative prop) was obtained by normalizing the default outputs to sum to 1.

HCT116
Cell line  Truth MIX1/MIX2  CIBERSORT MIX1/MIX2  FARDEEP MIX1/MIX2  FARDEEP rel. MIX1/MIX2
Jurkat     0.2 / 0.05       0.2512 / 0.2682      0.2004 / 0.0515    0.2505 / 0.2588
IM-9       0.2 / 0.05       0.2493 / 0.2404      0.1995 / 0.0479    0.2494 / 0.2410
Raji       0.2 / 0.05       0.2493 / 0.2387      0.1999 / 0.0498    0.2500 / 0.2502
THP-1      0.2 / 0.05       0.2503 / 0.2527      0.1999 / 0.0497    0.2500 / 0.2500

MDA-MB-435
Jurkat     0.2 / 0.05       0.2507 / 0.2801      0.2013 / 0.0552    0.2511 / 0.2667
Raji       0.2 / 0.05       0.2480 / 0.2153      0.2000 / 0.0509    0.2494 / 0.2459
IM-9       0.2 / 0.05       0.2514 / 0.2649      0.2002 / 0.0498    0.2497 / 0.2409
THP-1      0.2 / 0.05       0.2499 / 0.2398      0.2002 / 0.0510    0.2498 / 0.2465

Figure B.2: The y-axes are the sorted expression values of the HCT116 and MDA-MB-435 cell lines for the 584 probesets. The probesets detected as outliers by FARDEEP are marked in red.
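A small numeric sketch of the mixture construction and the score-to-fraction conversion above; the expression vectors are random stand-ins for the GSE11103/GSE10650 profiles, and the printed example uses the FARDEEP MIX2 scores from Table B.1.

```python
import numpy as np

rng = np.random.default_rng(3)
cancer, jurkat, im9, raji, thp1 = rng.lognormal(5.0, 1.0, size=(5, 584))

mix1 = 0.2 * cancer + 0.2 * (jurkat + im9 + raji + thp1)
mix2 = 0.8 * cancer + 0.05 * (jurkat + im9 + raji + thp1)

def to_fractions(beta):
    """Convert absolute TIL subset scores beta_j into relative
    abundances beta_j / sum_k beta_k."""
    beta = np.asarray(beta, dtype=float)
    return beta / beta.sum()

# FARDEEP-style MIX2 scores renormalize to ~0.25 each.
print(to_fractions([0.0515, 0.0479, 0.0498, 0.0497]))
```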
X-related outliers

To evaluate the performance of FARDEEP under a more complex outlier construction, we simulated datasets 50 times following the setting in the section "in silico simulation with varied error types", but let the outliers follow a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of $10 \cdot \max(x_i)$ for the $i$th gene, where $x_i$ is the $i$th row of $X$. In this way, the amplitude of the outliers for the $i$th gene depends on $x_i$. The results in Figure B.3, Figure B.4, and Table B.2 show that FARDEEP dominates the performance of the other methods and retains good outlier-detection accuracy.

Table B.2: Tuned $k$ for FARDEEP with the adjusted BIC. We simulated normally distributed errors and heavy-tailed errors respectively for different proportions of outliers and computed the true positive rate and false positive rate to evaluate the tuning result.

Percentage of outliers        5%     10%    20%    30%
Normal:
  True positive rate          1      1      1      1
  False positive rate         0.007  0.02   0.05   0.06
  Parameter (mean of k)       2.53   1.32   1.16   1.09
Heavy-tailed:
  True positive rate          1      1      1      1
  False positive rate         0.06   0.06   0.13   0.17
  Parameter (mean of k)       2.05   1.24   1.10   1.04

Figure B.3: SSE of coefficients for the different approaches based on simulations with X-related outliers. We simulated different percentages of outliers ({5%, 10%, 20%, 30%}) and compared the SSE of the coefficients for NNLS, DCQ, PERT, CIBERSORT, and FARDEEP. (A) random error with standard normal distribution, (B) random error with t-distribution.

Figure B.4: Comparison of the estimation accuracy of the different deconvolution approaches for simulations with X-related outliers (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} percentages of outliers, we computed R² to evaluate how well the estimators fit the line $\hat{\beta} = \beta$. (A) random error with standard normal distribution, (B) random error with t-distribution. [Panel legends report R² and R for each method at each outlier level; FARDEEP maintains R² ≥ 0.97 and R ≥ 0.99 throughout, while, e.g., CIBERSORT's R² falls below -40 once the outlier proportion reaches 10%.]
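A hedged sketch of this X-dependent outlier scheme using scipy.stats.nct; the matrix shapes, the clean mixture, and the function name are illustrative rather than the actual simulation code.

```python
import numpy as np
from scipy.stats import nct

rng = np.random.default_rng(4)
X = rng.lognormal(2.0, 1.0, size=(584, 22))   # toy signature matrix
y = X @ np.full(22, 0.05)                     # toy clean mixture

def x_dependent_outliers(y, X, frac=0.10):
    """Replace a fraction of genes with draws from a non-central
    t-distribution (1 df) whose non-centrality is 10 * max(x_i),
    tying outlier amplitude to the gene's signature row."""
    y = y.copy()
    idx = rng.choice(y.size, size=int(frac * y.size), replace=False)
    for i in idx:
        y[i] = nct.rvs(df=1, nc=10.0 * X[i].max(), random_state=rng)
    return y

noisy = x_dependent_outliers(y, X)
```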
Discussion of correlated genes

The gene-expression deconvolution problem can be expressed as the linear regression model (B.1). Sometimes the expression values of different genes are highly correlated. To evaluate the performance of FARDEEP when the assumption of independent errors is violated, we simulated datasets 50 times with different proportions of outliers, exactly following the simulation setting in the section "in silico simulation with varied error types" of the paper, except that the error term $\epsilon \sim N(0, \Sigma)$, where

$$\Sigma_{ij} = \begin{cases} 1 & i = j, \\ 0.7 & 1 \leq i, j \leq 20 \text{ and } i \neq j, \\ 0.5 & 21 \leq i, j \leq 40 \text{ and } i \neq j, \\ 0 & \text{otherwise}. \end{cases}$$
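The block covariance Σ defined above, and a draw of correlated errors from N(0, Σ), can be written directly as below; the dimension 584 mirrors the number of signature probesets used in the earlier simulations.

```python
import numpy as np

p = 584
Sigma = np.zeros((p, p))
Sigma[:20, :20] = 0.7          # first correlated block
Sigma[20:40, 20:40] = 0.5      # second correlated block
np.fill_diagonal(Sigma, 1.0)   # unit variances on the diagonal

rng = np.random.default_rng(5)
eps = rng.multivariate_normal(np.zeros(p), Sigma)   # one correlated error draw
```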
[Figure B.5 panels: log2(SSE) versus proportion of outliers for each method under correlated random errors.]

Figure B.6: Comparison of the estimation accuracy of different deconvolution approaches for simulations with correlated random errors (the values in parentheses are R² and R). For {5%, 10%, 20%, 30%} outliers, we computed R² to evaluate how well the estimates fit the straight line $\hat{\beta} = \beta$.

[Figure B.6 panels: estimated versus true coefficients at 5%, 10%, 20%, and 30% outliers. FARDEEP attains R² ≥ 0.97 and R = 0.99 at every outlier level; CIBERSORT declines from R² = 0.96 at 5% outliers to 0.60 at 30%, while NNLS, DCQ, and PERT have negative R² throughout.]

APPENDIX C

Supplementary text for Chapter 4

Transformation between kernels and motifs

We introduced the approach for converting between convolution kernel weights and position weight matrices (PWMs) in Section 4.1.6. Here, we use the known transcription factor binding motif (TFBM) database of Arabidopsis thaliana to validate this method: we compare each known PWM with its converted counterpart using sequence logo figures. We ran all 865 TFBMs and randomly display ten of them in Figure C.1. The results indicate that our transformation approach is valid.

Figure C.1: Sequence logo figures validating the transformation between convolution kernels and motifs. (A) original TFBMs, (B) converted TFBMs.
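For orientation, a minimal sketch of one common kernel-to-PWM scheme is given below. The exact conversion is the one defined in Section 4.1.6 (not reproduced here); the softmax normalization and the information-content formula used for logo heights are assumptions of this sketch, not necessarily the formulas used in the dissertation.

```python
import numpy as np

def kernel_to_pwm(kernel: np.ndarray) -> np.ndarray:
    """Convert a convolution kernel (4 x width; rows = A, C, G, T) to a PWM
    by treating each column as unnormalized log-probabilities (softmax)."""
    w = kernel - kernel.max(axis=0, keepdims=True)  # stabilize the exponentials
    p = np.exp(w)
    return p / p.sum(axis=0, keepdims=True)

def logo_heights(pwm: np.ndarray) -> np.ndarray:
    """Per-position information content in bits, i.e. the column heights of
    a sequence logo (maximum 2 bits for the 4-letter DNA alphabet)."""
    eps = 1e-9
    entropy = -(pwm * np.log2(pwm + eps)).sum(axis=0)
    return 2.0 - entropy

# Quick check on a random kernel of width 36, the logo width in Figure C.1.
kernel = np.random.default_rng(2).normal(size=(4, 36))
pwm = kernel_to_pwm(kernel)
assert np.allclose(pwm.sum(axis=0), 1.0)  # every column is a valid probability vector
print(logo_heights(pwm).round(2))
```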
[Figure C.1 (cont'd): paired sequence logos, 36 positions wide with column heights in bits; in each pair, (A) is the original TFBM and (B) its converted counterpart.]

Most important motifs for each stress condition

For each profiled stress condition and time point, the model identifies the three most important TFBMs (Top1, Top2, Top3). The conditions comprise heat (Heat_0.25 through Heat_3.9, eight time points), the pathogen treatments DC3000, HrcC, Psph, and avrRpm1 (2, 6, and 24 h) and P. infestans (6, 12, and 24 h), the elicitors Flg22, GST.NPP1, and HrpZ (1 and 4 h), cold at 4°C (3, 6, 12, and 24 h), drought (3 and 6 h), genotoxic stress (6 and 12 h), osmotic stress (1, 3, 6, 12, and 24 h), salt (3, 6, 12, and 24 h), UV-B (0.25, 1, 3, 6, 12, and 24 h), and wounding (0.25, 0.5, 1, 3, and 12 h). The motifs appearing most frequently among the top three are C2H2_tnt.At5g66730_col_m1.pwm and the WRKY motifs WRKY8, WRKY11, WRKY29, and WRKY55 (chiefly for the pathogen, elicitor, and UV-B conditions), followed by MYB-related motifs (TRP1, RVE1, At4g01280, At5g08520), BZR_tnt.At4g18890_col_a_m1.pwm, S1Falike_tnt.AT3G09735_col_a_m1.pwm, GATA motifs (C2C2gata_tnt.GATA15_col_b_m1.pwm, GATA_GATA3.pwm), bZIP motifs (bZIP16, bZIP48, bZIP68), bHLH motifs (MYC4, BIM2, BHLH121), C2H2 motifs (JKD, At5g04390, C2H2-ZF_AT3G53600), AP2/EREBP motifs (At1g22810, ERF3, ERF11, SHN3, CRF3), MADS-box_43709.pwm, HB_tnt.HDG7_col_a_m1.pwm, G2like_tnt.KAN2_col_a_m1.pwm, Trihelix_tnt.AT5G05550_col_a_m1.pwm, C2C2dof_tnt.OBP3_col_a_m1.pwm, MYB_tnt.MYB73_col_a_m1.pwm, Myb-SANT_MYB15.pwm, and NAC_tnt.ANAC042_col_a_m1.pwm.
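As one illustration of how such per-condition rankings can be computed, the sketch below scores each first-layer kernel by occlusion: zero out one kernel's feature map and measure the drop in the prediction for each stress output. This is a generic attribution recipe on a toy stand-in model; the architecture, sizes, and scoring rule here are assumptions for illustration, not the interpretation method actually used in Chapter 4.

```python
import torch
import torch.nn as nn

# Toy stand-in: one conv layer over one-hot DNA, global max pooling, and a
# linear head with one output per stress condition. All sizes illustrative.
class TinyDNAModel(nn.Module):
    def __init__(self, n_kernels=32, width=36, n_conditions=37):
        super().__init__()
        self.conv = nn.Conv1d(4, n_kernels, kernel_size=width)  # valid convolution
        self.head = nn.Linear(n_kernels, n_conditions)

    def forward(self, x, drop_kernel=None):
        h = torch.relu(self.conv(x))            # (batch, n_kernels, positions)
        if drop_kernel is not None:
            h = h.clone()
            h[:, drop_kernel, :] = 0.0          # occlude one kernel's feature map
        return self.head(h.max(dim=2).values)   # global max pool, per-condition scores

model = TinyDNAModel().eval()
x = torch.randn(8, 4, 1000)                     # batch of (one-hot-like) sequences

with torch.no_grad():
    base = model(x)
    # Importance of kernel k for condition c = mean drop in the prediction
    # for c when kernel k is occluded.
    importance = torch.stack(
        [(base - model(x, drop_kernel=k)).mean(dim=0) for k in range(32)]
    )                                           # (n_kernels, n_conditions)

top3 = importance.topk(3, dim=0).indices        # top-3 kernels per condition
```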