Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares

Gene-expression deconvolution is used to quantify different types of cells in a mixed population. It provides a highly promising solution to rapidly characterize the tumor-infiltrating immune landscape and identify cold cancers. However, a major challenge is that gene-expression data are frequently contaminated by many outliers that decrease the estimation accuracy. Thus, it is imperative to develop a robust deconvolution method that automatically decontaminates data by reliably detecting and removing outliers. We developed a new machine learning tool, Fast And Robust DEconvolution of Expression Profiles (FARDEEP), to enumerate immune cell subsets from whole tumor tissue samples. To reduce noise in the tumor gene expression datasets, FARDEEP utilizes an adaptive least trimmed square to automatically detect and remove outliers before estimating the cell compositions. We show that FARDEEP is less susceptible to outliers and returns a better estimation of coefficients than the existing methods with both numerical simulations and real datasets. FARDEEP provides an estimate related to the absolute quantity of each immune cell subset in addition to relative percentages. Hence, FARDEEP represents a novel robust algorithm to complement the existing toolkit for the characterization of tissue-infiltrating immune cell landscape. The source code for FARDEEP is implemented in R and available for download at https://github.com/YuningHao/FARDEEP.git.


Introduction
Immune checkpoint blockade has revolutionized the rational design of neoadjuvant cancer therapies. Compelling evidence suggests that a favorable tumor immune microenvironment underpins better clinical responses to radiotherapy, chemotherapy, and immunotherapy [1][2][3]. Immunohistochemistry (IHC)-based immunoscores, which quantify the number of CD8 + cytotoxic T lymphocytes and CD45RO + memory T cells, show better prognostic potential than conventional pathological methods in colon cancer patients [4,5]. Hence, harnessing the composition of intra-tumoral immune cell infiltration is a highly promising approach to stratify tumors [6][7][8][9][10][11]. The current IHC immunoscoring approach has two limitations. First, the interpretation of immune cell subsets varies among pathologists and institutions, thus lacking a consistent standard for the scoring practice. Second, only a limited number of biomarkers can be assessed simultaneously, which prevents a comprehensive annotation of the immune contexture in the tumor microenvironment (TME). Hence, robust methods for genome datainformed cell type quantitation are in urgent need.
Immunogenomics presents an unprecedented opportunity to resolve the intra-tumoral immune landscape. Cell type deconvolution using leukocyte signature gene expression profiling is a highly promising approach to estimate the global immune cell composition from whole tumor gene expression data [12][13][14][15][16][17]. However, a significant technical obstacle is that the efficacy and accuracy of gene expression deconvolution are limited by the large number of outliers, which are frequently observed in tumor gene expression datasets [18]. The first step towards enhancing the overall gene deconvolution algorithms is to improve methods for outliers identification and processing. Those outliers include genes with abnormal expression value which may be caused by measurement error, environmental effect, expression from nonimmune cells, or natural fluctuations in certain type of immune cells. Notably, the current immune deconvolution gene signature matrix relies on the profiling of differentially expressed genes among different immune subsets. Frequent contamination of transcripts reading from cancer cells may significantly bias the algorithms. In this study, we report a novel FAst and Robust DEconvolution of Expression Profiles (FARDEEP) method that significantly improves the estimation of coefficients.
Let y i be the observed expression value for the ith gene; x i , a p-dimensional vector, be the expected expression of the ith gene for the p different cell types; and X = [x 1 , . . ., x n ] 0 be the signature matrix. The gene-expression deconvolution problem can be formulated as follows, where β 2 R p is an unknown parameter corresponding to the compositions of p cell types, and ε i is a noise term with a mean of 0. Several methods were proposed to solve this deconvolution problem. To enforce the non-negativity of β in (0.1), several algorithms, such as the Non-Negative Least Square (NNLS), Non-negative Maximum Likelihood (NNML) frameworks and the perturbation model (PERT) were developed. They all rely on the signature matrix (X) derived from Microarray experiments [14,[19][20][21][22][23][24]. To extend this work to RNAseq data, Finotello et al. [14] proposed a constraint linear model with a signature matrix derived from RNA-seq data. Additionally, the gene expression of each cell may vary depending on its microenvironment and other factors, which will lead to a biased estimation. To address this issue, Microarray Microdissection with Analysis of Differences (MMAD) incorporates the concept of the effective RNA fraction and estimates coefficients using a maximum likelihood approach [25]. To further adapt deconvolution to high-dimensional settings, Altboum et al. [26] proposed a penalized regression framework, Digital Cell Quantifier (DCQ), to encourage sparsity for the estimated β using the elastic net [27]. Cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT) uses ν-support vector regression (ν-SVR) to enhance the robustness of gene expression deconvolution. CIBERSORT performs a regression by finding a hyperplane that fits as many data points as possible within a tube whose vertical length is a constant ε [12]. The ε-tube provides a region in which estimation errors are ignored. This model does not include an intercept to capture contributions of other contents. Additionally, to increase the computational efficiency, CIBERSORT applies Z-normalization to the data before fitting the regression, which may introduce estimation bias. Based on the CIBERSORT framework, several extensions have been proposed to overcome limitations such as platform inconsistency between signature and mixture matrices and low estimation accuracy for γδ T cell [15][16][17]. However, the quantitative information of cell proportions of these two approaches is built on CIBERSORT whose performance may be challenged by frequent outliers in whole tumor tissue transcriptomes. To reduce the dependence on the signature matrix, xCell utilizes the concept of single-sample gene set enrichment analysis (ssGSEA) to calculate an immune cell score which could predict the enrichment of immune cells [13]. Despite its robustness, xCell relies much on the ranking of gene expression value which leads to suboptimal solution for the estimation accuracy. Overall, a robust method that determines both the distribution and absolute volume of tumor-infiltrating lymphocytes (TILs) will further improve the current gene deconvolution pipeline.
To handle the heavily contaminated gene expression data and provide absolute cell abundance estimation, we developed a robust method based on the Least Trimmed Square (LTS) framework [28,29]. LTS finds h observations with smallest residuals, and the estimatorβ is the least squares fit over these h observations. LTS is an NP-hard problem, and Rousseeuw and Driessen [30] proposed a stochastic FAST-LTS algorithm. Nevertheless, it may give a suboptimal fitting result and get much slower when the sample size and dimension of variables become larger and higher since its accuracy relies on the initial random h-subsets and the number of initial subsets. When n is the sample size and p is the number of coefficients, h is suggested to be the smallest integer that is not less than (n + p + 1)/2 to remove as many outliers as possible while keeping an unbias result. Using the information of only half of the data reduces the power of the estimator because the amount of outliers in the real case cannot be presumed and can be small. Xu et al. [31] proposed an adaptive least trimmed square which is not limited to the randomness of initial subset but only applied the binary dataset. In this study, we extend the adaptive least trimmed square to introduce a model-free method, which could find the number of outliers automatically based on LTS. FARDEEP provides a flexible framework which is suitable for both Microarray and RNA-seq data using LM22 and Immunostate signature matrices respectively. As evidence of high fidelity and robustness, FARDEEP exhibits superior performance in simulated and real-world datasets.

Model formulation
The usual linear deconvolution model can be expressed as below, where y 2 R n is the observed expression data for n immune subset signature genes, X 2 R n�p denotes a mean gene expression signature matrix for p different cell types, β 2 R p contains each unknown cell type abundance, and ε 2 R n is a vector of random errors with zero mean and variance of σ 2 I. To incorporate outliers, we propose the following model ð0:2Þ where parameter τ = (τ 1 , . . ., τ n ) 0 is a sparse vector in R n with τ i 6 ¼ 0 indicating the ith gene is an outlier. Under the formulation of (0.2), letβ ols ¼ ðX > XÞ À 1 X > y be the Ordinary Least Square (OLS) estimate and H = X(X > X) −1 X > be the projection matrix. The residuals r = (r 1 , . . ., r n ) using OLS could be formulated as r ¼ y À Xβ ols ¼ ðI À HÞτ þ ðI À HÞ�: ð0:3Þ with mean of (I − H)τ and variance of σ 2 (I − H).

Adaptive least trimmed square
From (0.3), the residuals, r i with the corresponding τ i 6 ¼ 0, would deviate from zero, which suggests that the set of outliers can be identified through thresholding as follows where E is the set of detected outliers, k is a tuning parameter controlling the sensitivity of the model, and r med is the median of fjrj i g n i¼1 . We denote the number of elements in set E as |E| and let N be the number of true outliers in the data. First, we can use least squares and formula (0.4) to obtain a rough estimate of E denoted asÊ. Let the cardinality ofÊ be � N . Since the model at this moment is inaccurate with contamination of outliers, � N is an overestimation of N which can be used to get an underestimate via N ¼ a 1 � N with α 1 2 (0, 1). With N , we can then update the least square fitting after removing the N samples with the largest absolute value of residuals and obtain an improved estimate of E and the corresponding � N . We can improve the model by repeating the procedure, but we need to increase the underestimate of outliers, N , by a factor of α 2 with α 2 > 1 for each iteration to force the convergence between � N and N . In sum, we initialize our algorithm by settinĝ β ð0Þ ¼ ðX > XÞ À 1 X > y; where the min(�, �) operator guarantees that � N ðjÞ , an overestimation of N, is non-increasing. Similarly, we update N ðjÞ through where dxe means the ceiling of x 2 R, α 1 2 (0, 1) is used to obtain a lower bound for N in the first step, α 2 > 1 guarantees the monotonicity of N ðjÞ , and the min(�, �) operator guarantees N ðjÞ is smaller than � N ðjÞ . Then we updateβ and r after removing N ðjÞ outliers bŷ We repeat this procedure until N and � N converge. Hence, we hereby report a novel approach, coined as adaptive Least Trimmed Square (aLTS), to automatically detect and remove contaminating outliers. Our aLTS is an extension of the iterative LTS algorithm proposed by Xu et al. [31] which is designed for binary output such as the comparison between two images or videos.

FARDEEP
Because the abundance of cell types are always non-negative, we replaced the OLS regression in the aLTS procedure with non-negative least square regression (NNLS). By applying the modified aLTS to the deconvolution model (0.2) and solving the following problem, β ¼ argmin β ky À Xβk 2 2 ; subject to β � 0 using Lawson-Hanson algorithm [19], we developed a robust tool, FARDEEP, for cellular deconvolution summarized in Algorithm 1.
One unique advantage of FARDEEP is that it is fast and guarantees to converge within finite steps, which is summarized in the following theorem.

5:
solving the NNLS problem after removing N ðjÞ genes with largest residuals, and updateβ ðjÞ , r (j) ; 6: until � N ¼ N. Output: Coefficientsβ, Number of outliersN , Index of outliers Theorem 1 Algorithm 1 (FARDEEP) stops in no more than j � steps, where Here b�c is the largest integer that is less than or equal to x. Proof. It follows from the fact that the sequence f � N ðjÞ g is non-increasing, and fN ðjÞ g is a geometrically increasing sequence that is bounded by the smallest component of f � N ðjÞ g. Specifically, assume that j � steps have been taken in FARDEEP, then j has approached j � − 1, and which leads to the result. Theβ from FARDEEP, denoted as TIL subset score, is the direct estimate of the linear model without any normalization and hence reflects the absolute abundance of TILs. In addition, we can derive the relative TILs abundance from the TIL subset scores through ð0:7Þ whereb j is the jth TIL subset score. In practice, the TIL subset score provides important information of patient's tumor-infiltrating immune landscape, and we have included a discussion in S2 Text.

Parameter tuning
There are three tuning parameters k, α 1 , and α 2 in FARDEEP. Since α 1 is only used in the first iteration, a relatively small α 1 is preferred to ensure that FARDEEP does not remove too many outliers at the first step. In practice, FARDEEP is not sensitive to different values of α 1 , and α 2 , so we set them to 0.1 and 1.5 respectively by default. However, k controls the number of outliers in each iteration and is critical for the performance of FARDEEP. Thus, we tune k on a case-by-case basis for each sample to preserve meaningful fluctuations of gene expression levels. Effects for different tuning parameters are shown in S1 Table. Since the test group may contain outliers that influence the accuracy of the tuning result, cross-validation is not advised. Instead, we applied the Bayesian Information Criterion (BIC) and assume that the errors follow a log-normal distribution instead of a normal distribution among gene expression datasets as suggest by Beal [32]. We define the modified BIC referring to the setting of She and Owen [33]: whereÊ being the set of detected outliers, b is number of parameters and equalsN þ p þ 1 withN ¼ jÊj being the number of outliers, and m equals n ÀN . Then, we choose the value of k associated with the smallest BIC � .

Results
To test the performance of FARDEEP, we compared our approach with the existing methods using numerical simulations and real datasets. Here, we list the outlier genes detected by FAR-DEEP for real datasets in S4 Table. We use LM22 signature matrix containing 22 immune cell types hematopoietic cells for Microarray data and use quanTIseq signature matrix containing 10 immune cell types for RNA-Seq data. To compare the performance of different methods, we report the sum of squared error (SSE), the coefficient of determination denoted as Rsquared (R 2 ) and the Pearson correlation (R) defined as follows bÞ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where β � is the ground truth, andβ is the estimate.

In silico simulation with varied error types
To test the robustness of FARDEEP under different error conditions, we simulated three datasets refer to the setting in [33,34] with normally distributed errors, heavy tailed errors. The observations were generated based on the linear regression model (0.2). The predictor matrix is X = (x 1 , . . ., x n ) 0 = UΣ 1/2 , where U ij � Uð0; 20Þ and Σ ij ¼ r I fi6 ¼jg with ρ = 0.5. Consider the proportion of outliers f 2 {5%, 10%, 20%, 30%}, sample size n = 500, and number of predictors p = 20, we added random errors and outliers to the simulated data as follows: • Random errors: we generated the random error vector from i) standard normal distribution, ii) t-distribution with 3 degrees of freedom.
• Vertical outliers: we generated a n dimensional zero vector τ and randomly selected nf elements in τ to be the outliers generated from a non-central t-distribution with 1 degree of freedom and a non-centrality parameter of 30.
• Leverage points: we took 20% of the contaminated data as leverage points, that is, replacing the corresponding predictors by the samples from N ð2maxðXÞ; 1Þ.
The coefficients β j were sampled from Uð0; 1Þ, where j = 1, . . ., p. Based on the framework above, the dependent variable could be obtained by We simulated each model 50 times. As shown in Figs 1 and 2, FARDEEP outperforms other methods, evidenced by the SSE, R 2 and R values.
To check FARDEEP's accuracy of outlier detection, we simulated {5%;10%;20%;30%} outliers using the same method as above for a model with both normally distributed and heavytailed noise. As shown in Table 1, the tuning parameter k decreases when the amount of outliers becomes larger, and the true positive rates always stay around 1, indicating that the tunning of k is highly effective.
In the supplementary material S3 Text, we also included another outlier construction scheme with X related outliers and a simulation setting with correlated responses. In both scenarios, FARDEEP dominates other methods in terms of SSE, R 2 and R values.

In silico simulation based on leukocyte gene signature matrix file
Following the similar procedure as in Newman et al., we randomly generated the abundance of different cells from interval [0, 1] [12]. Notably, the sum of cell abundance is not necessarily 1. The measurement errors were sampled from 2 N ð0; ð0:1 log 2 ðsÞÞ 2 Þ . To incorporate outliers, we randomly selected i/50 of the data and replaced them with data drawn from 2 N ð10; ð0:3 log 2 ðsÞÞ 2 Þ where i = 1, 2, . . ., 25 and s is the standard deviation of original mixtures.
We repeated the procedure nine times and estimated the cell type abundance using FAR-DEEP, CIBERSORT (without converting to percentage), NNLS, PERT, and DCQ. As shown in S2 Table, we found that the SSE range for FARDEEP is 1.51 × 10 −7 to 1.47 × 10 −4 , R 2 and R keeps being 1 regardless of the number of outliers, while Other methods show significantly larger SSE and smaller R 2 , R.

Synthetic dataset
We used the cell line dataset GSE11103 generated by Abbas et al. [35] that contains gene expression profiles of four immune cell lines (Jurkat, IM-9, Raji, and THP-1) and four mixtures (MixA, MixB, MixC, and MixD) with various ratios of cells. Before analysis, we quantile normalized the mixture data for 54675 probesets and downloaded the immune gene signature matrix with 584 probesets from CIBERSORT website. Then, we applied five deconvolution  methods, including FARDEEP, CIBERSORT (without converting to percentage), DCQ, NNLS, and PERT, to calculate the sum of squared errors of the estimated abundance of the four immune cell lines. We also compared with CIBERSORT absolute mode, which is a beta version in CIBERSORT website (S1 Fig). Since the CIBERSORT absolute mode is a beta version and leads to suboptimal results compared with CIBERSORT, we only focused on the comparisons with CIBERSORT. We show that FARDEEP gives the smallest SSE and the largest R 2 , which indicates the most accurate result (Fig 3).

Synthetic dataset with added unknown content
Both CIBERSORT and FARDEEP are robust deconvolution methods and show advantages in the above datasets, we next sought to compare their performances on mixtures with unknown content. We followed the simulation setting proposed by Newman et al. [12] and downloaded the signature gene file from CIBERSORT website. The mixture file was constructed from the four immune cell lines data, as mentioned in the previous section, and a colon cancer cell line HCT116 (average of GSM269529 and GSM269530 in GSE10650). Cancer cells were mingled into immune cells at different ratios {0%, 30%, 60%, 90%}. Noise was added by sampling from the distribution 2 N ð0;ðf log 2 ðsÞÞ 2 Þ , in which f 2 {0%, 30%, 60%, 90%} and s is the standard deviation of original mixtures. By applying FARDEEP and CIBERSORT (without converting to percentage) on 64 mixtures, we found that FARDEEP remains an accurate estimation, while the tumor contents skew the results of CIBERSORT with larger deviation from the ground truth (Fig 4).

Deconvolution performance on immune-cell-rich datasets
To evaluate the performance of FARDEEP in immune-cell-rich settings that are less affected by outliers, we downloaded and analyzed two gene expression datasets (GSE65135 [12] and GSE20300 [36]) generated from the Affymetrix Microarray, which is the same platform used to generate the signature matrix LM22. The GSE65135 dataset consists of (i) surgical lymph node biopsies of 14 follicular lymphoma patients and (ii) purified B and T cells from the tonsils of 5 healthy controls, and the GSE20300 dataset includes 24 blood samples from pediatric renal transplant patients. Flow cytometry or coulter counter data in these studies, which are presented in relative scales, are treated as ground truth. Thus, we normalized the estimated parameters of each method to the sum of 1 before comparison.
As shown in Fig 5A and 5B for case (i) of GSE65135 and Fig 5D and 5E for GSE20300, FARDEEP outperformed CIBERSORT in terms of R 2 , R and SSE, which is consistent with our findings with simulated datasets. For case (ii) of GSE65136, we estimated the immune cell composition for purified B and T cells with purity level exceeding 95% and 98%, respectively. For purified B cells, CIBERSORT tends to return non-zero estimates for T cell and a large    proportion of other cell types, while FARDEEP gave almost all zero estimates for T cell and on average reduced the estimation error by 61%. Similarly, for the purified T cell, although CIBERSORT had a better performance compared to purified B cell, FARDEEP still significantly improves the estimation accuracy by reducing on average 48% of the estimation error ( Fig 5C). Furthermore, as shown in S4 Table, FARDEEP detected gene CD79A and BCL2A1 as outliers for most samples in case (i) of GSE65135. These two genes are known to have high expression levels in follicular lymphoma (B-cell lymphoma) cells [37]. Overall, even in specimens that are rich in immune cells without contamination by nonhematopoietic malignancy, FARDEEP still outperforms CIBERSORT in immune cell deconvolution.

Deconvolution performance on RNA-seq datasets
In addition to effectively handling Microarray data, FARDEEP can also deconvolve TILs using RNA-seq data when we replace the signature matrix LM22 with quanTIseq, a signature matrix generated from RNA-seq data containing ten different immune cell types [14]. We applied CIBERSORT and FARDEEP using signature matrix quanTIseq to peripheral blood mononuclear cell (PBMC) mixtures (GSE64655) generated by Hoek et al. [38], and lymph node bulk samples of 4 melanoma patients from GSE93722 [39]. Flow cytometry data in these studies are on a relative scale and are treated as ground truth. We normalized the estimated parameters of each method to a relative scale using (0.7) before comparison. The RNA-seq data are usually less noisy compared to Microarray, and PBMC datasets are usually clean with less unknown contents. Therefore, we expect FARDEEP and CIBERSORT will return comparable results, which is the case in Fig 6A and 6B. However, when dealing with noisier data containing more outliers such as lymph node bulk samples, FARDEEP obtained larger advantage over CIBER-SORT as shown in Fig 6C and 6D. Ovarian serous cystadenocarcinoma and lung squamous cell carcinoma datasets TME of solid carcinomas are different from a lymph node biopsy or peripheral blood, and the highly variable gene expression in cancer cells challenges the accuracy of immune cell deconvolution. It is well-established that immune infiltration profile serves as a promising prognosticator [4,5]. Hence, we next utilized survival and gene expression data of ovarian cancer (OV) and lung squamous cell carcinoma (LUSC) from The Cancer Genome Atlas (TCGA) database to assess the prognostic relevance of different deconvolution methods. These two datasets were chosen because only LM22 not the RNA-seq based signature matrix quanTIseq includes γδ T cells, and OV and LUSC from TCGA datasets are the only two cancer types with Affymetrix microarray data. Using gene expression data (n = 514 for OV and n = 133 for LUSC), we estimated the immunoscore using ESTIMATE proposed by yoshihara et al. [40], TILs proportion using CIBERSORT, as well as TILs subset scores using CIBERSORT (without converting to percentage) and FARDEEP. Cold tumors typically harbor lower numbers of CD8 + T cells, γδ T cells, M1-like macrophages, and NK cells [11,[41][42][43]. Thus, we calculated an anti-tumor immune subsets score by the summation of CD8 + T cells, γδ T cells, M1-macrophages, and NK cells. Then, we partitioned the patients into two groups with equal size using the median of either the immunoscore (ESTIMATE) or anti-tumor immune subsets score (CIBERSORT and FARDEEP). We compared the survival curves between the two groups. As shown in Fig 7, FARDEEP most effectively separates patients into high-and low-risk groups with the smallest p-value (p-value = 0.0065 and 0.059 for OV and LUSC respectively). Recently, CIBERSORT website supports a beta-version of an absolute mode for cell deconvolution. We also included CIBERSORT absolute mode in this survival analysis and showed that it returned a better result (p-value = 0.037) compared to the relative mode in the OV dataset. FARDEEP shows a stronger performance with a smaller p-value under this setting (S2 Fig). These results demonstrated  that the TIL subset scores could provide additional clinical-relevant information compared to the relative abundance.
In addition, we expected the summation of these TIL subset scores would negatively correlate with tumor purity. To prove this hypothesis, we calculated the summation of 22 TIL subset scores for both OV and LUSC datasets and correlated them with the tumor purity estimated from consensus measurement of purity estimations (CPE) [44]. Even without taking account of stromal cells, as shown in S3 Fig. the summation of TIL subset scores is negatively correlated with tumor purity.
Next, we sought to investigate whether outlier removal reduces contamination by transcripts from cancer cells. We first identified those top outlier-genes, which were consistently removed by FARDEEP in the OV dataset and obtained the average expression values of those outlier-genes from OV cell lines in GSE32474 [45]. As shown in S3 Table, most of these outlier-genes have high expression in cancer cell lines. For example, CXCL10 gene encodes an important chemokine to recruit CD8 + T cells and is also highly expressed in ovarian cancer cells. Thus, although some genes in LM22 may play a role in immune cells, they may be also highly expressed and variable among cancer cells. Such cross-contamination likely skews immune deconvolution analysis. As shown in S3 Table, FARDEEP successfully detected and removed those genes, leading to a more robust and accurate deconvolution analysis.

Discussion
The cancer immune microenvironment has emerged as a critical prognostic dimension that modulates patient responses to neoadjuvant therapy. However, the current clinical TNM staging system does not have a consistent method to stratify cancers based on their immunogenicity. The recent study shows that the RNA-seq datasets of whole tumors contain valuable prognostic information to assess the cancer-immunity interactions [12,46]. But the current methods to extract immune signatures are susceptible to the frequent outliers in the datasets, leading to less effective identification of cold tumors. Based on support vector regression, CIBERSORT is one of the most popular robust deconvolution methods. However, this model does not include an intercept to capture possible contribution from other cell types and performs a z-normalization to the data before fitting the regression model, which introduces biases into the output. Discussion of the effect of Z-score normalization for CIBERSORT is included in S1 Text. In this study, we developed a new machine learning tool, FARDEEP, to streamline the removal of outliers and increase the robustness of gene-expression profile deconvolution. Robustness is an indispensable feature to solve a problem of deconvolution because gene expression data are frequently contaminated by a large amount of outliers. FAR-DEEP solves the deconvolution problem in a robust way because this tool evaluates all outliers across the datasets and then examines the true immune gene signature using non-negative regression. This feature is especially useful to analyze tumors with significant non-hematopoietic tumor components. Interestingly, although FARDEEP and the current robust methods can both tackle immune-cell-rich specimens such as lymph node and PBMCs, FARDEEP exhibits improved prognostic potential when dealing more complex datasets with significant carcinoma cell content.
Although FARDEEP provides a robust computational algorithm to better solve the geneexpression deconvolution problem with noisy datasets, its performance and application rely on the choice of the signature matrix. To avoid estimation bias, it is important to choose the signature matrix derived from the same platform as the mixture matrix. For example, if dealing with gene expression data measured by Affymetrix HGU133A, we should use LM22, but if dealing with RNA-seq data, the signature matrix quanTIseq is preferred. Overall, here we show that FARDEEP is a powerful and rapid machine learning tool that outperforms existing robust methods for gene deconvolution in datasets with significant heavy-tailed noise. FAR-DEEP provides a new technology to interrogate cancer immunogenomics and more accurately map the immune landscape of solid tumors.
Supporting information S1 Table. Parameters of FARDEEP. To show that FARDEEP is not sensitive for different values of α 1 , and α 2 with tuned value of k, we simulated a dataset with sample size n = 500, number of predictors p = 20, normal distributed error and 20% outliers using the same setting of in silico simulation in the paper. Then we ran FARDEEP with following setting and get the number of detected outliers, true and false positive rate: (1) Take α 2 = 1.1, change α 1 from 0.1 to 0.5 by 0.05 and tune k using BIC � . (2) Take α 1 = 0.1, change α 2 from 1.1 to 2 by 0.1 and tune k using BIC � . (3) Take α 1 = 0.1, α 2 = 1.5, and change k from 1 to 10 by 0.1. We can see that the accuracy of the result stays stable with a well tuned k. (PDF) S2