An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets

Background Host immune response is coordinated by a variety of different specialized cell types that vary in time and location. While host immune response can be studied using conventional low-dimensional approaches, advances in transcriptomics analysis may provide a less biased view. Yet, leveraging transcriptomics data to identify immune cell subtypes presents challenges for extracting informative gene signatures hidden within a high dimensional transcriptomics space characterized by low sample numbers with noisy and missing values. To address these challenges, we explore using machine learning methods to select gene subsets and estimate gene coefficients simultaneously. Results Elastic-net logistic regression, a type of machine learning, was used to construct separate classifiers for ten different types of immune cell and for five T helper cell subsets. The resulting classifiers were then used to develop gene signatures that best discriminate among immune cell types and T helper cell subsets using RNA-seq datasets. We validated the approach using single-cell RNA-seq (scRNA-seq) datasets, which gave consistent results. In addition, we classified cell types that were previously unannotated. Finally, we benchmarked the proposed gene signatures against other existing gene signatures. Conclusions Developed classifiers can be used as priors in predicting the extent and functional orientation of the host immune response in diseases, such as cancer, where transcriptomic profiling of bulk tissue samples and single cells are routinely employed. Information that can provide insight into the mechanistic basis of disease and therapeutic response. The source code and documentation are available through GitHub: https://github.com/KlinkeLab/ImmClass2019. Electronic supplementary material The online version of this article (10.1186/s12859-019-2994-z) contains supplementary material, which is available to authorized users.

a sample [5]. Recent advances in technology, like mass cytometry or multispectral imaging, have expanded the number of molecular markers, but the number of markers used for discriminating among cell types within a sample remains on the order of 10 1. 5 .
In the recent years, quantifying tumor immune contexture using bulk transcriptomics or single-cell RNA sequencing data (scRNA-seq) has piqued the interest of the scientific community [6][7][8][9][10]. Advances in transcriptomics technology, like RNA sequencing, provide a much higher dimensional view of which genes are expressed in different immune cells (i.e., on the order of 10 3 ) [11]. Conceptually, inferring cell types from data using an expanded number of biologically relevant genes becomes more tolerant to non-specific noise and non-biological differences among samples and platforms. In practice, cell types can be identified using gene signatures, which are defined as sets of genes linked to common downstream functions or inductive networks that are co-regulated [12,13], using approaches such as Gene Set Enrichment Analysis (GSEA) [12]. However, as microarray data can inflate detecting low abundance and noisy transcripts and scRNA-seq data can have a lower depth of sequencing, opportunities for refining methods to quantify the immune contexture using gene signatures still remain.
Leveraging transcriptomics data to identify immune cell types presents analytic challenges for extracting informative gene signatures hidden within a high dimensional transcriptomics space that is characterized by low sample numbers with noisy and missing values. Typically, the number of cell samples is in the range of hundreds or less, while the number of profiled genes is in the tens of thousands [14]. Yet, only a few number of genes are relevant for discriminating among immune cell subsets. Datasets with a large number of noisy and irrelevant genes decrease the accuracy and computing efficiency of machine learning algorithms, especially when the number of samples are very limited. Hence, feature selection algorithms may be used to reduce the number of redundant genes [15]. Using feature selection methods enable developing gene signatures in different biomedical fields of study [16]. There are many proposed feature selection methods that can select gene sets that enable classifying samples with high accuracy. In recent years, regularization methods have became more popular, which efficiently select features [17] and also control for overfitting [18]. As a machine learning tool, logistic regression is considered to be a powerful discriminative method [18]. However, logistic regression alone is not applicable for high-dimensional cell classification problems [19]. On the other hand, hybrid methods, like regularized logistic regression, have been successfully applied to high-dimensional problems [20]. Regularized logistic regression selects a small set of genes with the strongest effects on the cost function [17]. A regularized logistic regression can be also be applied with different regularization terms. The most popular regularized terms are LASSO, Ridge [21], and elastic-net [22], which impose the l1 norm, l2 norm, and linear combination of l1 norm and l2 norm regularization, respectively, to the cost function. It has been shown that, specifically in very high dimensional problems, elastic-net outperforms LASSO and Ridge [17,22].
In this study, we focused on two-step regularized logistic regression techniques to develop immune cell signatures and immune cell and T helper cell classifiers using RNA-seq data for the cells highlighted in bold in Fig. 1. The first step of the process included a pre-filtering phase to select the optimal number of genes and implemented an elastic-net model as a regularization method for gene selection in generating the classifiers. The pre-filtering step reduced computational cost and increased final accuracy by selecting the most discriminative and relevant set of genes. Finally, we illustrate the value of the approach in annotating gene expression profiles obtained from singlecell RNA sequencing. The second step generated gene signatures for individual cell types using selected genes from first step and implemented a binary regularized logistic regression for each cell type against all other samples.

Results
We developed classifiers for subsets of immune cells and T helper cells separately with two main goals. First, we aimed to annotate RNA-seq data obtained from an enriched cell population with information as to the immune cell identity. Second, we developed gene signatures for different immune cells that could be used to quantify the prevalence from RNA-seq data obtained from a heterogeneous cell population. Prior to developing the classifiers, the data was pre-processed to remove genes that have low level of expression for most of samples (details can be found in Methods section) and normalized to increase the homogeneity in samples from different studies and to decrease dependency of expression estimates to transcript length and GC-content. Genes retained that had missing values for some of the samples were assigned a value of -1. Next, regularized logistic regression (elastic-net) was performed and the optimal number of genes and their coefficients were determined.

Generating and validating an immune cell classifier
In developing the immune cell classifier, we determined the optimal number of genes in the classifier by varying the lambda value used in the regularized logistic regression of the training samples and assessing performance. To quantify the performance using different lambdas, a dataset was generated by combining True-Negative samples, which were created using a bootstrapping approach that randomly resampled associated genes and their corresponding value from the testing datasets to create a synthetic dataset of similar size and complexity, with the original testing data, which were untouched during training and provided True-Positive samples. The accuracy of predicting the True-Positive samples were used to generate Receiver Operating Characteristic (ROC) curves (Fig. 2a). Performance using each lambda was quantified as the Area Under the ROC Curve (AUC).
The optimal lambda for immune cell classifier was the smallest value (i.e., highest number of genes) that maximized the AUC. Functionally, this lambda value represents the trade-off between retaining the highest number of informative genes (i.e., classifier signal) for developing the gene signature in the second step, while not adding non-informative genes (i.e., classifier noise). Consequently, we selected a lambda value of 1e-4 (452 genes) for the immune cell classifier, where the selected genes and their coefficients are shown in Additional file 1: Table S1.
To explore correlations between the weights of selected genes with their expression level, we generated heatmaps shown in Fig. 2, panels b and c. A high level of gene expression is reflected as a larger positive coefficient in a classifier model, while low or absent expression results in a negative coefficient. This is interpreted as, for example, if gene A is not in cell type 1, the presence of this gene in a sample decreases the probability for that sample to be cell type 1. For instance, E-cadherin (CDH1) was not detected in almost all monocyte samples and thus has a negative coefficient. Conversely, other genes are only expressed in certain cell types, which results in a high positive coefficient. For instance, CYP27B1, INHBA, IDO1, NUPR1, and UBD are only expressed by M1 macrophages and thus have high positive coefficients.
The differential expression among cell types suggests that the set of genes included in the classifier model may also be a good starting point for developing gene signatures, which is highlighted in Fig. 2d. Here, we focused on the expression of the 452 genes included in the classifier model and the correlations between samples clustered based on cell types. The off-diagonal entries in the correlation matrix are colored by euclidean distance with the color indicating similarity or dissimilarity using pink and blue, respectively. Color bars along the axes also highlight the cell types for the corresponding RNA-seq samples. As expected, RNA-seq samples from the same cell type were highly similar. More interestingly, correlation between different cell types can also be seen, like high similarity between CD4+ and CD8+ T cell samples, CD8+ T cell and NK cell samples, and monocyte and dendritic cell samples. Collectively, these heatmaps illustrate that the selected genes are a highly condensed but are still a representative set of genes that include the main characteristics of the immune cell types. It is also notable to compare the clustering result of cell types based on their coefficients in the classifier shown in Fig. 2b with similarity matrix in Fig. 2d. Since in the classifier coefficients are forcing the model to separate biologically close cell types (like CD4+ T cell and CD8+ T cell), the clustering results suggest that the coefficient vectors are equally dissimilar (Fig. 2b). However, in the case of their expression values, their similarity remains (Fig. 2d).

Evaluating the immune cell classifier using scRNA-seq datasets
To evaluate the proposed classifier in immune cell classification, two publicly accessible datasets generated by scRNA-seq technology were used [23,24]. The first dataset included malignant, immune, stromal and endothelial cells from 15 melanoma tissue samples [23]. We focused on the immune cell samples, which included 2761 annotated samples of T cells, B cells, Mphi and NK cells, and 294 unresolved samples. The immune cells in this study were recovered by flow cytometry by gating on CD45 positive cells. Annotations were on the basis of expressed marker genes while unresolved samples were from the CD45-gate and classified as non-malignant based on inferred copy number variation (CNV) patterns (i.e., CNV score <0.04).
Following pre-processing to filter and normalize the samples similar to the training step, the trained elasticnet logistic regression model was used to classify cells into one of the different immune subsets based on the reported scRNA-seq data with the results summarized in Fig. 3a. The inner pie chart shows the prior cell annotations reported by [23] and the outer chart shows the corresponding cell annotation predictions by our proposed classifier. Considering T cells as either CD4+ T cell or CD8+ T cell, the overall similarity between annotations provided by [23] and our classifier prediction is 96.2%. The distribution in cells types contained within the unresolved samples seemed to be slightly different than the annotated samples as we predicted the unresolved samples to be mainly CD8+ T cells and B cells.
The only cell type with low similarity between our classifier predictions and prior annotations was NK cells, where we classified almost half of samples annotated previously as NK cells as CD8+ T cell. Discriminating between these two cell types is challenging as they share many of the genes related to cytotoxic effector function and can also be subclassified into subsets, like CD56bright and CD56dim NK subsets [25]. To explore this discrepancy, we compared all annotated samples based on their CD8 score and NK score provided by the classifier, as shown in Fig. 3b. Although the number of NK cell samples are relatively low, it seems that the NK samples consist of two groups of samples: one with a higher likelihood of being a NK cell and a second with almost equal likelihood for being either CD8+ T cell or NK cell. We applied principal component analysis (PCA) to identify genes associated with this difference and used Enrichr for gene set enrichment [26,27]. Using gene sets associated with the Human Gene Atlas, the queried gene set was enriched for genes associated with CD56 NK cells, CD4+ T cell and CD8+ T cell. Collectively, the results suggests that the group of cells with similar score for NK and CD8 in the classifier model are Natural Killer T cells.
We also analyzed a second dataset that included 317 epithelial breast cancer cells, 175 immune cells and 23 non-carcinoma stromal cells, from 11 patients diagnosed with breast cancer [24]. We only considered samples annotated previously as immune cells, which were annotated as T cells, B cells, and myeloid samples by clustering the gene expression signatures using non-negative factorization. The scRNA-seq samples were similarly preprocessed and analyzed using the proposed classifier, with the results shown in Fig. 4. The inner pie chart shows the prior cell annotations reported by [24] and the outer chart shows the corresponding predicted cell annotation by our proposed classifier. Considering T cells as either CD4+ T cell or CD8+ T cell, 94.4% of reported T cells are predicted as the same cell type and other 5.6% is predicted to be DC or NK cells. However, for reported B cells and myeloid cells, we predicted relatively high portion of samples to be T cells ( 15.7% of B cells and 40% of myeloid cells). The rest of the myeloid samples were predicted to be macrophages or dendritic cells. Collectively, our proposed classifier agreed with many of the prior cell annotations and annotated many of the samples that were previously unresolved.

Developing a classifier for T helper cell subsets
To further apply this methodology to transcriptomic data, a separate classifier for distinguishing among T helper cells was developed using a similar approach to the immune cell classifier. We explored different values of the regression parameter lambda to find the optimal number of genes for this new dataset and visualized the performance of different lambdas by generating True-Negative samples using a bootstrapping approach whereby synthetic datasets were created by randomly resampling testing datasets. Original testing data that were completely untouched during training were used as True-Positive samples. The resulting True-Negative and True-Positive samples were used to generate ROC curves (Fig. 5a) and the AUC was used to score each lambda value. Generally, the lambda values for T helper cell classifier represents the trade-off between retaining genes and keeping the AUC  Table S1. The gene list was refined subsequently by developing a gene signature. Similar to the immune cell classifier, the coefficients of the selected genes for the T helper cell classifier correlated with their expression levels, as seen by comparing the heatmaps shown in Fig. 5, panels b and c. For instance, FUT7 has been expressed in almost all T helper cell samples except for iTreg that result in a negative coefficient for this cell type. In addition, there are sets of genes for each cell type that have large coefficients only for certain T helper cell subsets, like ALPK1, TBX21, IL12RB2, IFNG, RNF157 for Th1 that have low expression in other cells. As illustrated in Fig. 5d, the genes included in the classifier don't all uniquely associate with a single subset but collectively enable discriminating among T helper cell subsets. Interestingly, the T helper subsets stratified into two subgroups where naive T helper cells (Th0) and inducible T regulatory (iTreg) cells were more similar than effector type 1 (Th1), type 2 (Th2), and type 17 (Th17) T helper cells. Similar to the immune cell classifier, we also noted that the clustering of the classifier coefficients is different from what similarity matrix shows in Fig. 5d because the classifier coefficients aim to create a "classifying distance" among closely related cell types.
Finally by comparing the results of immune cell classifier with that of the T helper classifier, the intensity of differences among cell types can be seen in Figs. 2c and 5c. In the first figure you can find completely distinct set of genes in each cell type. Meanwhile, the gene sets in the second figure are not as distinct which could be due to the low number of samples or high biological similarity between T helper cell types.

Application of the classifiers
Clinical success of immune checkpoint inhibitors (ICI) for treating cancer coupled with technological advances in assaying the transcriptional signatures in individual cells, like scRNA-seq, has invigorated interest in characterizing the immune contexture within complex tissue microenvironments, like cancer. However as illustrated by the cell annotations reported by [24], identifying immune cell types from noisy scRNA-seq signatures using less biased methods remains an unsolved problem. To address this problem, we applied our newly developed classifiers to characterize the immune contexture in melanoma and explored differences in immune contexture that associate with immune checkpoint response. Of note, some patients  [24], which annotated scRNA-seq results by clustering by gene ontology terms using likelihood ratio test. Using the gene expression profile reported for each scRNA-seq sample, a new cell annotation was determined based on the closest match with the alternative cell signatures determined using elastic-net logistic regression, which is summarized in the outer pie chart with melanoma respond to ICIs durably but many others show resistance [28]. Specifically, we annotated immune cells in the melanoma scRNA-seq datasets [23,29] using our classifiers separately for each patient sample and ordered samples based on the treatment response, with the results shown in Fig. 6a, b. We used the percentage of cell type in each tumor sample as it was more informative and meaningful than using absolute cell numbers. It is notable that untreated and NoInfo samples likely include both ICI-resistant and ICI-sensitive tumors.
In comparing samples from resistant tumors to untreated tumors, we found interestingly that there are samples with high prevalence of NK in untreated tumors (Mel53, Mel81, and Mel82) while no samples in resistant tumors have a high prevalence of NK cells. The mentioned untreated tumors also have no or very low number of Th2 cells in their populations. In addition, untreated tumors have a more uniform distribution of immune cell types in contrast to ICI-resistant ones, which could reflect a therapeutic bias in immune cell prevalence in the tumor microenvironment due to ICI treatment.
Next, we combined the annotation data from both classifiers and applied PCA and clustering analysis, as shown in Fig. 6, panels c and d. Using scrambled data to determine principal components and their associated eigenvalues that are not generated by random chance (i.e., a negative control), we kept the first and second principal components that capture 68% and 21% of the total variance, respectively, and neglected other components that fell below the negative control of 8.4%. As it shown in Fig. 6c In hierarchical clustering, the optimal number of clusters was selected based on calculation of different cluster indices using the NbClust R package [30] which mainly identified two or three clusters as the optimal number. In considering three groupings of the hierarchical clustering results shown in Fig. 6d, seven out of eight ICIresistant samples clustered in first two clusters while the third cluster mainly contained untreated samples. The comparison of results from PCA and clustering analyses shows that the first cluster contained samples with extreme low value of PC1 which itself divided into two groups; one with extreme low value of PC2 and the other with higher amount of PC2. The second cluster located in highest amount of PC1 and lowest amount of PC2. All remained samples were clustered as third group, which were predominantly untreated samples. The difference in clustering suggests dissimilarities between ICI-resistant and untreated samples and the possibility of having ICIsensitive tumors in untreated samples. D single cell, gene signatures are commonly used to determine the prevalence of immune cell subsets within transcriptomic profiles of bulk tissue samples using deconvolution methods, called in silico cytometry [31].

Developing gene signatures
Leveraging the classifier results, we generated corresponding gene signatures using binary elastic-net logistic regression. Specifically, classifier genes with non-zero coefficients were used as initial features of the models, which were then regressed to the same training and testing datasets as used for developing the classifiers. Lambda values were selected for each immune and T helper cell subset based on similar method of lambda selection for classifiers and their values and corresponding AUC are shown in Additional file 2: Table S2. Finally, all generated signatures are summarized in Additional file 3: Table S3. We visualized the expression levels of the remaining set of genes, which at least occur in one gene signature, in Fig. 7. The expression of genes retained in immune cell signatures (Fig. 7a) and T helper cell signatures (Fig. 7b) were clustered by similarity in expression (rows) and by similarity in sample (columns). same cell type were mainly clustered together. The only exception is for macrophages (Mφ and M2) which can be attributed to high biological similarity and a low number of technical replicates for these cell types.
In general, the gene sets generated from the logistic regression model performed well with far fewer requisite genes in the testing set, a desirable result for a gene set intended to be used for immunophenotyping. In Fig. 8, the results of the benchmarking are shown separated by comparative gene set. Both the CIBERSORT and Single-Cell derived gene sets contain an average of 64 and 135 genes, respectively, while the logistic regression gene set contains an average of just 19. The new logistic regression gene set performed comparably to the existing contemporary gene sets and far exceeded the performance of the manually curated gene set used previously [6]. The benchmarking results indicate that the logistic regression gene sets are an improvement in efficacy over compact gene sets, such as those that are manually annotated or hand-picked. Meanwhile, the logistic regression gene sets also demonstrate an optimization of broader gene sets that contain too many genes for deep specificity when used in further analysis. The inclusion of too many genes in a set can dilute the real data across a constant level of noise, while including too few lacks the power to draw conclusions with high confidence. The logistic regression gene sets demonstrate a balance of these two issues through its highly refined selection of genes that can be fine-tuned using its lambda parameter.

Discussion
Recent developments in RNA sequencing enable a high fidelity view of the transcriptomic landscape associated with host immune response. Despite considerable progress in parsing this landscape using gene signatures, gaps remain in developing unbiased signatures for individual immune cell types from healthy donors using high dimensional RNA-seq data. Here, we developed two classifiers -one for immune cell subsets and one for T helper cell subsets -using elastic-net logistic regression with cross validation. The features of these classifiers were used as a starting point for generating gene signatures that captured with fifteen binary elastic-net logistic regression models the most relevant gene sets to distinguish among different immune cell types without including too much noise.
Gene signatures in previous studies have been developed and used mainly as a base for deconvoluting the tumor microenvironment to find the presence of immune cells from bulk RNA measures. Therefore, as the first step, determining cell-specific gene signatures critically influences the results of deconvolution methods [32]. Newman et al. defined gene signatures for immune cells using two-sided unequal variances t-test as base matrix for CIBERSORT [8]. In another study, Li et al. in developing TIMER, generated gene signatures for six immune cell types with selecting genes with expression levels that have a negative correlation with tumor purity [9]. More recently, Racle et al. developed a deconvolution tool based on RNA-seq data (EPIC) by pre-selecting genes based on ranking by fold change and then selected genes by manually curating and comparing the expression levels in blood and tumor microenvironment [10]. Finally, quanTIseq (the most recently developed tool for deconvolution) was developed for RNA-seq data based on the gene signatures generated by quantizing the expression levels into different bins and selecting high quantized genes for each cell type that have low or medium expression in other cell types [7]. Although all methods obtained high accuracy based on their developed signatures, a more rigorous and unbiased gene signature developed by RNA-seq data and precise feature selection methods can further improve the accuracy and validate the process for downstream analyses.
In addition, to identify cell types based on their transcriptome, clustering techniques have been used in many studies [33,34]. However, there are high variability levels of gene expression even in samples from the same cell type. Moreover, transcriptomics data has high dimensions (tens of thousands) and this is too complicated for clustering techniques as only few number of genes are discriminative. To overcome these problems some studies used supervised machine learning methods like Support Vector Machine (SVM) [35,36]. However, to the best of our knowledge, this paper is the first to apply two-step regularized logistic regression on RNA-seq transcriptomic of immune cells. This method increases the chance to capture the most discriminative set of genes for each cell type based on the power of an elastic-net [22]. In addition, using a two-step elastic net logistic regression enabled eliminating the most irrelevant genes while keeping the highest number of possible significant genes in the first step and more deeply selecting among them in the second step to generate robust gene signatures for immune cells.
Moreover, contemporary methods have only considered a limited number of immune cell types, and specifically T helper subsets as individual cell types have been neglected [23,24,29] in comprehensive studies. Therefore, the other novel aspect of this study is the separation of models for immune cells and T helper cells and development of gene signatures for a large number of immune cell types (fifteen different immune cell types) including different T helper cell subsets. The ability to identify a greater number of immune cell types enables studying immune system in different diseases in more depth. As we used publicly available RNA-seq datasets for immune cells and T helper cells, we acknowledge that our developed classifiers and gene signatures may be still constrained by the limited number of samples specifically for T helper cells. As more data describing the transcriptome of immune cells will become accessible, one can update the classifiers and gene signatures. Despite the limited number of samples used in the approach, the developed classifiers can even be applied to completely untouched and large datasets [23,24] that have been generated using scRNA-Seq technology which creates noisier data.

Conclusions
Here, we developed an immune cell classifier and classifier for T helper cell subsets along with gene signatures to distinguish among fifteen different immune cell types. Elastic-net logistic regression was used to generate classifiers with 10-fold cross-validation after normalizing and filtering two separate RNA-seq datasets that were generated using defined homogeneous cell populations. Subsequently, we generated gene signatures using a second step of binary regularized logistic regression applied to the RNA-seq data using previously selected classifier genes. As an external validation, the resulting classifiers accurately identified the type of immune cells in scRNA-seq datasets. Our classifiers and gene signatures can be considered for different downstream applications. First, the classifiers may be used to detect the type of immune cells in under explored bulk tissue samples profiled using RNA-seq and to verify the identity of immune cells annotated with low confidence. Second, the gene signatures could be used to study tumor micro-environments and the inter-dependence of immune response with cancer cell phenotypes, which is emerging to be an important clinical question.

Data acquisition
RNA-seq datasets for 15 different immune cell types including T helper cells, were obtained from ten different studies [37][38][39][40][41][42][43][44][45][46], which were publicly accessible via the Gene Expression Omnibus [47]. The list of samples is provided as Additional file 4: Table S4. The cell types were divided into two groups: immune cells that include B cells, CD4+ and CD8+ T cells, monocytes (Mono), neutrophils (Neu), natural killer (NK) cells, dendritic cells (DC), macrophage (Mφ), classically (M1) and alternatively (M2) activated macrophages, and the T helper cells that include Th1, Th2, Th17, Th0, and Regulatory T cells (Treg). The goal was to train the gene selection model on immune cell types, and CD4+ T cell subsets (T helper cells), separately. If these two groups of cells are analyzed together, many of the genes that potentially could be used to discriminate among T helper cell subsets might be eliminated as they overlap with genes associated with CD4+ T cells.
In short, a total of 233 samples were downloaded and divided into two sets of 185 and 48 samples, for immune cells and T helper cells, respectively. Moreover, immune cell samples were further divided into 108 training and 77 testing samples. Training and testing numbers for T helper samples were 31 and 17, respectively. Training and testing data include samples from all studies. For a verification dataset, scRNA-seq data derived from CD45+ cell samples obtained from breast cancer [24] and melanoma [23] were used with GEO accession numbers of GSE75688 and GSE72056, respectively.

Data normalization
The expression estimates provided by the individual studies were used, regardless of the underlying experimental and data processing methods (Additional file 4: Table S4). For developing individual gene signatures and cell classification models, we did not use raw data due to sample heterogeneity such as different experimental methods and data processing techniques used by different studies as well as differences across biological sources. Rather, we applied a multistep normalization process before training models. To eliminate obvious insignificant genes from our data, for immune cell samples, genes with expression values higher than or equal to five counts, in at least five samples were kept, otherwise, they were eliminated from the study. However, for T helper samples, due to fewer number of samples, four samples with values higher than or equal to five counts were enough to be considered in the study. After first step of filtering, the main normalization step was used to decrease dependency of expression estimates to transcript length and GC-content [48,49]. For all four sets of samples, including training and testing samples for immune cells and for T helper cells, expression estimates were normalized separately by applying withinLaneNormalization and between-LaneNormalization functions from EDASeq package [50] in the R programming language (R 3.5.3), to remove GCcontent biases and between-lane differences in count distributions [50]. After normalization, the second step of filtration, which was similar to the first step, was applied to eliminate genes with insignificant expression.

Missing values
In contrast to previous studies that only considered intersection genes [51] and to avoid deleting discriminative genes, we kept genes with high expression as much as possible. However, for most of genes, values for some samples were not reported. Hence, to deal with these missing values, we used an imputation method [52] and instead of mean imputation we set a dummy constant since mean imputation in this case is not meaningful and can increase error. Specifically, we generated a training set for each group of cell types, by duplicating the original training set 100 times and randomly eliminating ten percent of expression values. We next set -1 for all these missing values (both original missing values and those we eliminated) as a dummy constant because all values are positive and it is easier for the system to identify these values as noise. This approach makes the system learn to neglect a specific value (-1) and treat it like noise, instead of learning it as a feature of the samples.

Classifier training and testing
Considering the few number of training samples in comparison with the high dimensions (15453 genes in immune cell samples and 9146 genes in the T helper samples) and to avoid both over fitting the model and adding noise to the prediction model, we used regularization with logistic regression to decrease the total number of genes and select the most discriminative set of genes. To perform gene selection, we trained a lasso-ridge logistic regression (elastic-net) model, which automatically sets the coefficients of a large number of genes to zero and prunes the number of genes as features of the classifier. We cross-validated the model by implementing cv.glmnet function with nfold=10 from glmnet package [21] in R programming language, using training sets for both groups of cell types. We normalized the gene expression values using a log2 transform over training sets to decrease the range of values that can affect the performance of the model (log2(counts+1)). In order to find the optimal number of genes, we tried seven different lambdas and tested the results over the testing samples (cv.glmnet(family="multinomial", alpha=0.93, thresh=1e-07, lambda=c(0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001), type.multinomial="grouped", nfolds=10)). To select the optimal value for lambda, True-Negative samples were generated using a bootstrapping approach that randomly samples testing datasets to create a synthetic dataset with similar size and complexity but without underlying biological correlation, then we generated ROC curves and considered original testing datasets as True-Positive samples.

Developing gene signatures
Genes selected by the classifier models were used as initial sets to build gene signatures. In this case, we trained a new binary elastic-net model for each cell type by considering a certain cell type as one class and all other cell types as another class. The training and testing samples used to build gene signatures were the training and testing samples used in developing the classifiers with the difference being that they only contained the selected genes. Similar steps including dealing with missing values, applying log2 and visualization by ROC to select optimal number of genes were applied for each cell type. This two-step gene selection approach has the advantage that it eliminates a large number of undiscriminating genes at the first and finally select few number of genes for each cell type.

Benchmarking
Fisher exact testing was used for each gene set to characterize true and systematically scrambled data as a measure of performance of the gene set as a means of distinguishing between cell subtypes. In order to establish negative control values for determining specificity, a bootstrapping approach was used [53], where data was scrambled by randomly resampling with replacement expression values by gene as well as by patient to create a synthetic dataset with a similar size and complexity of the original dataset. The threshold for expression binarization for Fisher exact testing was selected based on gene expression histograms of the data to separate the measured expression from background noise levels, with 2.48 being used as the threshold (after log2 normalization). One-thousand iterations (N boot ) were processed and compiled in order to produce ROC curves with 95% confidence intervals shaded about the averaged ROC curve for each gene set's performance. A bootstrapping approach for generating a negative control sample is appropriate when a sufficiently large bootstrap sample (i.e., N boot ≥ 1000) and the original dataset is sufficiently diverse (i.e., N data ≥ 30) [54]. The tested gene sets were the logistic regression gene set, the CIBERSORT gene set [8], the single cell gene set [29], and the manually curated gene set that had been used previously [6].