DeepGSEA: explainable deep gene set enrichment analysis for single-cell transcriptomic data

Abstract

Motivation: Gene set enrichment (GSE) analysis allows for the interpretation of gene expression through pre-defined gene set databases and is a critical step in understanding different phenotypes. With the rapid development of single-cell RNA sequencing (scRNA-seq) technology, GSE analysis can be performed on fine-grained gene expression data to gain a nuanced understanding of phenotypes of interest. However, given the cellular heterogeneity in single-cell gene profiles, current statistical GSE analysis methods sometimes fail to identify enriched gene sets. Meanwhile, deep learning has gained traction in applications like clustering and trajectory inference in single-cell studies due to its prowess in capturing complex data patterns; however, its use in GSE analysis remains limited by interpretability challenges.

Results: In this paper, we present DeepGSEA, an explainable deep gene set enrichment analysis approach that leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE. DeepGSEA learns to capture GSE information through our designed classification tasks, and significance tests can be performed on each gene set, enabling the identification of enriched sets. The underlying distribution of a gene set learned by DeepGSEA can be explicitly visualized using the encoded cell and cellular prototype embeddings. We demonstrate the performance of DeepGSEA over commonly used GSE analysis methods by examining their sensitivity and specificity in four simulation studies. In addition, we test our model on three real scRNA-seq datasets and illustrate the interpretability of DeepGSEA by showing how its results can be explained.

Availability and implementation: https://github.com/Teddy-XiongGZ/DeepGSEA


A Mitigation for Overpowered Tests
A statistical test is considered overpowered if it is extremely sensitive in detecting small effects that might not be practically significant. This issue is particularly common with large sample sizes. For the Mann-Whitney U (MWU) test, effect sizes such as the Rank-Biserial Correlation (ρ_RB) are often computed to measure the magnitude of a difference independently of sample size, and the importance of a specific effect size value should depend on the context of the study and the field of research. Therefore, given a required effect size, we can carefully design the sample size in the experiment to guarantee that the identified significant gene sets make biological sense with a significant magnitude of group-wise differences.
In the context of MWU tests, when sample sizes are large, the distribution of the Mann-Whitney U statistic U can be approximated by the normal distribution

U ~ N( n_1 n_2 / 2 , n_1 n_2 (n_1 + n_2 + 1) / 12 ),   (21)

where n_1 and n_2 are the sample sizes of the two tested groups. We can determine the threshold for the Mann-Whitney U statistic U at the significance level α as

U_α = n_1 n_2 / 2 + Φ^{-1}(α) √( n_1 n_2 (n_1 + n_2 + 1) / 12 ),   (22)

where Φ is the cumulative distribution function of the standard normal distribution and Φ^{-1} is its inverse. Here we take the one-sided alternative corresponding to the test we use in practice (Formula 19). In the study of GSE analysis, if we set the minimal value of ρ_RB for the MWU test as r, then for any U < U_α there should be

ρ_RB = 1 − 2U / (n_1 n_2) ≥ r.   (23)

Formula 23 can be satisfied for ∀U < U_α if and only if

1 − 2U_α / (n_1 n_2) ≥ r.   (24)

According to Happ et al. (2019), the balanced design (n_1 = n_2) is optimal or close to optimal in most cases, which guides us to simplify Formula 24 by letting n_1 = n_2 = n. Thus, we can reformulate the constraint as

−Φ^{-1}(α) √( (2n + 1) / 12 ) ≥ r n / 2,   (25)

which gives

n ≤ ( z² + |z| √( z² + 3r² ) ) / ( 3r² ),  where z = Φ^{-1}(α).   (26)

This result provides a practical guide for the design of the sample size in experiments with a pre-defined significance level α and a minimal Rank-Biserial Correlation r for the effect size. Defining the upper bound of n in Formula 26 as n_max, we present some examples of the value of n_max in Table 2. For DeepGSEA, we can downsample cells for each phenotype from the test set of the scRNA-seq data according to n_max, which guarantees an effective magnitude of group differences for identified significant gene sets. In our experiments, we set α = 0.05 and r = 0.25, and downsampled 30 cells from each phenotype group in the test set for statistical tests.
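The bound in Formula 26 is straightforward to evaluate numerically. The following is a minimal sketch (the function name n_max is ours) using only the Python standard library; depending on the rounding convention, the integer it returns may differ by one from the tabulated value:

```python
from math import sqrt
from statistics import NormalDist

def n_max(alpha: float, r: float) -> int:
    """Largest balanced per-group size n such that any one-sided MWU
    rejection at level alpha implies a Rank-Biserial Correlation >= r."""
    z = abs(NormalDist().inv_cdf(alpha))  # |Phi^{-1}(alpha)|
    # Positive root of 3 r^2 n^2 - 2 z^2 n - z^2 = 0 (Formula 26), floored.
    return int((z * z + z * sqrt(z * z + 3 * r * r)) / (3 * r * r))

print(n_max(0.05, 0.25))  # close to the ~30 cells per group used in our tests
```

Tightening the required effect size r (or raising α) shrinks the admissible sample size, which matches the intuition that larger samples let arbitrarily small effects reach significance.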

B Pseudocode of DeepGSEA
Algorithm 1 presents the pseudocode of using DeepGSEA for GSE analysis. A 5-fold cross-validation is implemented for all experiments in our study.
4: X_t, y_t, q_t ← the t-th fold of X, y, q
5: if q is None then
6:     // Train the model with M on samples in the remaining K − 1 folds using L (Formula 17)
7: else if q is not None then
8:     // Train the model with M on samples in the remaining K − 1 folds using L′ (Formula 18)
9: end if
10: s_{1,t}, · · · , s_{T,t} ← the gene set similarities of all test cells on the T gene sets
11: for j in 1, 2, · · · , T do
12:     p_{j,t} ← Run the Mann-Whitney U test and Fisher's method on s_{j,t} (Formula 20)
13: end for
14: end for
15: for j in 1, 2, · · · , T do
16:     p_j ← Run Pearson's method on p_{j,1}, p_{j,2}, · · · , p_{j,K}
17: end for
18: p_adj^1, · · · , p_adj^T ← Adjust p_1, · · · , p_T for multiple testing
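The statistical-testing half of Algorithm 1 (per-fold MWU tests combined by Fisher's method, then Pearson's method across folds) can be sketched with SciPy as below. This is our illustration, not the released implementation; in particular, we assume the MWU tests are run pairwise over phenotype groups:

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu, combine_pvalues

def gene_set_pvalue(sims_per_fold):
    """sims_per_fold: list of length K; each entry maps a phenotype name to a
    1-D array of gene set similarity scores for that fold's test cells."""
    fold_ps = []
    for sims in sims_per_fold:
        # Pairwise MWU tests between phenotype groups, combined by Fisher.
        pair_ps = [mannwhitneyu(sims[a], sims[b]).pvalue
                   for a, b in combinations(sorted(sims), 2)]
        fold_ps.append(combine_pvalues(pair_ps, method="fisher").pvalue)
    # Combine the K per-fold p-values with Pearson's method.
    return combine_pvalues(fold_ps, method="pearson").pvalue

rng = np.random.default_rng(0)
folds = [{"mock": rng.normal(0, 1, 30), "infected": rng.normal(1, 1, 30)}
         for _ in range(5)]
print(gene_set_pvalue(folds))  # small p-value: similarity scores differ
```

With only two phenotypes, Fisher's step reduces to the single pairwise p-value, so the sketch degenerates gracefully to the two-group case.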

C.1 Design of Simulated Datasets
"HALLMARK_COMPLEMENT" is selected as the enriched gene set for the sensitivity test in our simulation study. For the specificity test, we select "HALLMARK_NOTCH_SIGNALING", which shares no genes with "HALLMARK_COMPLEMENT". Following Bibby et al. (2022), we first use Splatter to estimate simulation parameters from their real scRNA-seq data. The definitions of some key parameters are as follows:
• batchCell: the number of cells in each batch. In our simulations, it corresponds to the number of all cell samples in a simulated scRNA-seq dataset.
• group.prob: a vector giving the probability that cells will be assigned to groups. In our study, group.prob is a vector with non-negative entries whose sum is 1.
• de.prob: the probability that a gene will be selected to be differentially expressed.
• de.facLoc: differential expression factors are produced from a log-normal distribution; changing this parameter can result in more or less extreme differences between groups.
C.2 Description and Pre-processing of Real-world Datasets
Glioblastoma (Zhao et al., 2019) is a scRNA-seq dataset that contains the transcriptomic data of cells from a chromosomally unstable glioblastoma cancer stem cell line (cancer). It also includes cells from a diploid neural stem cell line (neural) as normal controls. Since this Glioblastoma dataset contains only 59 cancer cells and 75 neural ones, it is used to evaluate how DeepGSEA performs on scRNA-seq datasets with a small sample size. Influenza (Ramos et al., 2019) is a scRNA-seq dataset with gene profiles of A549 cells undergoing either a mock infection (4505 cells) or infection by the influenza strain PR8-NS1-GFP at MOIs of 0.2 (7377 cells) and 2.0 (6758 cells). We use this dataset to evaluate our model's performance when dealing with multiple cell conditions. Downsampling of the dataset is performed to accelerate the GSE analysis and to evaluate the performance of DeepGSEA on a subset of cells; specifically, 500 cells are sampled from each group.
The Alzheimer (Zeng et al., 2023) dataset has scRNA-seq profiles of 8186 cells collected from the brain tissues of mice with Alzheimer's disease, along with 8506 cells from the control group. Cell type information is provided in this dataset, so it can be used to test how DeepGSEA leverages the additional knowledge of cell types to better examine and interpret the enrichment of gene sets.
Since the real-world datasets in our experiments were processed differently prior to publication, we apply different preprocessing to each dataset. For the Glioblastoma dataset, as it was already log-transformed, we only scale it to unit variance and zero mean and truncate values greater than 10. The cell counts in the Influenza dataset have been normalized, so we apply log-transformation and scaling as its preprocessing. We do not perform any additional preprocessing for the Alzheimer dataset, because the provided data have already been standardized.
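As an illustration, the Glioblastoma-style step (gene-wise standardization of already log-transformed values, followed by truncation at 10) can be sketched as follows; the function name is ours:

```python
import numpy as np

def scale_and_clip(log_expr: np.ndarray, clip: float = 10.0) -> np.ndarray:
    """Gene-wise standardization of an already log-transformed cell x gene
    matrix to zero mean and unit variance, truncating values above `clip`."""
    mu = log_expr.mean(axis=0)
    sd = log_expr.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant genes at zero rather than divide by 0
    return np.minimum((log_expr - mu) / sd, clip)

# Toy stand-in for log-transformed counts (200 cells x 50 genes).
rng = np.random.default_rng(0)
logged = np.log1p(rng.poisson(5.0, size=(200, 50)).astype(float))
scaled = scale_and_clip(logged)
```

The same helper, preceded by a log1p transform, covers the Influenza-style preprocessing as well.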

C.3 Implementation Details of GSE Methods
In our experiments, DeepGSEA models are built with a three-layer shared encoder, and the gene set heads are one-layer fully connected neural networks. The hidden dimension h_dim is set to 64 and the dimension of latent embeddings z_dim is set to 32. The source code of DeepGSEA is available at https://github.com/Teddy-XiongGZ/DeepGSEA. For the compared baselines, we implement AUCell, GSVA, ssGSEA, Vision, and Z scoring with scTPA (Zhang et al., 2020). SCPA (Bibby et al., 2022) and scGSEA (Franchini et al., 2023) are tested using their official implementations. The original scGSEA is designed to decompose the normalized gene counts into latent factors with Non-negative Matrix Factorization (NMF) and to return p-values for each factor in GSE analysis. To compare it with other approaches that return only one p-value, we take the minimum p-value among all factors as the result of its enrichment analysis on a gene set. GSEA is implemented with GSEApy (Fang et al., 2023).

D Discussion on Model Efficiency
In this section, we explore the efficiency and scalability of DeepGSEA for analyzing large-scale scRNA-seq datasets. In the context of scRNA-seq, the growing numbers of gene sets and cell samples are the two typical issues that GSE methods need to handle. As shown in Fig. 1, DeepGSEA benefits from its backbone-head architecture, which reduces the parameter count during encoding. For example, on the Alzheimer dataset, which contains 2766 genes and 1722 GO terms as gene sets, the backbone encoder in DeepGSEA includes 181k parameters and its gene set heads contain 2.1k × 1722 ≈ 3.6M parameters. Notably, while the total parameter count of DeepGSEA increases linearly with the number of gene sets, the growth comes only from the lightweight linear prediction heads rather than from the entire encoding network.
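The parameter counts above can be reproduced with simple arithmetic. The layer layout below (fully connected layers of width h_dim = 64 feeding one linear head per gene set with output z_dim = 32) is our reading of the architecture and reproduces the stated figures; the released code is authoritative:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out

n_genes, h_dim, z_dim, n_sets = 2766, 64, 32, 1722

# Shared backbone: genes -> h_dim, then h_dim -> h_dim.
backbone = linear_params(n_genes, h_dim) + linear_params(h_dim, h_dim)
# One linear head per gene set: h_dim -> z_dim.
head = linear_params(h_dim, z_dim)

print(backbone)       # about 181k shared parameters
print(head)           # about 2.1k per gene set
print(head * n_sets)  # about 3.6M across all 1722 heads
```

The key point is that the dominant genes → h_dim projection is paid once in the backbone, while each additional gene set only adds a ~2.1k-parameter head.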
Additionally, managing the sample size in scRNA-seq data is a critical factor for efficiency in large-scale GSE analysis. Downsampling a subset of cells for testing can effectively mitigate this scalability problem and has been a commonly employed practice in the existing literature (Bibby et al., 2022). As described in Appendix C.2, we downsample 500 cells from each group in the Influenza dataset, which significantly accelerates the analysis without sacrificing the effectiveness of the GSE analysis, as shown in Appendix E.3.

E.1 Additional Specificity Test on Simulated Data
In Section 4.2 we demonstrate that the specificity of DeepGSEA is comparable to that of other methods by analyzing a single irrelevant gene set that shares no genes with the enriched gene set. To further assess its specificity on a broader scale, we randomly sample gene sets of varying sizes from the background noise. Let G represent the set of all genes in the simulated data and R the set of relevant genes in the enriched gene set. We randomly sample 100 gene sets of size S from G\R, where S varies over the values {30, 40, 50, 60, 70, 80, 90, 100}. To ensure the integrity of this specificity test, we reduce the "de.prob" parameter for the background data to 0.04. This adjustment ensures that any gene sets identified in this test are indicative of false discoveries rather than non-trivial enrichments. We evaluate DeepGSEA against all methods discussed in Section 4.2, excluding scGSEA due to its excessive computational time requirements for this experiment. The false positive rates (FPRs) and true negative rates (TNRs) for these approaches are presented in Table 4. Although DeepGSEA does not have the lowest FPR, its specificity is on par with that of other commonly used GSE methods, confirming its applicability in real-world settings. Moreover, its enhanced sensitivity, as detailed in Section 4.2, further underscores its utility in GSE analysis.
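The sampling scheme for these null gene sets can be sketched as follows (a minimal illustration; the function name and toy gene identifiers are ours):

```python
import random

def sample_null_gene_sets(all_genes, relevant_genes, sizes,
                          n_sets=100, seed=0):
    """Draw n_sets random gene sets of each size in `sizes` from G \\ R,
    i.e. from genes outside the enriched set, for the specificity test."""
    pool = sorted(set(all_genes) - set(relevant_genes))
    rng = random.Random(seed)
    return {s: [rng.sample(pool, s) for _ in range(n_sets)] for s in sizes}

null_sets = sample_null_gene_sets(
    all_genes=[f"g{i}" for i in range(2000)],       # toy stand-in for G
    relevant_genes=[f"g{i}" for i in range(60)],    # toy stand-in for R
    sizes=range(30, 101, 10),                       # S in {30, 40, ..., 100}
)
```

Sorting the pool before sampling makes the draws reproducible regardless of the iteration order of the input gene collections.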

E.3 DeepGSEA's Interpretation on Influenza Data.
DeepGSEA's ability to interpret gene sets with visualization can be beneficial in the analysis of data with multiple phenotypes, as it shows how a gene set is enriched across different groups. Fig. 9(a) and 9(b) present the visualized high-dimensional spaces of "translation", with the gene set features and the gene set embeddings of cells (in Fig. 1) as the data points, respectively. While cells are mixed in the original high-dimensional space of pre-processed gene expression, they can be well separated in the latent space by utilizing the gene set information. Although three clusters are clearly identified in Fig. 9(b), we can observe that the mock group and the MOI 0.2 group are close to each other with some overlap, while MOI 2.0 is distant from both of them in the latent space, indicating that genes in cells infected by the influenza virus at an MOI of 2.0 are more differentially expressed relative to the mock cells than those in cells infected at an MOI of 0.2. The visualization of another enriched gene set is shown in Fig. 9(c). Unlike "translation", on which cells of the three groups can be distinguished, the pathway "rho gtpases activate cit" only separates the mock group from the other two in the latent space, while cells in the two infected groups are mixed and cannot be separated using this gene set.
Additionally, the phenotype similarities of each cell can be visualized in the same figure, with colors indicating the estimated similarities. Fig. 10 shows the similarities of all cells to each phenotype. We can see that cells around each prototype are correctly recognized by the model with high similarity to the corresponding phenotype, whereas cells located at the boundary of "Mock" and "MOI0.2" have lower similarity to both groups, as they are less typical and intermixed. The visualization of DeepGSEA's similarity estimation provides an intuitive understanding of the underlying distributions learned by the model.

E.4 DeepGSEA's Classification on Real scRNA-seq Datasets
As Formula 5 shows, the final prediction is a simple linear combination of information from different gene sets, so the performance of cell classification can reflect how well the model captures phenotype knowledge from gene sets. Table 5 presents DeepGSEA's phenotype classification performance on cells in the test sets (Formula 5) across datasets and gene set databases. We observe that DeepGSEA does effectively capture phenotype knowledge from gene sets and distinguish different cell groups, as evidenced by an auROC score over 0.89 on all datasets. For the Glioblastoma dataset, both GO and Pathway are informative enough to explain variations in gene expression of cells with different phenotypes. However, models with GO or Pathway behave differently on the Influenza and Alzheimer datasets: while Pathway is better suited to the Influenza dataset than GO, DeepGSEA with GO outperforms the one with Pathway on the Alzheimer dataset. In general, DeepGSEA's performance in classifying a dataset using a given gene set database can reflect the appropriateness of that database for the analysis.

E.5 Comparison of GSE Discoveries on Real scRNA-seq Datasets
As Table 5 suggests that using Pathway to analyze phenotype differences in the Alzheimer dataset is the most challenging task considered, we test some representative GSE analysis methods on this task and compare their GSE discoveries. Table 6 shows the number of enriched gene sets out of 1718 candidates identified by different methods. It can be observed from the table that ssGSEA and GSVA are overly sensitive to group differences in large-scale datasets, as most of the candidate gene sets are identified as significant by these two approaches. In comparison, DeepGSEA and SCPA provide more useful outputs when dealing with large data, with a reasonable number of identified gene sets for further analysis. SCPA downsamples each group whose size exceeds 500. However, this may affect its performance in GSE analysis, for the relevant patterns may not be explicitly present in the sampled data. We also tried to run SCPA without subsampling on this dataset but failed due to its extremely long running time. By learning data patterns from the whole training set and performing statistical tests on a subset of the test data with the proposed mitigation for overpowered testing, DeepGSEA can both learn adequate knowledge from abundant samples and identify a reasonable number of significant gene sets with a guarantee on the minimal difference magnitude, which is useful in real-world applications. For example, "collagen degradation" is a gene set identified by DeepGSEA with a p-value of 0.049. It has been validated that the dysregulation of collagen metabolism has a great impact on human health and is further related to Alzheimer's disease (Tang et al., 2020). However, the gene set has a p-value of 1 in the test from SCPA, indicating that the method completely failed to identify the significance of "collagen degradation" in this study.

E.6 Ablation Study on Model Design
We perform ablation studies on the Alzheimer dataset, as it is the most challenging one in our study and can reflect how the results change with variations in the design. As mentioned in the experiment section, DeepGSEA is trained with the cell type labels given by the Alzheimer dataset. To evaluate the performance of DeepGSEA without cell type information, we train another model using Formula 17 instead of Formula 18 for model tuning, which we term DeepGSEA_no_label. In addition, DeepGSEA_one_proto is trained as a model variant that learns only one prototype for the distribution of each phenotype. A variant called DeepGSEA_one_stage is trained to test the performance of the model without the two-stage training paradigm described in the methodology. DeepGSEA_one_gene_set is a model trained only with the gene set "generation of neurons", which is used to test whether the model benefits from the proposed backbone-head architecture. All model variants except DeepGSEA_one_gene_set are trained with GO as the source of gene sets.
Table 7 shows the classification performance of the model variants, where we examine each model's ability to classify cells based on either the GO term "generation of neurons" or all gene sets. Without the additional biological knowledge of cell types, DeepGSEA_no_label and DeepGSEA_one_proto achieve performance comparable to the full version of DeepGSEA in phenotype classification. However, Fig. 11(a) and 11(b) show that DeepGSEA_no_label tends to group cells of the same phenotype together in the latent space while erasing their cell type characteristics. A similar phenomenon appears in Fig. 11(c) and 11(d), where the distributions of different phenotypes are further squashed in the latent space because there is only one prototype per phenotype.
In contrast, the cell types are better reflected in the visualized latent spaces of DeepGSEA_one_stage (Fig. 12(b)) and DeepGSEA_one_gene_set (Fig. 12(d)), since cell type annotations are utilized in their training. However, according to Table 7, DeepGSEA_one_stage underperforms the original model when using all gene sets for prediction, as the weights combining gene sets may be incorrectly learned when the gene set encoders are not well trained. DeepGSEA_one_gene_set performs even worse than all other model variants in both cases, suggesting that the backbone-head architecture in DeepGSEA does help the model to better encode each gene set by sharing knowledge across gene sets.

E.7 Ablation Study on Hyperparameter Tuning
We also perform ablation studies on the selection of the hyperparameters λ_3, λ_4, and λ_5, which are used to regularize the latent space and improve the quality of the learned prototypes. The Influenza dataset is chosen, as it is simpler than the Alzheimer dataset, so we can clearly identify differences in the learned prototypes when perturbing the hyperparameters. We set the number of prototypes for each phenotype to 2 to examine the effect of different λ_3 on the learned latent space. Our default setting for these hyperparameters is λ_3 = λ_4 = λ_5 = 1. Fig. 13 shows how the distances between prototypes change with the selection of λ_3; a λ_3 that is either too small or too large results in unwanted outcomes. The results for variations of λ_4 are presented in Fig. 14, which shows that the cells move toward the corresponding prototypes as λ_4 gets larger, though the estimated Gaussian distributions may be quite unnatural if λ_4 is too large. Fig. 15 shows the change of the learned latent spaces with respect to different λ_5. As the loss related to λ_5 guides the movement of prototypes, by comparing the visualized latent spaces in Fig. 15, we can observe that the prototypes of the "Mock" group become more distant from those of the "MOI0.2" group as λ_5 grows. In addition, by comparing λ_5 = 0 with λ_5 ≠ 0, we notice that the prototypes tend to move to the centers of regions with higher cell density. However, if λ_5 is too large, the model may fail to capture subtle differences between cells at the boundaries of two subpopulations, as the prototypes are largely determined by the center of the whole group, which can result in a loss of detail at the boundaries.

Algorithm 1: DeepGSEA for the analysis of gene set enrichment
Input: The scRNA-seq profiles of n cells X = {x_1, · · · , x_n} and the corresponding group labels y = {y_1, · · · , y_n}. Additional labels for the cells q = {q_1, · · · , q_n}. Pathway masks for T gene sets M = {m_1, · · · , m_T}. Number of folds K.
Output: The adjusted p-values for the enrichment of the T gene sets {p_adj^1, · · · , p_adj^T}.

Fig. 8: Heatmap of log gene counts of associated genes in the pathway (columns) for cells (rows) that are close to each prototype in the latent space of "acid secretion", where blue corresponds to low expression and yellow to high expression.

Fig. 12: Visualized latent spaces learned by different model variants with color annotations: (a) phenotype distribution in DeepGSEA_one_stage, (b) cell type distribution in DeepGSEA_one_stage, (c) phenotype distribution in DeepGSEA_one_gene_set, (d) cell type distribution in DeepGSEA_one_gene_set.

Table 2. Examples of n_max in different cases. α is the pre-defined significance level; r is the minimal accepted value of the Rank-Biserial Correlation.

Table 3. Default parameters for the simulated background data.

Table 4. False positive rates (FPRs) and true negative rates (TNRs) for different methods on randomly sampled gene sets of various sizes.

Table 5. DeepGSEA's phenotype classification performance on real-world datasets (averaged over five folds). The classification metrics reflect how well the gene sets in the database can be used to explain the phenotypes of interest.

Table 6. Number of enriched gene sets from Pathway identified by different GSE methods on the Alzheimer dataset.

Table 7. Classification performance (auROC) of different model variants on the Alzheimer dataset.