DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data

Abstract Background Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic datasets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Findings Here, we present DrivAER, a machine learning approach for the identification of driving transcriptional programs using autoencoder-based relevance scores. DrivAER scores annotated gene sets on the basis of their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. DrivAER iteratively evaluates the information content of each gene set with respect to the outcome variable using autoencoders. We benchmark our method using extensive simulation analysis as well as comparison to existing methods for functional interpretation of scRNA-seq data. Furthermore, we demonstrate that DrivAER extracts key pathways and transcription factors that regulate complex biological processes from scRNA-seq data. Conclusions By quantifying the relevance of annotated gene sets with respect to specified outcome variables, DrivAER greatly enhances our ability to understand the underlying molecular mechanisms.


sequencing data
We thank the four reviewers for their valuable comments on our manuscript. In the revision, we have addressed each comment and revised the manuscript accordingly. Our point-by-point responses to reviewers are provided in the Supplementary Materials (in blue, quotations from the revised manuscript in green). The edited text is marked in red in the revised manuscript. Background: Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps.
Findings: Here, we present DrivAER, a machine learning approach for the identification of Driving transcriptional programs using AutoEncoder based Relevance scores. DrivAER scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. DrivAER iteratively evaluates the information content of each gene set with respect to the outcome variable using autoencoders. We benchmark our method using extensive simulation analysis as well as comparison to existing methods for functional interpretation of scRNA-seq data. Furthermore, we demonstrate that DrivAER extracts key pathways and transcription factors that regulate complex biological processes from scRNA-seq data.
Conclusions: By quantifying the relevance of annotated gene sets with respect to specified outcome variables, DrivAER greatly enhances our ability to understand the underlying molecular mechanisms.

Keywords
Autoencoder, machine learning, manifold interpretation, single-cell RNA sequencing, transcription factor 3 Findings Background Single cell RNA sequencing (scRNA-seq) experiments dissect biological processes or complex tissues at the cellular and molecular levels (Trapnell 2015;Hwang, Lee, and Bang 2018). Due to the high complexity and large number of observations, one critical step in scRNAseq analysis is dimension reduction (Luecken and Theis 2019). Dimension reduction projects the high dimensional expression matrix into a low dimensional space, also called data manifold or cellular map, which captures the underlying biological processes (Moon et al. 2018). A number of methods have been used for manifold learning in scRNA-seq data (Coifman et al. 2005;Haghverdi, Buettner, and Theis 2015;van der Maaten and Hinton 2008;Lopez et al. 2018;Eraslan et al. 2019;McInnes et al. 2018).
Biological meaning can be extracted from the data manifold following in-depth analysis.
After cells are stratified into separate groups or along a continuum, differential expression analysis is performed. Gene set enrichment analysis represents one of the most popular approaches to interpreting lists of differentially expressed genes and has been frequently used on bulk RNA-seq data (Subramanian et al. 2005;Wu and Smyth 2012;Goeman and Bühlmann 2007). More recent work has adapted this approach to scRNA-seq data (Finak et al. 2015).
Additional tools for biological interpretation of scRNA-seq data focus on the identification of latent variation that is aligned with gene sets (Fan et al. 2016;Buettner et al. 2017;Martignetti et al. 2016;Risso et al. 2018).
However, choosing the best parameters to identify differentially expressed genes across diverse scRNA-seq datasets is still an open challenge (Wang et al. 2019). Moreover, subtle transcriptional signals driven by a specific set of genes may not be sufficiently reflected in the global data manifold. Therefore, there is a need for methods that facilitate biological 4 interpretation without performing differential expression analysis that capture subtle transcriptional signals driven by knowledge-based annotated gene sets.
Here, we present DrivAER, a method for the identification of Driving transcriptional programs based on AutoEncoder derived Relevance scores. Transcriptional programs (TPs) are sets of genes sharing biological properties (Heimberg et al. 2016) such as genes sharing transcription factor binding motifs or genes involved in the same biological pathway (Kanehisa 2000;Subramanian et al. 2005). TPs have been annotated extensively and DrivAER infers TP relevance scores for existing gene set annotations with respect to specified outcomes of interest. These outcomes can represent extrinsic phenotypes, such as disease status, or intrinsic phenotypes derived from the data itself, such as pseudotemporal trajectories.
Relevance scores allow researchers to rank TPs and help explain the underlying molecular mechanisms.
We evaluated DrivAER by application to two publicly available scRNA-seq datasets and comparison to two competing methods called VISION (DeTomaso et al. 2019) and PAGODA (Fan et al. 2016). Our results demonstrate that DrivAER correctly extracts well-known regulators from complex scRNA-seq datasets profiling interferon stimulation and blood development.
Moreover, DrivAER outperforms existing methods when subtle transcriptional signals are present. Our user-friendly tool integrates smoothly downstream of the popular scRNA-seq analysis framework Scanpy (Wolf, Angerer, and Theis 2018).

DrivAER correctly identifies interferon response
DrivAER is based on one assumption: the data manifold of relevant TPs shares information with the outcome of interest. Irrelevant TPs, on the other hand, will generate data manifolds where the cells fall randomly with respect to the outcome of interest. DrivAER builds 5 upon our Deep Count Autoencoder (DCA) method (Eraslan et al. 2019), which has been shown to achieve high scalability for large scRNA-seq data (Sun et al. 2019). DrivAER iteratively applies DCA to the raw counts of each annotated TP-specific gene set to generate a twodimensional data manifold in an unsupervised manner (Fig. 1ab). Next, we associate the resulting manifold coordinates with the outcome of interest using random forest models (Fig.   1c). We interpret the random forest accuracy as relevance score, which quantifies the amount of information that is shared between the TP-specific data manifold and the outcome of interest ( Fig. 1d).
To demonstrate the ability of DrivAER to perform correct manifold interpretation, we reanalyzed two publicly available scRNA-seq data sets. The first data set by Kang et al (Kang et al. 2018) described a transcriptional response to interferon stimulation (Fig. 1e). As a proof of principle, we asked if DrivAER could recapitulate this biology and extract the interferon signature as the driving transcriptional program defining the T cell data manifold (Fig. 1f). We applied DrivAER to the subset of T cells and evaluated all 50 Hallmark gene sets from MolSigDB (Liberzon et al. 2015) with respect to interferon stimulation (Table S1). Indeed, the "INTERFERON_GAMMA_RESPONSE" gene set received the highest relevance score (Fig. 1g) out of all 50 gene sets included in the analysis. Visualization of the T cell DCA embedding derived from the "INTERFERON_GAMMA_RESPONSE" gene set showed clear separation by condition ( Fig. 1h), implicating that this gene set is the main driving force separating the stimulated and unstimulated T cells. As a negative control, we show the DCA embedding for one of the lowest scoring gene sets "PROTEIN_SECRETION" (Fig. 1i). For this gene set, the cells cluster randomly with respect to the stimulation status. The heatmap in Figure 1j shows the expression levels of T cells for the "INTERFERON_GAMMA_RESPONSE" gene set. It is important to note that the cells (columns) are ordered by stimulation status and that the DCA coordinates are strongly associated with the stimulation status. Most "INTERFERON_GAMMA_RESPONSE" genes are upregulated in stimulated compared to unstimulated cells. Expression of genes in the "PROTEIN_SECRETION" gene set shows a random pattern (Fig. S1).
To further manifest the biological meaning of the DCA embedding, we visualized the expression of interferon marker IFIT2 in the "INTERFERON_GAMMA_RESPONSE" (Fig. 1k) and "PROTEIN_SECRETION" (Fig. 1l)  Therefore, the expression levels of the target genes represent a better proxy of TF activity compared to the expression level of the TF itself (Schacht et al. 2014).
To demonstrate the utility of DrivAER, we use a collection of TF-target annotations to infer TF activity and reanalyzed a hematopoietic differentiation dataset by Paul et al (Paul et al. 2016). The authors identified and described the main blood development trajectories including differentiation from stem cells towards erythrocytes and monocytes (Fig. 2ab). Next, we calculated two independent pseudotemporal trajectories for erythrocyte and monocyte differentiation (Fig. 2cd). We then applied DrivAER to identify TFs that are relevant for 7 erythrocyte and monocyte differentiation using the entire collection of motif gene sets contained in MolSigDB (Xie et al. 2005). Among all 495 gene sets included in the analysis, DrivAER identified the GATA TF family as the most relevant in the erythrocyte trajectory ( Fig. 2e, Table   S1). The DCA embedding derived from the "GATA_C" gene set showed strong clustering by pseudotime, demonstrating that GATA target gene expression is highly coordinated along this trajectory (Fig. 2f). Indeed, expression levels of GATA targets showed strong association with both pseudotime and DCA coordinates (Fig. 2g).
Of note, targets showed both up and down-regulation. A fraction of targets increased in expression along the trajectory while a smaller fraction decreased. When integrating annotation from the TTRUST database (Han et al. 2018) with the "GATAAGR_GATA_C" gene set, Fli1 expression was predicted to be repressed by TF GATA1 and, correspondingly, we observed a negative correlation along the trajectory between these two genes ( Fig. S2).
Among the most relevant TFs in the monocyte trajectory was PU1 (Fig. 2h, Table S1), which also showed strong association between the DCA embedding ( showed increased expression along their developmental trajectories, many other TFs exhibited a similar pattern, making it difficult to pinpoint the driving regulator (Fig. S3). Taken together, our findings demonstrate that DrivAER robustly explains the molecular mechanisms underlying complex biological processes.

Benchmarking DrivAER
To further assess and thoroughly benchmark our method in a controlled setting, we performed extensive simulation analysis. We used the Splatter (Zappia, Phipson, and Oshlack 8 2017) framework to simulate scRNA-seq data consisting of two groups of cells with subtle transcriptional differences where only 10% of genes were differentially expressed between the two groups. Subsequently, we generated different gene sets which varied in the number of truly differentially expressed (DE) genes (Fig. 3a). The gene sets ranged from sets without any truly DE genes (DE fraction = 0) to sets consisting of all truly DE genes (DE fraction = 1).
Visualization of all genes in reduced dimensions using UMAP showed no clear separation between the two cell groups (Fig. 3b). However, dimension reduction restricted to truly DE genes using DCA showed separation between the cells (Fig. 3c), indicating that while the signal may be weak across all genes, targeted dimension reduction of specific genes successfully recovered the underlying cellular manifold.
To evaluate methodological aspects underlying DrivAER, we performed the following analyses. With respect to the dimension reduction task, we compared DCA with principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) and tdistributed stochastic neighbor embedding (tSNE). Across the gene sets that vary in the fraction of truly DE genes, DCA overall achieved the highest relevance scores (Fig. 3d). At low fractions of DE genes, the alternative dimension reduction methods slightly outperformed DCA (Fig. S5).
Therefore, we implemented PCA, UMAP and tSNE based dimension reduction into the DrivAER framework. Users have the option to select any of these four dimension reduction methods for their DrivAER analysis.
Next, we compared random forest and support vector machines for the classification task. We did not observe any significant differences in performance between these two methods, indicating that random forest models represent an appropriate choice for this task (Fig. 3e). Moreover, we evaluated the impact of various hidden layer configurations during the DCA dimension reduction underlying DrivAER. We applied DrivAER using varying bottleneck layer sizes to the collection of simulated gene sets. The performance did not differ substantially across the three configurations, indicating that DrivAER is robust to various hidden layer 9 configurations (Fig. 3f). Even when the gene set contained only 20% of truly DE genes, the relevance score was significantly higher than that of random gene sets over 10 bootstraps (One-sided t-test, P < 0.05, Fig. 3g Figure   3b) instead of gene set specific manifolds (i.e. Figure 3c), it is less likely to capture subtle transcriptional differences. When using DrivAER, on the other hand, random gene sets achieved a relevance score around 0.5. This corresponds to the likelihood of taking a random guess with two classes. Correspondingly, gene sets consisting of all truly DE genes approached relevance scores close to 1. Therefore, relevance scores for categorical phenotype can be readily interpreted.
For additional comparison, we applied VISION and PAGODA to the interferon stimulation and blood development data sets (Fig. S4). All three methods clearly identified the correct TPs involved in interferon stimulation. In the erythrocyte trajectory, the GATA_C gene set achieved high scores using DrivAER and VISION but not PAGODA. For the monocyte trajectory, only DrivAER generated high relevance scores for PU1 related gene sets.

Discussion and conclusions
While autoencoders have been applied for unsupervised dimension reduction in bulk (Tan et al. 2015;Chen et al. 2016;Tan et al. 2017) and scRNA-seq data (Geddes et al. 2019; Lin, Mukherjee, and Kannan 2020), DrivAER makes use of autoencoders with a different goal.
By iterative application, DrivAER scores gene sets based on their relevance instead of trying to identify potential signatures that may not be captured in databases. Thus, while using autoencoders for unsupervised dimension reduction intrinsically, our method aims to rank gene sets in a supervised fashion.
Unlike VISION, DrivAER does not require a pre-defined distinction between the sign of regulation (repression or activation) of genes in a given gene set. The unsupervised nature of the DCA embedding captures any form of non-random, coordinated expression pattern.
Therefore, DrivAER captures complex, non-linear expression patterns commonly observed in scRNA-seq data. An additional benefit of DrivAER is its ability to visualize the gene set specific data manifold. These visualizations promote discovery of transcriptional regulation that may otherwise be hidden in the summary statistics generated by other methods including gene set enrichment analysis or VISION and PAGODA. Moreover, as demonstrated in the simulation analysis, DrivAER's relevance score is readily interpretable.
As illustrated in the blood development example, we divided the manifold into independent trajectories for interpretation. However, DrivAER provides the flexibility to be applied to the entire manifold or any subset of it. The user can make this choice and arbitrarily define regions of the manifold, which are expected to be regulated by a TP.
Additionally, as demonstrated in the blood development example, DrivAER enables users to make inferences about regulators that were not measured or where measurements are noisy. We envision that users will apply DrivAER to infer activity of regulators not generally detected in scRNA-seq data such as microRNAs and long noncoding RNAs.
In the current approach DCA needs to be retrained for each gene set because the input genes and thus the network architecture changes between gene sets. Therefore, the running time of DrivAER depends on the number of gene sets included in the analysis. In the interferon stimulation analysis, the running time per gene sets averages between 20-30 seconds depending on the number of genes and convergence of the model. To improve speed, we plan to extend DrivAER by developing a "hot-start" approach in future work.
In summary, specialized methods facilitating the functional interpretation of scRNA-seq data are needed to fuel the rapid progress in the field. DrivAER is a novel machine learning approach that is effective for manifold interpretation in scRNA-seq data. Our results demonstrate that relevance scores represent a useful measure to extract driving transcriptional regulators from complex scRNA-seq data sets. DrivAER, including interactive usage tutorial, is freely available from Github (https://github.com/lkmklsmn/DrivAER) and we anticipate broad usage by the community.

Transcriptional program annotations
The Molecular Signatures Database (MolSigDB , v7.0) was used to define transcriptional programs (Subramanian et al. 2005). The Hallmark gene set contained 50 gene sets corresponding to specific well-defined biological processes (Liberzon et al. 2015 (Hinton and Salakhutdinov 2006). One important characteristic that distinguishes DCA from other dimension reduction methods is a scRNA-seq specific noise model. The bottleneck layer captures the compression and represents the data manifold. As default for DrivAER, we set the bottleneck dimension to two neurons. DCA takes a raw count 13 matrix as input and outputs the data manifold coordinates using the parameter mode = "latent".
To account for differences in library size, size factors derived from the transcriptome-wide, instead of gene set specific, expression matrix are being fed into DCA.
The relevance scores are derived using random forest models as implemented in the Python module sklearn (v0.21.2). Once DCA has reduced the dimensions, the two-dimensional data manifold coordinates are used as input features and the variable of interest as outcome in the random forest model. For categorical outcomes, "sklearn.ensemble.RandomForestClassifier" is used. For continuous outcomes, such as pseudotemporal trajectories, "sklearn.ensemble.RandomForestRegressor" is used. The number of trees was set to 500. The Out-of-Bag accuracy score of the TP-specific random forest model represents the relevance score.
For the benchmarking purposes only, we applied support vector machine classification as implemented in the R package e1071 with default parameters. Additionally, we implemented three alternative dimension reduction methods into the DrivAER framework. PCA, tSNE and UMAP were implemented using the Scanpy functions "pp.pca", "tl.tsne" and "tl.umap", respectively. All functions use Scanpy's default parameters.
Simulation analysis scRNA-seq data was simulated using the splatter R package (Zappia, Phipson, and Oshlack 2017). Specifically, the splatSimulate() function was used to simulate scRNA-seq data with two equally sized groups, consisting of 500 genes and 2000 cells. The default gene expression and library size parameters were used. To simulate subtle transcriptional differences, the proportion of differentially expressed genes was set to 0.1 and the differential expression factor was set to 0.01. To include specific noise commonly encountered in scRNAseq data, the dropout type was set to "experiment". The "dropout.mid" parameter was set to 5 and "dropout.shape" was set to -1. The splatSimulate() function was also used to simulate 14 scRNA-seq data with four unbalanced groups, containing 1000 genes and 4000 cells. The proportion of cell numbers in these four groups was set to 0.1, 0.2, 0.3, and 0.4, respectively.
The "dropout.mid" parameter was set to 2 and "dropout.shape" was set to -1.
To simulate gene sets from a continuous spectrum of relevance the following approach was used. Gene sets were created by combining truly differentially expressed genes with genes showing no expression difference between the two groups. We generated gene sets containing six different fractions of truly DE gene sets (0, 0.2, 0.4, 0.6, 0.8, 1). Ten bootstrap samples were generated at each fraction. These 60 simulated gene sets were used for DrivAER evaluation.
For the evaluation of DrivAER using different configurations of hidden layers five bootstrap samples were generated at each fraction of truly DE gene sets. These 30 simulated gene sets were subjected to three different configurations of hidden layers (4, 2, 4), (8,4,8) and (16,8,16), in an independent analysis.

Interferon stimulation analysis
The scRNA-seq data set of 29,065 PBMCs from lupus patients with and without interferon stimulation were obtained from the GEO database (GSE96583). The tSNE coordinates as well as cell type and state (stimulated or unstimulated) information displayed in   (Wolf, Angerer, and Theis 2018) built-in datasets using the "scanpy.datasets.paul15()" function.
Expression data consists of 2730 hematopoietic stem cells and 3451 genes. The preprocessing of the data was performed following the Scanpy tutorial using "scanpy.pp.recipe_zheng17()" function. Specifically, the 1000 most highly variable genes were selected for downstream analysis. Louvain clustering (version 0.6.1) was conducted with resolution of 1, which resulted in 25 clusters. Clusters were annotated based on the expression of canonical cell type marker genes. Two major developmental trajectories were identified, namely the differentiation of hematopoietic stem cells to erythrocytes and monocytes. Pseudotemporal ordering was independently calculated for these two trajectories using the "scanpy.tl.dpt" function. DrivAER was applied to the raw counts and pseudotemporal ordering of each trajectory independently to infer relevant TPs.
Expression data for the Nestorowa data set was obtained from the "Gene and protein expression in adult hematopoiesis" website (http://blood.stemcells.cam.ac.uk/single_cell_atlas).
Based on the provided annotation, cells were divided into the erythrocyte and monocyte trajectory. Pseudotemporal ordering was calculated as described above. DrivAER was applied as described above. can provide more advanced latent space models. Next, VISION calculates a signature score for each annotated gene set and subsequently assesses whether the signature score is randomly distributed throughout the cellular manifold using a local autocorrelation statistic, the Geary's C (Geary 1954      ...

Figure 2
Click here to access/download; Figure;Figure2