scLM: Automatic Detection of Consensus Gene Clusters Across Multiple Single-cell Datasets

In gene expression profiling studies, including single-cell RNA sequencing (scRNA-seq) analyses, the identification and characterization of co-expressed genes provides critical information on cell identity and function. Gene co-expression clustering in scRNA-seq data presents certain challenges. We show that commonly used methods for single-cell data are not capable of identifying co-expressed genes accurately, and produce results that fall substantially short of biological expectations for co-expressed genes. Herein, we present the single-cell Latent-variable Model (scLM), a gene co-clustering algorithm tailored to single-cell data that performs well at detecting gene clusters with significant biological context. Importantly, scLM can simultaneously cluster multiple single-cell datasets, i.e., perform consensus clustering, enabling users to leverage single-cell data from multiple sources for novel comparative analysis. scLM takes raw count data as input and preserves biological variation without being influenced by batch effects from multiple datasets. Results from both simulated and experimental data demonstrate that scLM outperforms existing methods with considerably improved accuracy. To illustrate the biological insights that scLM can provide, we apply it to our in-house and public experimental scRNA-seq datasets. scLM identifies novel functional gene modules and refines cell states, which facilitates mechanism discovery and understanding of complex biosystems such as cancers. A user-friendly R package with all the key features of the scLM method is available at https://github.com/QSong-github/scLM.

[...] likelihood. That is, we replace the value in the parameter updates by its expectation with respect [...] we cluster genes that are projected in the latent space to identify co-expressed genes. Here we [...] GitHub repository: https://github.com/QSong-WF/scLM. Based on the characteristics of single-cell data, we used the negative binomial distribution to simulate two synthetic cohorts (SD1, SD2). Each synthetic cohort contains 9 sets of simulated [...] are provided in the scLM example data. [...] constituting four groups of co-expressed genes as the ground truth, which was achieved by adjusting the 'de.prob' parameter. We also added dropout effects to these simulated data by [...] protocol developed for human PBMCs by 10X Genomics (San Francisco, CA).
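
The simulation idea described above (negative binomial counts, known groups of co-expressed genes as ground truth, and added dropout) can be illustrated with a minimal sketch. This is not the authors' simulation code: the function name `simulate_counts`, the log-linear mean structure, the equal group sizes, the dispersion value, and the uniform dropout rate are all assumptions chosen for illustration, and the sketch does not use the 'de.prob' parameter mentioned in the text.

```python
import numpy as np

def simulate_counts(n_genes=120, n_cells=200, n_groups=4, dispersion=0.5,
                    dropout_rate=0.2, seed=0):
    """Simulate a genes x cells count matrix with co-expressed gene groups.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    # Ground-truth group label per gene (equal-sized groups, an assumption).
    labels = np.repeat(np.arange(n_groups), n_genes // n_groups)
    # One latent activity profile across cells for each gene group.
    group_activity = rng.normal(0.0, 1.0, size=(n_groups, n_cells))
    # Gene-specific baseline and cell-specific depth factor, on the log scale.
    gene_baseline = rng.normal(2.0, 0.5, size=(labels.size, 1))
    cell_depth = rng.normal(0.0, 0.2, size=(1, n_cells))
    mu = np.exp(gene_baseline + cell_depth + group_activity[labels])
    # Negative binomial with mean mu and variance mu + dispersion * mu**2.
    r = 1.0 / dispersion
    p = r / (r + mu)
    counts = rng.negative_binomial(r, p)
    # Dropout: set entries to zero at a uniform random rate (an assumption).
    mask = rng.random(counts.shape) < dropout_rate
    counts[mask] = 0
    return counts, labels

counts, labels = simulate_counts()
```

Genes sharing a label follow the same latent activity profile across cells, so a gene clustering method can be scored against `labels`.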

All scRNA-seq procedures in this study were performed by the Cancer Genomics Shared Resource (CGSR). The scRNA-seq data were deposited in the NCBI GEO database (GEO [...] cells and genes expressed in over 300 cells as input. [...] Each clustering result produced by applying a specific clustering method to a specific dataset [40]. The CH index evaluates cluster validity based on the average between- and within-cluster [...] between objects in different clusters as the inter-cluster separation and the maximum diameter among all clusters as the intra-cluster compactness.
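
The two validity measures described above can be sketched in a few lines. The code below assumes that "CH index" refers to the Calinski-Harabasz index in its standard form (ratio of average between-cluster to average within-cluster sum of squares) and that the separation/diameter description refers to the Dunn index with Euclidean distances. The function names and the toy data are hypothetical, and this is not taken from the scLM package.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index: between- vs. within-cluster sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = X.shape[0], clusters.size
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance over maximum cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix (fine for small examples).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    min_separation = dist[~same].min()   # closest objects in different clusters
    max_diameter = dist[same].max()      # widest cluster
    return min_separation / max_diameter

# Toy example: two tight, well-separated clusters of two points each.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
print(ch_index(X, labels), dunn_index(X, labels))
```

For both measures, larger values indicate more compact and better separated clusters.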

Cell clustering based on co-expressed gene modules

With the co-expressed gene modules, we utilized the mean value of the modules in each single cell [...] at 20. The n_neighbors value was set at 15, and min_dist was set at 0.1. [...] Hochberg correction; P-values less than 0.05 were considered statistically significant.

We developed a new method, the single-cell Latent-variable Model (scLM), for simultaneously identifying consensus co-expressed genes from multiple scRNA-seq datasets. Our hypothesis is that co-expressed genes coordinating biological processes can be captured across multiple different datasets. In our model, we assumed that latent variables could capture the intrinsic [...] including the technical variances at the cell level (j) and batch effects at the sample level (k).
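
The step of summarizing each cell by the mean value of every co-expressed gene module can be sketched as follows. The function name `module_scores` and the genes x cells layout are assumptions made for illustration; the sketch says nothing about any normalization the authors may have applied beforehand.

```python
import numpy as np

def module_scores(expr, gene_modules):
    """Average expression of each gene module within each cell.

    expr:         genes x cells expression matrix
    gene_modules: module label for each gene (same order as the rows of expr)
    Returns the sorted module labels and a cells x modules score matrix.
    """
    expr = np.asarray(expr, dtype=float)
    gene_modules = np.asarray(gene_modules)
    modules = np.unique(gene_modules)
    # One column per module: mean over the genes that belong to that module.
    scores = np.column_stack(
        [expr[gene_modules == m].mean(axis=0) for m in modules]
    )
    return modules, scores

# Toy example: three genes, two cells, two modules.
expr = [[1, 2], [3, 4], [10, 20]]
modules, scores = module_scores(expr, ["A", "A", "B"])
```

The resulting cells x modules matrix is the kind of input that a dimensionality-reduction or cell clustering step would take. The n_neighbors and min_dist values quoted in the text match parameter names of the `UMAP` class in the umap-learn package, although which implementation the authors used is an assumption here.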

The latent variables and other parameters are estimated and obtained using Markov Chain Monte [...] Methods. The software to implement the scLM method is available at https://github.com/QSong- [...] in the comparison on real single-cell data.

We first generated two synthetic data cohorts (SD1, SD2) from negative binomial [...] datasets (n=2), and so on. Each set contained three co-expressed gene clusters as ground truth.

Additionally, we utilized the Splatter package [34] to generate another two batches of simulated data (SD3, SD4) with dropout effects, which can more accurately recapitulate actual scRNA-seq data distributions. Details of the simulation datasets are provided in the Methods section. With the simulated data cohorts, we applied scLM and other methods (LTMG, SCN, Seurat- [...] clustering accuracy, we used the adjusted Rand index (ARI) as the performance metric to rank these [...] in SD1 (mean ± SE: 0.627 ± 0.028) and SD3 (mean ± SE: 0.520 ± 0.070), respectively. LTMG showed a slightly higher ARI and lower variance in the four data cohorts. These results [...] (HNSCC), and melanoma. The data pre-processing procedures are described in the Methods section.
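
For readers unfamiliar with the adjusted Rand index used above, the following minimal sketch computes it from a ground-truth gene labeling and a predicted one, using the standard pair-counting formula. The function name and the toy labelings are illustrative, and the sketch is not taken from the scLM package.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(truth)
    # Pairs of objects placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed together in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions score 1 even when the cluster names differ.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI of 1 means the predicted gene clusters reproduce the ground truth exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.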

To assess and quantify clustering accuracy on real datasets, we used performance metrics [...] index [39], to rank these methods. Importantly, scLM produced sets of clusters that showed significantly higher CH value than other methods (Figure 3A), especially higher than LTMG (P- [...] cluster validity than other methods based on average between- and within-cluster sum of squares.

In addition, compared to other methods, scLM also achieved significantly higher Dunn index [...] lower DB index scores reflecting higher cluster quality (Figure 3C). Though SCENIC and Seurat-wgcna showed higher DB index score in the HNSCC dataset, they failed to show superior performance on the other datasets. Thus scLM proved to achieve the best partitioning of co-expressed gene clusters that are most distinct from each other.

As co-expressed genes are likely to be enriched with biological functions, we compared the extent to which different methods affect the functional discovery, based on their identified co- [...] methods, yet scLM identified the most on all the datasets. Some methods, like LTMG, failed to identify gene clusters with enriched terms at the threshold of adjusted p-value < 0.01.
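
The DB index mentioned above can be sketched as follows, assuming it refers to the Davies-Bouldin index in its standard form with Euclidean distances. The function name and the toy data are hypothetical, and this is not the implementation used in the paper.

```python
import numpy as np

def db_index(X, labels):
    """Davies-Bouldin index: average worst-case similarity between clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter of a cluster: mean distance of its members to the centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    k = clusters.size
    worst = np.zeros(k)
    for i in range(k):
        # Similarity to every other cluster: summed scatter over centroid distance.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst[i] = max(ratios)
    return worst.mean()

# Toy comparison: the same cluster shapes, far apart versus close together.
far = db_index([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
near = db_index([[0, 0], [0, 1], [2, 0], [2, 1]], [0, 0, 1, 1])
```

In this toy comparison the well-separated configuration gives the smaller value, which matches the statement that lower DB index scores reflect higher cluster quality.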

In addition to GO terms, we also examined the enriched pathways in the Reactome database, [...] In real-world scenarios, single cells from different patients or different data sources often [...] issues. The scLM method is designed to address such highly unbalanced data that could [...] intentionally selected patient samples that varied with respect to cell number, which could create challenges for this method. As a case study, we used scLM to analyze our in-house scRNA-seq [...] Using the 12 co-expression modules, the single cells were separated into two major clusters.

In each cluster, cells from different patients mixed well without interference from batch effects (Figure 5B, right panel), which further supports that the co-expression modules are consistent across patients. Interestingly, we found that cluster 1 had higher expression of epithelial functional markers (EMT-related genes) than cluster 2 (Figure 5C). These results indicate that co-expression modules are capable of characterizing specific cell phenotypes. Similarly, in normal single cells, we observed 13 co-expressed gene modules (N1-N13) that [...]