Clustering single-cell multi-omics data via graph regularized multi-view ensemble learning

Abstract Motivation Single-cell clustering plays a crucial role in distinguishing between cell types, facilitating the analysis of cell heterogeneity mechanisms. While many existing clustering methods rely solely on gene expression data obtained from single-cell RNA sequencing techniques to identify cell clusters, the information contained in mono-omic data is often limited, leading to suboptimal clustering performance. The emergence of single-cell multi-omics sequencing technologies enables the integration of multiple omics data for identifying cell clusters, but how to integrate different omics data effectively remains challenging. In addition, designing a clustering method that performs well across various types of multi-omics data poses a persistent challenge due to the data’s inherent characteristics. Results In this paper, we propose a graph-regularized multi-view ensemble clustering (GRMEC-SC) model for single-cell clustering. Our proposed approach can adaptively integrate multiple omics data and leverage insights from multiple base clustering results. We extensively evaluate our method on five multi-omics datasets through a series of rigorous experiments. The results of these experiments demonstrate that our GRMEC-SC model achieves competitive performance across diverse multi-omics datasets with varying characteristics. Availability and implementation Implementation of GRMEC-SC, along with examples, can be found on the GitHub repository: https://github.com/polarisChen/GRMEC-SC.


Introduction
Cells serve as the fundamental units of living organisms, where metabolic activities and genetic reproduction take place (Zeng and Sanes 2017).Understanding cell type information is crucial for comprehending cellular mechanisms and growth trajectories (Kiselev et al. 2019).Moreover, cell type information is often required for various downstream analyses, including developmental trajectory analysis and cancer subtype recognition (Kiselev et al. 2019).However, determining cell types is often challenging as they are usually unknown in advance.Gene expression profiling provides an opportunity to accurately distinguish between cell types (Lacey et al. 1992, Trapnell et al. 2014), but traditional bulk sequencing technologies cannot capture and analyze individual cells effectively (Brawand et al. 2011).The emergence of single-cell sequencing makes it feasible to study transcriptome heterogeneity among cells through single-cell RNA sequencing (scRNA-seq) (Jaitin et al. 2014), paving the way for identifying cell types by clustering scRNA-seq data (Stegle et al. 2015, Kiselev et al. 2017).
Clustering scRNA-seq data faces a number of challenges (Kiselev et al. 2019), such as high dimensionality (Bellman 1966), dropout events (Andrews and Hemberg 2019), and high noise rate.Recently, numerous computational approaches have been developed for clustering scRNA-seq data.For instance, CountClust (Dey et al. 2017) clusters cells by referring to a generative probabilistic model, i.e. latent Dirichlet allocation.CIDR (Lin et al. 2017) calculates correlated distances and performs hierarchical clustering on scRNA-seq data against dropout events.Seurat (Stuart et al. 2019) improves clustering accuracy by considering k-nearestneighbors (k-NN) (Altman 1992) of spatially resolved gene expression profiling.SIMLR (Wang et al. 2017) focuses on learning a distance map with multi-kernel learning.
The above methods solely rely on transcriptome information (i.e.scRNA-seq data) to distinguish cell types, but the quality of scRNA-seq data limits their performance.Recently, the advent of single-cell multi-omics sequencing technologies, such as SHARE-seq (Ma et al. 2020), SNAREseq (Chen et al. 2019), and CITE-seq (Stoeckius et al. 2017), enables the measurement of multiple omics (e.g.gene expression and chromatin accessibility) within individual cells simultaneously.As different omics data may capture the cell type information from different aspects, integrating multiple omics data can lead to more accurate and comprehensive clustering results, and provide deeper insights into cellular heterogeneity (Nie et al. 2017a,b, Tao et al. 2017, Luo et al. 2018, Zou et al. 2022, Zhan et al. 2023).Various computational approaches have been developed to cluster single-cell multi-omics data.For example, BREM-SC (Wang et al. 2020) utilizes a Bayesian random effects mixture model to jointly cluster scRNA-seq and antibody-derived tags (ADTs) data obtained from CITE-seq.DCCA (Zuo et al. 2021) combines variational autoencoders (VAEs) and attention-transfer to jointly analyze scRNA-seq and singlecell genomics (scEpigenomics) data from the same cell.
While the aforementioned methods have offered valuable tools for cell clustering, finding a method that performs effectively in all scenarios remains challenging due to their model assumptions and the diversity of data distributions.For example, BREM-SC may deliver satisfactory performance on CITEseq datasets but exhibit poor performance on scATAC-seq datasets.Consensus clustering, on the other hand, can utilize clustering results obtained from various base clustering methods to improve clustering performance (Strehl and Ghosh 2002).However, existing consensus clustering methods heavily depend on the base clustering results.
To tackle the aforementioned challenges, we introduce a novel Graph Regularized Multi-view Ensemble Clustering (GRMEC-SC) model for clustering single-cell multi-omics data.Initially, we use multiple mono-omic clustering methods on each omic data independently to derive the base clustering results, which serve as guidance for learning the consensus co-cluster affinity matrix.To consider the information inherent in the original omics data, we introduce a weighted ensemble nonnegative matrix factorization (NMF) (Lee andSeung 1999, Nie et al. 2017a,b) strategy to extract the consensus low-dimensional representations of cells from multiple omics data.Subsequently, the consensus lowdimensional representations and the consensus co-cluster affinity matrix learned from multi-omics data and multiple base clustering results are integrated into two graph regularization terms to facilitate the learning of the final cluster indicator matrix.To showcase the effectiveness of our GRMEC-SC model, we perform experiments on five real-world multiomics datasets.The experimental results illustrate that our GRMEC-SC model can deliver consistent and competitive performance across a range of datasets with diverse characteristics, outperforming existing state-of-the-art mono-omic and multi-omics clustering algorithms.

Materials and method
In this section, we will outline the details of our Graph Regularized Multi-view Ensemble Clustering (GRMEC-SC) model.The flowchart of our model is illustrated in Fig. 1.We will commence by formulating the problem, followed by an in-depth explanation of our GRMEC-SC model and the derivation of its optimization scheme.Lastly, we will analyze the overall complexity of the model.

Problem formulation
Throughout this paper, matrices are written in upper-case letters and scalars in lower-case letters.The (i, j)th entry of a matrix X are denoted as X ij .The ith row or jth column of a matrix X are expressed as X i� or X �j .X T denotes the transpose of matrix X. trðXÞ stands for the trace of matrix X. jjXjj F represents the Frobenius norm of matrix X. Besides, 1 n denotes a column vector whose elements are all one.
Given a single-cell multi-omics dataset including m omics data, each omic data captures the information of p v features for the same set of n cells, which can be represented as a matrix X v 2 R n×pv , 1 ≤ v ≤ m.As all the entries in X v are nonnegative, we adopt NMF (Lee andSeung 1999, Wang andZhang 2013) to extract low-dimensional representations of cells from each omic data X v .Note that different omics data describe the same set of cells from different aspects.To extract the intrinsic consistent information inherent in multiple omics data to detect cell clusters more accurately, we introduce the following joint NMF (Zhang et al. 2012) (1) where denotes the consensus low-dimensional representations of cells learned from multiple omics data and denotes the corresponding basis in the feature space of each omic data, and k is the dimension of latent representations which is set to 50 by default.The impact of this parameter is demonstrated in the Supplementary Materials.
Due to inherent noise and biases in single-cell sequencing technologies, the reliability of various omics data may vary.To prevent the model from being misled by noisy information and to enhance the integration of multiple omics data effectively, we assign different weights to each omics data and introduce the following weighted fusion strategy: (2) where w v is the weight assigned to vth omic data.To learn these weights adaptively from the data, following (Nie et al. 2017a), we adopt the following function to calculate w v : denote the cluster indicator matrix, where C is the number of clusters.By enforcing P C c¼1 H ic ¼ 1 for i ¼ 1; . . .; n, each entry H ic of H denotes the probability that ith cell belongs to cth cluster.
Based on H, we can calculate an affinity matrix S H ¼ HH T , which captures the co-cluster similarities between cells.The higher the value of S H ij , the more likely it is that the ith and jth cells belong to the same cluster.
To learn W and H adaptively, by adding a graph regularization term to Equation (2) (Cai et al. 2010), we formulate the following objective function: where By minimizing Equation ( 4), cells with similar representations will have similar cluster assignments.This approach allows the representation learning process and the clustering process to mutually guide each other, leading to more precise results.
As various clustering approaches have been devised to identify cell clusters from mono-omic data like scRNA-seq data, we can generate multiple base clustering results by using different clustering models on a specific type of omic data.To leverage the insights extracted by diverse models, we introduce an ensemble learning strategy to integrate multiple base clustering results into our model.In particular, we apply different base clustering models on each omic data X v to generate the base clustering results.The rth base clustering result on vth omic data is denoted as P rv ; 1 ≤ r ≤ q; 1 ≤ v ≤ m, where P rv ¼ fp rv 1 ; . . .; p rv n g and p rv i denotes the cluster label to which the ith cell is assigned.
Then the base clustering result is transformed into a cocluster indicator matrix S rv 2 f0; 1g n×n : The following loss function is utilized to learn a consensus cocluster affinity matrix S from various base clustering results S rv : min Here, each entry S ij of S denotes the co-cluster propensity between ith and jth cells learned from multiple base clustering results.If all base clustering models consider these two cells to belong to the same cluster, they will achieve high S ij .Note that different base clustering results behave differently on capturing the true cell clusters.Thus, to integrate multiple base clustering results effectively and adaptively, instead of treating each base clustering equally, we introduce the following weighted ensemble learning approach: where α rv is defined as: The self-weighting approach described above eliminates the need for additional parameters, thereby reducing the complexity of tuning.To utilize S to guide the learning of cluster indicator matrix H, we impose a graph regularization term on the above loss function and obtain the following objective function: where L S ¼ D S −S is the Laplacian matrix of S and D S is a diagonal matrix with D S ii ¼ P n j¼1 S ij , and λ 2 is a tradeoff parameter.
To learn the consensus low-dimensional representation W, the consensus co-cluster affinity matrix S and the cluster indicator matrix H from multiple omics data and multiple base clustering results jointly, we combine Equations ( 4) and ( 9), and obtain the final objective function of our GRMEC-SC model as follows: where λ 1 and λ 2 are the trade-off parameters.The true cluster index z of ith cell falls on the maximum value H iz of row vector H i� , which means ith cell is most likely belong to zth cluster.This property allows us to directly obtain the clustering result by iteratively locating the largest index for each row of H.

Optimization
Similar to previous studies (Nie et al. 2017a,b), we solve the above problem by using an alternating iterative optimization strategy.In particular, we split the overall objective function into several subproblems and solve them individually with respect to one variable while fixing the others.In the meantime, w v and α rv will be updated according to Equation ( 3) and ( 8) respectively.Solving Equation ( 10) with respect to W is equivalent to: Taking the derivative of Equation ( 11) with respect to W and set it to zero, we can obtain the following updating rule of W as: Solving Equation ( 10) with respect to V v is equivalent to: The updating rules of V v can be deduced as follows: Solving Equation ( 10) with respect to S is equivalent to: where The updating rule of S is as follows: Solving Equation ( 10) with respect to H is equivalent to: where The updating rule of H is as follows:

Complexity analysis
The time complexity of updating W, V v , S and H once are and Oðn 2 CÞ respectively (k is typically smaller than n).Therefore, the total time complexity of updating our model once is Oðð

Experiments
In this section, we thoroughly evaluate the performance of our GRMEC-SC method on five single-cell multi-omics datasets.We extensively compare GRMEC-SC with both monoomic clustering and multi-omics clustering algorithms.

Datasets
To evaluate the performance of various models on different types of single-cell multi-omics data, we collect five different single-cell multi-omics datasets, including Inhouse (Wang et al. 2020), Public (Wang et al. 2020), 10x-pbmc-3k (https:// www.10xgenomics.com/resources/datasets/pbmc-from-a-healthydonor-no-cell-sorting-3-k-1-standard-2-0-0),Ma-2020(Ma et al. 2020), and cellMix (Chen et al. 2019).Here, Inhouse and Public are both CITE-seq data which measure matched transcriptome and surface protein data, while 10x-pbmc-3k, Ma-2020 and cellMix include matched scRNA-seq and scATAC-seq data.We use the preprocessing strategy utilized in Seurat (Hao et al. 2021) to preprocess scRNA-seq data and select the top 1000 highly variable genes (HVGs) for each scRNA-seq dataset.For ATAC-seq data, we utilize the Signac (Stuart et al. 2021) package to select peaks.Furthermore, for each dataset, we calculate the number of genes with zero expression for j>ɛ and t≤MaxIterations do 3: Update W with Eq. ( 12); 4: Update V v with Eq. ( 14); 5: Update S with Eq. ( 16); 6: Update H with Eq. ( 18) and unitize H by row; 7: Update the weight w v and α rv with Eq. (3) and Eq. ( 8); 8: Calculate objective function L t by Eq. ( 10); 9: end while each cell and eliminate cells with a high number of genes having expression values of 0. Table 1 summarizes the details of the real datasets used in this study.

Evaluation metrics
Two evaluation criteria, i.e.Adjusted rand index (ARI) (Hubert and Arabie 1985) and Normalized Mutual Information (NMI) (McDaid et al. 2011), are used for assessing the performance of various methods.For detailed definition of these two metrics, please refer to Supplementary Materials.

Benchmark and parameter settings
We compare the performance of GRMEC-SC with five mono-omic clustering algorithms and eight multi-omics clustering methods on five single-cell multi-omics datasets.Specifically, we compare our model with five popular scRNA-seq data clustering algorithms, i.e.CIDR (Lin et al. 2017), CountClust (Dey et al. 2017), SC3 (Kiselev et al. 2017), Seurat (Stuart et al. 2019), and SIMLR (Wang et al. 2017).Note that we also use these five methods as base clustering methods to generate base clustering results for each dataset.As these mono-omic clustering methods cannot handle multi-omics data, we apply them on each omic data separately and use their clustering results as the base clustering results for our model.Given that existing multi-omics clustering models are typically tailored for a particular type of multi-omics data, we specifically include a comparison with BREM-SC (Wang et al. 2020), CITEMO (Hu et al. 2022), and CiteFuse (Kim et al. 2020) on the CITE-seq datasets, which are designed for CITE-seq data.For multi-omics datasets containing both scRNA-seq and scATAC-seq data, we include the comparison with JSNMF (Ma et al. 2022), scMCs (Ren et al. 2023), andMOFAþ (Argelaguet et al. 2020).In addition, we also compare our model with five popular multi-view clustering methods, including MVEC (Tao et al. 2017), SWMC (Nie et al. 2017b), MVAN (Nie et al. 2017a), CSMV (Luo et al. 2018), and Seurat-WNN (Hao et al. 2021).
To ensure a fair comparison, the number of clusters for all clustering methods is set to the true number of cell clusters, except for CIDR and Seurat, which have the capability to adaptively predict the number of clusters.For our GRMEC-SC model, we select the optimal hyper-parameters λ 1 and λ 2 from range f10 −4 ; 10 −3 ; . . .; 10 4 g.The same parameter tuning process is carried out for the other comparison methods to achieve their optimal results, with the exception of SWMC, MVAN, and JSNMF.SWMC is a parameter-free algorithm, while MVAN and JSNMF automatically adjust the trade-off parameter depending on the number of connected components in the consensus matrix.

Parameter sensitivity
There are two trade-off hyper-parameters λ 1 and λ 2 in our GRMEC-SC model.λ 1 controls the agreement between the low-dimensional representation W and the cluster indicator matrix H, while λ 2 controls the effect of the consensus cocluster affinity matrix S on the final clustering results.To assess the sensitivity of GRMEC-SC to these parameters, a parameter sensitivity experiment is conducted.The performance of GRMEC-SC is assessed using NMI and ARI metrics.The values of λ 1 and λ 2 are selected from f10 −4 ; 10 −3 ; . . .; 10 4 g.
The performance of GRMEC-SC with varying values of λ 1 and λ 2 on the Inhouse dataset is depicted in Fig. 2. Results for the remaining datasets can be found in Supplementary Figs S2-S5.From these figures, it is evident that when λ 1 and λ 2 are sufficiently large, the performance of GRMEC-SC remains relatively stable.Based on these results, we recommend setting the default values of λ 1 and λ 2 to be 0.1 and 0.01.

Clustering results
Firstly, we compare GRMEC-SC with five mono-omic clustering methods, which are also utilized as base clustering methods.The results are depicted in Fig. 3.It can be observed from Fig. 3a that on the two CITE-seq datasets, our GRMEC-SC outperforms these base clustering methods in most cases in terms of the two evaluation metrics.From Fig. 3b, we can see that on the other three multi-omics datasets, GRMEC-SC still demonstrates competitive performance compared to the base clustering methods.In addition, the performance of each base clustering method varies across different datasets, with none of them dominating the others consistently.Since the base clustering methods are designed primarily for clustering scRNA-seq data and the characteristics of scATAC-seq data differ from those of scRNA-seq data, these methods show suboptimal performance on the scATAC-seq data from the cellMix and Ma-2020 datasets.By introducing a weighted ensemble learning strategy to integrate multiple base clustering results and incorporating the original omics data to reduce the model's dependence on base clustering results, our method achieves stable and robust performance across all cases.
Furthermore, we compare GRMEC-SC with eleven multiomics clustering methods, and the corresponding results are shown in Tables 2 and 3. From these tables, it is evident that our GRMEC-SC can outperform other compared methods in the majority of cases, highlighting the effectiveness of our representation learning and ensemble learning strategies.Unlike existing multi-omics clustering methods that are tailored for specific data types and may not be applicable to other data types, our GRMEC-SC is versatile in handling various types of multi-omics data and consistently achieves strong performance.
Our GRMEC-SC has the capability to learn consensus low-dimensional representations of cells from multi-omics data.To demonstrate the effectiveness of our model in learning cell representations, we compare the UMAP visualization of the original scRNA-seq data with our learned representations.Supplementary Fig. S6 illustrates the UMAP visualization on two CITE-seq datasets.Supplementary Fig. S6a and c displays the results obtained from the raw scRNA-seq data, while Supplementary Fig. S6b and d presents the results derived from the consensus low-dimensional representation W learned by our GRMEC-SC.It is apparent from these figures that distinguishing different cell types from raw scRNA-seq data is challenging, whereas our learned low-dimensional representation facilitates a clearer differentiation of various cell types.

Marker gene identification
Marker genes are highly expressed genes in specific cell types.
It is worth noting that finding the marker genes of cell clusters is an important step toward cell type annotation.Here, we use Wilcoxon rank-sum test to detect differential expression genes for each identified cluster and ranked them based on P-values.The identified top-4 differentially expressed genes of each cluster detected from the 10X-pbmc-3k dataset are shown in Fig. 4. We can find from Fig. 4a that the marker genes of all clusters except the first three are distinct.This observation may be attributed to the similarity among these three cell types, making it challenging to differentiate them directly based on marker genes alone.Integrating gene expression levels and gene distribution could potentially enhance the ability to distinguish these cell types more effectively.Figure 4b shows the expression levels of three marker genes in different cell clusters.It is clear that the expression level of the WDFY4 is higher in cluster 2 than in other clusters, and the expression levels of the ITM2C in cluster 3 and the NKG7 in cluster 4 are significantly higher than in other clusters, indicating their biological interpretability.These visualization results are obtained using the Scanpy package (Wolf et al. 2018).

Conclusion and discussion
In this study, a novel graph-regularized multi-view clustering model, named GRMEC-SC, has been developed for clustering single-cell multi-omics data.Unlike previous studies that primarily focus on clustering mono-omic data or specific It is essential to recognize that the computational complexity of our model presents challenges, especially when handling large-scale datasets.With the continuous advancements in sequencing technologies resulting in the generation of vast amounts of single-cell data, addressing the computational hurdles linked to such extensive datasets will be a key area of focus in our future research pursuits.We aim to explore strategies for efficient clustering on large-scale datasets, thereby improving the applicability and scalability of our model.

Figure 1 .
Figure 1.Schematic overview of the proposed GRMEC-SC model.Initially, multiple base clustering methods are applied to each omic data separately to produce base clustering results.Then the consensus low-dimensional representation matrix W and the consensus co-cluster affinity matrix S are learned from multi-omics data and multiple base clustering results, respectively.Finally, they are incorporated into two graph regularization terms to guide the learning of cluster indicator matrix H.
types of multi-omics data, our model is designed to handle various types of multi-omics data with diverse characteristics.Specifically, our model can simultaneously learn the consensus low-dimensional representations and the consensus cocluster affinity matrix of cells from multiple omics data and multiple base clustering results.By leveraging information from different data sources, our model demonstrates effectiveness across different types of multi-omics data.The performance of our model is rigorously evaluated through comprehensive experiments on five diverse multi-omics datasets.The experimental results showcase the superiority of our model compared to other state-of-the-art clustering methods.

Figure 3 .
Figure 3.The clustering performance of GRMEC-SC and various base clustering methods is evaluated on different datasets.Each base clustering method is applied individually to each omic data to assess its performance.(a) Results on CITE-seq datasets.(b) Results on multi-omics datasets that include both scRNA-seq and scATAC-seq data.

Figure 4 .
Figure 4. Our identified marker genes on the 10X-pbmc-3k dataset.(a) Dot diagram illustrating the average expression of the top-4 differentially expressed genes within each predicted cluster.(b) Violin plots showcasing the identified marker genes across different clusters.
Cluster indicator matrix H and cluster labels corresponding to the maxima of each row of H.
1: Initialize W, V v , S,and H randomly,

Table 1 .
Statistics of the single-cell multi-omics datasets.

Table 2 .
The performance of various methods on the multi-omics datasets which include both scRNA-seq and scATAC-seq data.a Optimal values are highlighted in bold and second best values are underlined. a

Table 3 .
The performance of various methods on the CITE-seq datasets.a a Optimal values are highlighted in bold and second best values are underlined.