A clustering method for small scRNA-seq data based on subspace and weighted distance

Background Identifying the cell types using unsupervised methods is essential for scRNA-seq research. However, conventional similarity measures introduce challenges to single-cell data clustering because of the high dimensional, high noise, and high dropout. Methods We proposed a clustering method for small ScRNA-seq data based on Subspace and Weighted Distance (SSWD), which follows the assumption that the sets of gene subspace composed of similar density-distributing genes can better distinguish cell groups. To accurately capture the intrinsic relationship among cells or genes, a new distance metric that combines Euclidean and Pearson distance through a weighting strategy was proposed. The relative Calinski-Harabasz (CH) index was used to estimate the cluster numbers instead of the CH index because it is comparable across degrees of freedom. Results We compared SSWD with seven prevailing methods on eight publicly scRNA-seq datasets. The experimental results show that the SSWD has better clustering accuracy and the partitioning ability of cell groups. SSWD can be downloaded at https://github.com/ningzilan/SSWD.


INTRODUCTION
Single-cell RNA-sequencing (scRNA-seq) technologies capture cellular heterogeneity between single cell, which allows researchers to dissect complex biological samples with detailed information about the transcriptome, thereby changing our understanding of biological systems (Tang et al., 2009;Jaitin et al., 2014;Praktiknjo et al., 2020). Identifying the cell types is essential in analyzing scRNA-seq data, and the quality will directly affect downstream analysis in single-cell (Kharchenko, Silberstein & Scadden, 2014). Unsupervised clustering is one of the most widely used methods for identifying cell groups in scRNA-seq data (Ji & Ji, 2016;Žurauskiene & Yau, 2016;Kiselev, Andrew & Hemberg, 2019;Peyvandipour et al., 2020;Qi et al., 2020). However, high dimensional, noise, and dropout characteristics of scRNA-seq data present traditional clustering methods with a challenge (Elowitz et al., 2002;Stegle, Teichman & Marioni, 2015). Therefore, it is important to develop efficient and reliable clustering algorithms to identify cell groups.
Recently, many novel clustering methods have been developed for identifying cell groups of scRNA-seq data. Most of them focus on computing more accurate and robust similarity measures between cells (Taiyun et al., 2018;Peng et al., 2020). Single-cell Interpretation via Multi-kernel LeaRning (SIMLR) (Wang et al., 2017) chooses the most appropriate distance measure through multiple kernel learning and uses k-means to determine the cell groups. Seurat (Satija et al., 2015;Butler et al., 2018) and SNN-Cliq (Xu & Su, 2015) are graph-based clustering methods. Seurat constructs a k-nearest neighbor (KNN) graph with Euclidean distance in PCA (Jolliffe, 2002). SNN-Cliq combines a previously developed clustering algorithm with an SNN-based similarity measure, which determines cell groups automatically but requires three parameters to be specified. SC3 (Kiselev et al., 2017) employs consensus clustering to merge the clustering results under Euclidean distance, Pearson's correlation, and Superman's correlation to improve performance. However, SC3 is not scalable (Kiselev, Andrew & Hemberg, 2019). Besides, nonnegative matrix factorization, imputation, dimensionality reduction-based methods, and mixture model ensemble have been used to assess cellular heterogeneity (Grün et al., 2015;Lin, Troup & Ho, 2017;Shao & Höfer, 2017;Yang et al., 2019;Huh et al., 2020;Venkatasubramanian et al., 2020).
Subspace clustering is an efficient technique to mitigate noise applied in various fields (Chen, Nasrabad & Tran, 2011;Ekström & Hagen, 2019). SinNLRR (Zheng et al., 2019) considers cell clustering as a sparse subspace clustering (SSC) problem and uses the multiplier with an alternating direction to solve the optimization problem. S3C2 (Zhuang et al., 2021) combines enhanced SSC and low-rank completion algorithms in an optimization framework. DSCD (Wang et al., 2020) discovers the low dimensional latent structure from the compressed representation in scRNA-seq data and learns global relationships in single cells via a novel self-expressive denoise layer.
Highlighted by previous methods, calculating the similarity (distance) matrix of cells and reducing noise interference are crucial in clustering. This paper proposed a clustering method for small ScRNA-seq data based on Subspace and Weighted Distance (SSWD), which assumed that sets of gene subspace composed of similar gene kernel density distributing genes could distinguish cell groups better. We proposed a new distance metric EP_dis, which integrates Euclidean and Pearson distance through a weighting strategy. Furthermore, we used the relative Calinski-Harabasz (RCH) index to determine the cluster numbers instead of CH because of its advantage of comparability in degrees of freedom. SSWD also included a consensus clustering process. Each of the gene subspace's clustering results was summarized using the consensus matrix integrated by PAM clustering. We applied the SSWD to eight public scRNA-seq datasets and contrasted it with seven widespread scRNA-seq clustering methods. The results show that SSWD reduces the influence of noise in clustering and better captures intrinsic relationships among cells or genes, which has greater clustering accuracy and the partitioning ability of cell groups.

UCI datasets
Six real datasets (Table 2) from UCI (University of California Irvine) (https://archive.ics. uci.edu/) were used to validate the performance of RCH.

scRNA-seq datasets
We downloaded eight scRNA-seq datasets from GEO (https://www.ncbi.nlm.nih.gov/geo/) to validate the effectiveness of SSWD, for which the cell types were declared in the original publications. These datasets, including human and mouse species, involve various tissues and biological processes, such as cell development and differentiation, using different unit counts, e.g., RPKM and FPKM. Specifically, Biase, Cao & Sheng (2014), Yan et al. (2013), andQiaolin et al. (2014) consist of transcriptomes of human/mouse cells in embryos at some crucial developmental stages. Treutlein et al. (2014) contains 201 cells in four developmental stages of mouse lung epithelial cells. Patel et al. (2014) contains 430 glioblastoma cells from five patients. Li et al. (2016) is a human islet cell dataset, which contains alpha (n = 18), beta (n = 12), pp (n = 9), acinar (n = 11), and ductal (n = 8) cell subtypes. Tian307 and Tian305 (Tian et al., 2019) include lung adenocarcinoma cells from five patients. The detailed description of the datasets is listed in Table 3.

The improved k-means algorithm with EP_dis and relative CH (RCH)
The k-means is a widely used clustering algorithm (MacQueen, 1967;Jain, Murt & Flynn, 1999;Jain, 2008). The algorithm requires the user to provide cluster initialization, distance metric, and the cluster numbers as the parameters (Chiang & Mirkin, 2010). Here we designed an improved k-means algorithm by introducing the EP_dis and RCH, which measure the similarity between two cells more appropriately and can automatically determine the cluster numbers.

EP_dis metric
Euclidean distance (E) is the most commonly used distance metric in traditional k-means, it characterizes the global correlation in high-dimensional space between samples. However, it will lose the correlation information between samples (cells or genes) when they have the same trend (Taiyun et al., 2018). Pearson distance (P) is another commonly used distance metric in clustering, which can captures the locally variable trend between samples (cells or genes), where P = (1 − R), and R is the Pearson correlation coefficient (Fulekar, 2009).  300 (60, 70, 80, 90) samples and 10 features. i,j, and k represent the cluster id, the feature id and the sample id, respectively. ξ : the random error.

D1
D2  Here, we combined Euclidean and Pearson distances through a weighting strategy and defined a new distance, EP_dis metric (Ning et al., 2022). It was defined as follows: A bigger EP_dis shows a weaker similarity between samples. If w = 0, EP_dis is Pearson distance; if w = 1, it is Euclidean distance. w is the weight, and it ranges from 0 to 1. The matrix E and P must be min-max normalized when calculating EP_dis because the range of E and P are different. Take the maximum SS B /SS W as the standard, and a step-by-step search determines the suitable w in EP_dis. Where SS B = k i=1 n i ||c i −c|| 2 represents the sum of squares between clusters and SS W = k i=1 n j j=1 ||x j − c i || 2 represents the sum of squares within clusters; k represents the cluster numbers; n i (n j ) represents the sample numbers in cluster x i N is the overall mean; N is the sample numbers. We adopt the maximum technique (Fränti & Sieranoja, 2019) to obtain the cluster's initial centroids to ensure clustering stability.

Determine the number of clusters
The k-means algorithm needs to be specified the number of clusters. The clustering internal validation (CIV) indices, such as the Calinski-Harabasz (CH) index (Caliński & Harabasz, 1974), Silhouette (Sil) index (Rousseeuw, 1987), and Gap Statistic (Tibshirani & Hastie, 2001), can be used for estimating the cluster numbers. The CH has been proven the best in estimating cluster numbers (Milligan & Cooper, 1985;Chiang & Mirkin, 2010). It is defined as: where N as the sample numbers, NC as the largest cluster numbers. The k with the maximum CH is the suitable cluster numbers. In different k, the CH value is incomparable because the degrees of freedom vary. So, we designed a new index, relative CH (RCH) (Ning et al., 2022), that was relatively comparable under different k: .
The workflow of the improved k-means algorithm with EP_dis and RCH is shown in Fig. 1.

The overview of the SSWD
In the scRNA-seq data matrix X G×N = {x ij |1 ≤ i ≤ G,1 ≤ j ≤ N }, rows represent genes, and columns represent cells. x ij represents the value of gene i in the j th cell. The framework of SSWD is depicted in Fig. 2.
Step 1 filtering genes Since rare and ubiquitous genes provide insufficient information for clustering, we only retained the v genes (default: 1,000) with the highest variance after log-transformed. Specifically when the maximum value in X is greater than 10,000, X = log10(X + 1), otherwise X = log2(X + 1). The gene subset X was the input of the second module.
Step 2 partition genes with subspace In scRNA-seq data, the sets of subspace represent the groups of genes. The gene subspace with similar density genes can distinguish informative features from noise (Song et al., 2021). We used the function density in R to calculate the gene's density. Specifically, the kernel density function scattered the density of genes over a regular grid of 512 points and convolved this approximation with the discretized kernel version using a fast Fourier transform. Then the function used the linear approximation to evaluate the density at each point (Sheather & Jones, 1991). In the density matrix E G ×512 , column and row represent the density values and gene, respectively. The improved k-means algorithm with EP_dis and RCH was employed to group the genes with similar density in matrix 'E'. Then, the

Step 3 cell clustering in subspace
The sets of gene subspace containing more than three genes have been kept. Then, we used PCA for dimensionality reduction and retained the first d-dimension with the Elbow method (Thorndike, 1953). Then the improved k-means algorithm with EP_dis and RCH was employed to get the sets of gene subspace clustering results Y subspace i .
Step 4 consensus clustering The cluster-based similarity partitioning algorithm (CSPA) was used to compute the consensus matrix M (Strehl & Ghosh, 2002).
based on the clustering results from the sets of gene subspace. The num is the number of subspaces where cells i and j are in the same cluster. If num =0, cell i and j are never in the same subspace. Because the M was a discrete matrix, the improved k-means with EP_dis and RCH were unsuitable, but the PAM algorithm did. PAM is a variation of the  k-means clustering algorithm, which uses the median of data points rather than the mean and minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distance as the objective function (Park & Jun, 2009). The PAM is more robust to noise and outliers than k-means. Then, we used the Sil index (Rousseeuw, 1987) to estimate the cluster numbers (cell groups).

Time complexity of SSWD
The main time-consuming step of SSWD is clustering by the improved k-means with EP_dis and RCH. In step 2 (see The overview of the SSWD), we used the improved k-means algorithm in the density matrix. We denoted n represents the sample numbers, m represents the feature numbers, k represents the cluster numbers, NC represents the range of cluster numbers, l represents the iteration numbers to determine the cluster centers, and w_step represents the search step. Since the k<<n, NC<<n, the step 2 time complexity holds about O(lmn). In step 3, each subspace would be performed the improved k-means algorithm after PCA. We denoted d as the retained dimension after PCA, and s is the number of genes subspace. The SSWD time complexity has roughly O(lmn +lnds). Since d<<m, s<<m, we can simplify the time complexity of SSWD to approximately O(lmn).

Biological insights
We transformed the clustering results of each cell group into ''one-against-the-rest''. Then, we executed the Wilcoxon rank-sum test for each gene between the expression value and the binary cluster, adjusting the p-value based on FDR. The gene that adjusted p-value<0.001 was preserved as the differential gene. Next, we used the AUC score to evaluate the performance of genes in distinguishing different cell types. Since AUC was only suitable for dichotomous problems, we constructed a binary classifier based on the mean expression value of each gene and compared the processed values with the binary cluster value. We defined the genes with AUC >0.85 and p-value<0.001 as marker genes.

Evaluation metrics
Two external validation indices, ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information) were used to evaluate the effectiveness of clustering methods. ARI (Hubert & Arabie, 1985) is a widely used external validation index in clustering, and it is defined as follows: where R and C are published and predicted clusters, respectively. The overlap of samples between R and C can be generalized into a contingency table. n ij is the times a sample occurs in the ith cluster of R and the jth cluster of C, a i is the sum of the ith row in the contingency table, b j is the sum of the jth column in the contingency table, and (.) represents the binomial coefficient. NMI (Strehl & Ghosh, 2002) is defined as follows: where n is the probability that a cell belongs to both the ith cluster in R and the jth cluster in C. The range of ARI and NMI are [0, 1]. The larger ARI (NMI) represent a better performance of clustering.

Reference methods
In this article, seven prevailing clustering algorithms were introduced as reference methods. The SILMR and SNN-Cliq used the same log transformation as this paper. SinNLRR and S3C2 used the correct cell groups for clustering. Unless specified, the default parameters in the program were used as suggested in the original paper.

Performance evaluation and comparison with reference methods
We compared the performance of SSWD with seven prevailing clustering methods in eight scRNA-seq datasets ( Table 4). The SSWD achieved the best clustering performance with an average ARI of 0.791 and was 0.143 higher than the second-ranked SC3, whereas the SNN-Cliq had poor performance (ARI of 0.364). SSWD ranked in the top three for ARI on all other datasets except Yan. SSWD attained the best results for NMI in three datasets (Li, Tian305, Tian307) and the second-best in four datasets (Biase, Yan, Deng, Treutlein). The average NMI of SSWD was the highest (0.850). Seurat had the poorest performance with only 0.579 in NMI because it failed on Biase, and the NMI of Li was only 0.122.
We further demonstrated the SSWD performance by ranking clustering accuracies on eight datasets (Fig. 3). For ARI (Fig. 3A), SSWD was superior to the seven reference methods in rank-wise (median of 2). SNN-Cliq performed the worst, with a median of 7. For NMI (Fig. 3B), SSWD was also superior to others, and the performance of CIDR, Seurat, and SNN-Cliq was all poor. Furthermore, the one-sided Wilcoxon signed-rank test was used to explain the statistical difference between SSWD and the reference methods. Except for SC3 and SinNRLL (in ARI), all the p-value are less than 0.05, which shows SSWD is superior to other methods ( Table 5).
The SSWD was also better for estimating the cell groups. Five out of eight datasets (Biase,Patel,Li,Tian307,Tian305) acquired the correct cell groups using SSWD. Especially for Tian307 and Tian305, only the SSWD estimated the correct cell groups and achieved the best ARI of over 0.948. Deng contains seven cell groups, for which all methods failed to identify the correct number of cell groups. For Treutlein, CIDR and Seurat estimated the correct cell groups, but the ARI (0.188 and 0.531) and NMI (0.304 and 0.648) were lower than those of SSWD (0.607 and 0.732). For Yan, the SinNRLL and S3C2 performed very well under the correct cell groups.

Annotate the clusters
We illustrated the effectiveness of the cell annotation using PanglaoDB (Oscar, Li-Ming & Johan, 2019) to clusters taking the Li dataset as an example. Li is a human pancreatic islet cells dataset containing five subtypes (alpha, beta, pp, acinar, and ductal) (Li et al., 2016). According to the AUC score (see Biological insights), we obtained the marker genes for each cluster identified by SSWD. Figure 4 is the expression heatmap of the top 10 marker genes for each cluster, which was divided into five clear modules and indicated that these marker genes could distinguish the clusters well. The keration8 (KRT8) in cluster 1; transthyretin (TTR), glucagon (GEG) in cluster 2; insulin (INS) in cluster 3; pancreatic polypeptide (PPY ) in cluster 4; REG1B, REG1A, CTRB2 in cluster 5 were all reported in the original publication. We also annotated the cluster with PanglaoDB. The cluster Table 4 The performance of SSWD. -The method that fails in clustering. () The actual cell groups have been provided as prior parameters. The best accuracy and the correct number of clusters (cell groups) are marked as bold for each dataset.

Datasets
Actual cell groups results annotated with PanglaoDB are consistent with the cell annotations in the original publication (Table 6).

Role of the EP_dis metric
EP_dis was used in SSWD to assess the similarity between cells or genes. According to the EP_dis definition (see Materials & Methods), the optimal w was determined by SS B /SS W using a search strategy. When w = 1, the EP_dis equals the Euclidean distance; when w = 0, it is the Pearson distance. We used two simulated datasets, D1 and D2, to display the impact of EP_dis and explain the process of optimizing w by SS B /SS W . Figure 5 shows the clustering accuracy of D1 (Fig. 5A) and D2 (Fig. 5B) Fig. 3 The performance ranking of seven clustering methods on eight datasets. Each method is ranked according ARI(A) and NMI(B) for eight datasets. Lower rank represents better performance. Ties are replaced by the mean of their ranks. Figure 3 The ranking performance of eight clustering methods on eight datasets. Each method is ranked according to ARI (A) and NMI (B) for eight datasets. A lower rank represents better performance (1 is the best and 8 is the worst). Ties are replaced by the mean of their ranks.
Full-size DOI: 10.7717/peerj.14706/fig-3 Table 5 The results of the Wilcoxon signed-rank test conducted on SSWD versus the reference algorithms. The p-value ( < 0.05) indicates the significant difference between SSWD and the reference algorithms. endpoints (0.6 in D1 and 0.8 in D2), which indicates that the EP_dis could capture more information between samples.

Role of the relative CH
The RCH in the improved k-means algorithm was used to determine the cluster numbers. In SSWD, we employed RCH to estimate the gene subspace numbers and guide each set of gene subspace grouping. The capability of the RCH directly affects the performance of the SSWD. We utilized simulated datasets D3-D6 with different characteristics, six UCI datasets, and three scRNA-seq datasets to illustrate RCH properties and compare them with CH (Table 7). We can see that CH and RCH were consistent in D3-D6, indicating their good performance in simulated datasets. In the UCI datasets, RCH could estimate the correct cluster numbers except for Dermatology and Yeast, but the corresponding cluster numbers estimated by the RCH was closer to the real value than those of CH. For the scRNA-seq datasets, RCH and CH all failed. Their poor performance may be due to the characteristics of scRNA-seq data. Nonetheless, the RCH result was closer to the true value.

Role of the subspace
After performing steps 1 and 2 of SSWD (see ''Materials & Methods''), the Li has been separated into eight sets of gene subspace, and seven participate in consensus clustering (Fig. 6, Fig. S2). The expression heatmaps of the best three sets of genes subspace display clear patterns ( cells with similar expression patterns. Compared with the EP_dis heatmap by 1,000 genes ( Fig. 6G), the consensus matrix using sets of genes subspace (Fig. 6H) enhances intercellular signaling. The consensus matrix clustering result was better (ARI of 0.967, NMI of 0.964) than the former (ARI of 0.386, NMI of 0.579) because the former could not distinguish alpha and pp cells well.

Discussion of prevailing methods
We also provided further discussion in Tables 4-5. The average performance of SC3 was only lower than SSWD, and its results were not significantly different from SSWD in the one-sided Wilcoxon signed-rank test. The SC3 combined multiple similarity measures (Euclidean, Pearson, Spearman) in clustering. It used the consistency matrix to integrate the multiple clustering results, and the consistency matrix strengthened the consensus signal between cells. At the same time, we can see that Deng and Treutlein, the best accuracy performers in SC3, could not obtain the correct number of cell groups. Although Biase estimated the correct number of cell groups in SC3, one cell was classified mistakenly, while SinNRLL could classify all cells accurately. Both SinNRLL and S3C2 introduce the idea of subspace clustering. Their average performances were better than other methods except for SSWD and SC3. However, this result was based on the cell group numbers being provided. Evaluating the cluster numbers is an important aspect of clustering methods. Although SinNRLL could estimate the cluster numbers by other methods, its accuracy is still unsatisfactory (Zheng et al., 2019). SNN-Cliq performed the worst (ARI = 0.364, NMI = 0.624), with none of the seven datasets estimating the correct number of cell groups. SNN-Cliq tended to divide more clusters, probably because the method requires providing three suitable parameters, and the results depend on the graphical representation of the data. CIDR used an implicit imputation approach to reduce the impact of dropout in scRNA-seq and used CH to estimate the cell groups. The method determined the correct number of cell groups in Treutlein and Tian307, but their clustering accuracies were poor. SIMLR adopts a multi-kernel strategy to adaptively select an appropriate distance metric and automatically determine the cell groups. However, this method achieved good performance only in Patel because it used Euclidean distance as the metric to construct a Gaussian kernel function (Taiyun et al., 2018). For Seurat, Biase failed, and the ARI of Li was only 0.084. The results show that Seurat may be unsuitable for small datasets, consistent with the literature (Kiselev, Andrew & Hemberg, 2019). The SSWD had the best performance in experiments. However, the performance of Patel and Yan were mediocre. Although Patel estimated the correct cell groups, the clustering accuracies were only ranked the third (in ARI) and the fourth (in NMI), probably because there were negative values in Patel datasets. All methods failed to estimate the correct cell group numbers in Yan. The poor performance of Yan in SSWD was because the estimated cell groups was far from the actual numbers.
We can draw the following conclusions from the above observations: (1) Due to the complex structure of scRNA-seq data, developing an optimal clustering method for all situations is impossible. (2) Determining the cluster numbers is difficult, so assigning cells to appropriate types is more important. (3) Selecting suitable similarity measures and using subspace in single-cell clustering help obtain better clustering results.

CONCLUSIONS
The identification of cell types is a fundamental problem in scRNA-seq data analysis. In recent years, many clustering methods have been proposed. Most of them focus on computing more accurate and robust similarity measures between cells. However, conventional similarity measures are encountering challenges to single-cell data clustering because of the high dimensional, high noise, and high dropout. This study proposed a clustering method for small scRNA-seq data, named as SSWD, based on sets of gene subspace and weighted distance. Firstly, an improved k-means with EP_dis and RCH was applied to divide sets of gene subspace with similar density distributions, which better identify distinct cell groups. Secondly, cell clustering was performed in these sets of gene subspace. Lastly, the ensemble clustering with PAM was conducted on the consensus matrix composed of gene subspace clustering results. The results of eight scRNA-seq datasets showed that SSWD could effectively reduce the influence of noise in clustering and better capture the intrinsic relationship between cells or genes, thereby achieving more robust and accurate clustering results.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the Natural Science Foundation of Hunan Province (2021JJ30351), the Scientific Research Project of Hunan Provincial Department of Education (21B0187), and the Special Funds for Construction of Innovative Provinces in Hunan Province (2021NK1011). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.