DCRELM: dual correlation reduction network-based extreme learning machine for single-cell RNA-seq data clustering

Single-cell ribonucleic acid sequencing (scRNA-seq) is a high-throughput genomic technique that is utilized to investigate single-cell transcriptomes. Cluster analysis can effectively reveal the heterogeneity and diversity of cells in scRNA-seq data, but existing clustering algorithms struggle with the inherent high dimensionality, noise, and sparsity of scRNA-seq data. To overcome these limitations, we propose a clustering algorithm: the Dual Correlation Reduction network-based Extreme Learning Machine (DCRELM). First, DCRELM obtains low-dimensional and dense features of scRNA-seq data in an extreme learning machine (ELM) random mapping space. Second, the ELM graph distortion module is employed to obtain dual views of these features, effectively enhancing their robustness. Third, the autoencoder fusion module is employed to learn the attribute and structural information of the features and to merge these two types of information into consistent latent representations. Fourth, the dual information correlation reduction network is used to filter the redundant information and noise in the dual consistent latent representations. Last, a triplet self-supervised learning mechanism is utilized to further improve the clustering performance. Extensive experiments show that the DCRELM performs well in terms of clustering performance and robustness. The code is available at https://github.com/gaoqingyun-lucky/awesome-DCRELM.


Datasets
We constructed comparative experiments on 12 real scRNA-seq datasets to verify the effectiveness of our DCRELM. The detailed information of these datasets is shown in Table 1, where #Cell is the number of cells, #Genes is the number of genes, #Cell types is the number of cell subtypes, and #References is the source of the dataset. We use datasets with small, medium, and large sample sizes, as well as datasets with feature dimensionalities ranging from low to high.
The Python package SCANPY 37 is used to preprocess the scRNA-seq data. Each scRNA-seq dataset is a single-cell gene expression matrix whose rows and columns represent cells and genes, respectively, with every cell having the same number of genes. In the preprocessing step, genes with zero expression values in at least 95% of cells are removed to reduce the impact of uninformative genes on model computation and clustering accuracy, and the data are then normalized so that each gene has a mean of 0 and a variance of 1.
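As a concrete illustration, the sketch below performs this preprocessing with the SCANPY API. The input file name is hypothetical, and the library-size normalization and log-transform are typical intermediate steps included as assumptions; the text itself specifies only the gene filter and the zero-mean, unit-variance scaling.

```python
import scanpy as sc

# Load a cell-gene expression matrix (cells x genes) as an AnnData object.
# "expression_matrix.csv" is a hypothetical input file.
adata = sc.read_csv("expression_matrix.csv")

# Remove genes whose expression is zero in 95% or more of cells,
# i.e. keep genes detected in more than 5% of cells.
sc.pp.filter_genes(adata, min_cells=int(0.05 * adata.n_obs))

# Normalize, log-transform, and standardize each gene to mean 0 and
# variance 1, as described above.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.scale(adata)
```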

Framework of the DCRELM
The overall framework of the DCRELM is illustrated in Fig. 1. The DCRELM consists mainly of five modules: the ELM module, the ELM graph distortion module, the autoencoder fusion module, the dual information correlation reduction module, and the triplet self-supervision strategy clustering module. To better address the high dimensionality and sparsity of scRNA-seq data, first, we use the ELM to obtain low-dimensional and dense features of cells. Second, the graph distortion method is used for data augmentation, while the dynamic autoencoder fusion mechanism is employed to fuse the attribute information of cells and the graph structure information among cells. Third, the dual information correlation reduction network is utilized to remove genes related to low-quality cells and genes with low expression. Last, different types of cells are effectively identified by minimizing the KL loss function of the triplet self-supervised strategy. Table 2 summarizes the notations used in this paper.

Extreme learning machine
ELM [38][39][40] is known for its universal approximation capability and the hidden space created by random nonlinear feature mapping. The ELM is a single-hidden-layer feedforward neural network (SLFN) that randomly assigns input weights ϕ_i and hidden layer biases ζ_i. The input cell-gene matrix is assumed to be X = [x_1, x_2, ..., x_i, ..., x_N] ∈ R^{N×M}, where N is the number of cells and M is the number of genes. The ELM hidden layer output matrix is expressed as follows:

H = T(Xϕ + ζ) ∈ R^{N×M̃},

where M̃ is the number of random mapping nodes, ϕ ∈ R^{M×M̃} and ζ are the randomly assigned input weights and biases, and T(·) is the activation function. In the high-dimensional and sparse feature space of scRNA-seq data, identifying cell clusters is challenging. We utilize the ELM to map the sparse features into a low-dimensional, dense space, alleviating this problem.
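To make the mapping concrete, the sketch below implements the random feature map H = T(Xϕ + ζ) in NumPy. The Gaussian initialization of ϕ and ζ and the tanh activation are our assumptions; the paper does not specify T(·) or the weight distribution.

```python
import numpy as np

def elm_random_mapping(X, n_hidden, seed=0):
    """Map a cell-gene matrix X (N x M) into an n_hidden-dimensional
    random ELM feature space: H = T(X @ phi + zeta)."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    zeta = rng.standard_normal(n_hidden)               # random hidden biases
    return np.tanh(X @ phi + zeta)                     # T(.): activation

# Example: 100 cells x 5000 genes mapped to a dense 500-dimensional space.
X = np.random.rand(100, 5000)
H = elm_random_mapping(X, n_hidden=500)
print(H.shape)  # (100, 500)
```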

ELM graph distortion module
To further improve the generalizability and robustness of the DCRELM, we use the ELM graph distortion module to learn rich representations of cells in multiple ways. We consider two types of distortion in the cell graphs: feature destruction and edge perturbation. Feature destruction is attribute distortion, in which a noise matrix NO ∈ R^{N×M̃} is drawn from the Gaussian distribution N(1, 0.1). The destroyed result matrix H̃ ∈ R^{N×M̃} is given by H̃ = H ⊙ NO, where ⊙ denotes the Hadamard product. Moreover, there are two methods for structural distortion: edge removal based on the similarity between cells, and graph diffusion. First, the pairwise cosine similarity of cells is calculated in the latent space. Second, the lowest 10% of linking relationships are removed, generating a mask matrix MA ∈ R^{N×N} based on the adjacency matrix; the masked adjacency is normalized with the degree matrix D to obtain A_m. In the graph diffusion step, we use the personalized PageRank (PPR) method to transform A_m into a graph diffusion matrix A_d, calculated as

A_d = τ(I − (1 − τ)A_m)^{−1},

where τ is the balance parameter. We employ a siamese network to obtain the feature representations of cells from these two views, enhancing the clustering performance of the DCRELM.
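A minimal sketch of the two distortions follows. The symmetric normalization with self-loops before diffusion and the value τ = 0.2 are assumptions, since the paper does not state them; the closed-form PPR diffusion follows the equation above.

```python
import numpy as np

def distort_features(H, scale=0.1, seed=0):
    """Attribute distortion: Hadamard product of H with noise drawn
    from the Gaussian distribution N(1, 0.1)."""
    rng = np.random.default_rng(seed)
    NO = rng.normal(loc=1.0, scale=scale, size=H.shape)
    return H * NO

def ppr_diffusion(A_m, tau=0.2):
    """Transform a (masked) adjacency matrix into the PPR graph diffusion
    matrix A_d = tau * (I - (1 - tau) * A_norm)^(-1)."""
    n = A_m.shape[0]
    A_hat = A_m + np.eye(n)                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^(-1/2) entries
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return tau * np.linalg.inv(np.eye(n) - (1.0 - tau) * A_norm)
```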

Autoencoding fusion module
As shown in Fig. 2, the autoencoder fusion module obtains the attribute information of cells and the graph structure information among cells via the AE and IGAE 41 and dynamically fuses them to obtain more suitable feature representations. An AE is a multilayer feedforward neural network with ReLU activation. The encoding and decoding of each layer are as follows:

H_AE^{(ℓ),ν_t} = σ_ReLU(H_AE^{(ℓ−1),ν_t} P_1^{(ℓ)} + B_1^{(ℓ)}),  (1)

Ĥ_AE^{(j),ν_t} = σ_ReLU(Ĥ_AE^{(j−1),ν_t} P_2^{(j)} + B_2^{(j)}),  (2)

where ν_t denotes the ν_t-th view, ℓ and j denote the ℓ-th encoder layer and the j-th decoder layer, respectively, P_1 and B_1 are the encoding weight and bias, P_2 and B_2 are the decoding weight and bias, and σ_ReLU is the ReLU activation function.

Figure 1. Framework diagram of the DCRELM. The cell-gene expression matrix X from the scRNA-seq data is selected as the input matrix. The ELM maps the original high-dimensional sparse X into a random mapping space to obtain a low-dimensional dense cell output matrix H. Using a siamese network framework, the attribute information of the cell output matrix H is augmented to obtain H_1 and H_2, while the graph structure information of the cell adjacency matrix A is augmented to obtain A_m and A_d. Then, the fusion encoder in the autoencoder fusion module is employed to extract latent features H^{ν1} and H^{ν2} from H_1 and H_2, and the dual correlation information reduction network is utilized to remove noise and redundant feature information. High-quality cell-gene expression features are obtained by decoding the fusion module. The KL loss function of the triplet self-supervised strategy is minimized to improve clustering performance and effectively identify cell types.

The IGAE is a multilayer graph autoencoder with a nonlinear activation function σ. The encoding and decoding of each layer are as follows:

H_IGAE^{(ℓ)} = σ(Â H_IGAE^{(ℓ−1)} W^{(ℓ)}),

Ĥ_IGAE^{(j)} = σ(Â Ĥ_IGAE^{(j−1)} Ŵ^{(j)}),

where Â is the normalized adjacency matrix, and W^{(ℓ)} and Ŵ^{(j)} represent the learnable parameters of the ℓ-th encoder layer and the j-th decoder layer, respectively.
The IGAE employs a mixed loss function that jointly minimizes the reconstruction errors of the weighted attribute matrix and the adjacency matrix, i.e.,

T_IGAE = T_w + γT_a = (1/2N)‖ÂX − Ẑ‖²_F + γ(1/2N)‖A − Â_rec‖²_F,

where Ẑ is the reconstructed weighted attribute matrix, Â_rec is the reconstructed adjacency matrix, and γ is a predefined hyperparameter. We adopt a dynamic fusion mechanism to integrate the attribute information H_AE of each cell and the graph structure information H_IGAE among cells, i.e.,

H_I = αH_AE + (1 − α)H_IGAE,

where α is a learnable balance coefficient. To fully consider the local and global relationships among cells, first, we introduce the adjacency matrix A into H_I to obtain the embedding feature H_L = ÂH_I for local structure-enhanced fusion. Second, the normalized self-correlation matrix S = softmax(H_L H_L^T) yields the global correlation feature H_G = SH_L, and the final fused representation is H* = βH_G + H_L, where β is a learnable scale parameter.
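The fusion steps can be sketched in PyTorch as follows; alpha and beta stand for the learnable coefficients α and β, and A_norm for the normalized adjacency matrix Â (the names are ours, not the paper's).

```python
import torch
import torch.nn.functional as F

def dynamic_fusion(H_AE, H_IGAE, A_norm, alpha, beta):
    """Fuse AE attribute features and IGAE structure features, then apply
    local and global structure enhancement as described above."""
    H_I = alpha * H_AE + (1.0 - alpha) * H_IGAE  # weighted initial fusion
    H_L = A_norm @ H_I                           # local structure enhancement
    S = F.softmax(H_L @ H_L.T, dim=1)            # normalized self-correlation
    H_G = S @ H_L                                # global correlation feature
    return beta * H_G + H_L                      # fused representation H*
```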

Dual information correlation reduction
We use a dual information correlation reduction (DICR) network to remove redundant information and improve the discriminative ability of the learned embedded features. Specifically, correlation reduction is performed at two levels: sample-level correlation reduction and feature-level correlation reduction.
First, we calculate the cross-view sample correlation matrix S^N ∈ R^{N×N}, where S^N_{ij} is the cosine similarity between the i-th row of H^{ν1} and the j-th row of H^{ν2}, and H^{ν1} and H^{ν2} are the two view embeddings produced by the siamese network. The sample-level loss pushes S^N toward the identity matrix:

T_C = (1/N) Σ_i (S^N_{ii} − 1)² + (1/(N² − N)) Σ_{i≠j} (S^N_{ij})²,

which pulls together the two views of the same sample and pushes apart different samples. Second, feature-level correlation reduction is divided into three steps. (1) The readout function R(·) maps H^{ν1} and H^{ν2} from R^{N×d} to R^{k×d}. (2) The cross-view feature correlation matrix S^F ∈ R^{d×d} is computed, where S^F_{ij} is the cosine similarity between the i-th column of R(H^{ν1}) and the j-th column of R(H^{ν2}), and H^{ν1}_{:j} denotes the j-th column of H^{ν1}. (3) S^F is likewise normalized toward the identity, pulling together the same feature across the two views and pushing apart different features:

T_F = (1/d) Σ_i (S^F_{ii} − 1)² + (1/(d² − d)) Σ_{i≠j} (S^F_{ij})².

Finally, we obtain the latent features H = (1/2)(H^{ν1} + H^{ν2}). Considering information reduction along both dimensions further removes redundant information.
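Since the two reductions share the same form, a single helper can sketch both. The decomposition into diagonal and off-diagonal terms is our reading of the garbled loss above, and the placeholder embeddings stand in for the siamese-network outputs.

```python
import torch
import torch.nn.functional as F

def correlation_reduction_loss(Z1, Z2):
    """Push the cross-view cosine correlation matrix of Z1 and Z2 toward
    the identity: diagonal entries toward 1 (same sample/feature across
    views), off-diagonal entries toward 0 (different samples/features)."""
    S = F.normalize(Z1, dim=1) @ F.normalize(2 * Z2 / 2, dim=1).T  # cosine similarities
    n = S.shape[0]
    on_diag = (torch.diagonal(S) - 1.0).pow(2).mean()
    off_diag = (S - torch.diag(torch.diagonal(S))).pow(2).sum() / (n * n - n)
    return on_diag + off_diag

# Sample level: rows of the two view embeddings (placeholder tensors here).
Hv1, Hv2 = torch.randn(64, 32), torch.randn(64, 32)
T_C = correlation_reduction_loss(Hv1, Hv2)
# Feature level: columns of the (readout) embeddings, so transpose first.
T_F = correlation_reduction_loss(Hv1.T, Hv2.T)
```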

Clustering module
The DCRELM employs a triplet self-supervised strategy to enhance clustering performance, simultaneously leveraging the target distribution to guide the AE and IGAE. We utilize the Student's t-distribution to compute the similarity between the samples and the clustering centres in the fusion embedding H*; this similarity measurement helps capture the relationship between the samples and the clustering centres during clustering. The fusion embedding H* integrates the AE and IGAE information to generate a target distribution. The calculation process is described as follows:

q_{ij} = (1 + ‖h_i − μ_j‖²/υ)^{−(υ+1)/2} / Σ_{j′} (1 + ‖h_i − μ_{j′}‖²/υ)^{−(υ+1)/2},

where υ denotes the degree of freedom of the Student's t-distribution, h_i is the i-th row of H*, μ_j is the j-th clustering centre, and q_{ij} represents the probability of assigning the i-th node to the j-th centre. This probability, referred to as a soft assignment, quantifies the likelihood of the i-th node belonging to the j-th centre. We normalize the frequency of each cluster based on q_{ij} and obtain p_{ij} as follows:

p_{ij} = (q²_{ij} / Σ_i q_{ij}) / Σ_{j′} (q²_{ij′} / Σ_i q_{ij′}).

The distribution Q′ of H_AE and the distribution Q″ of H_IGAE are calculated in the same way as the distribution Q of H*. We adopt the KL divergence and define the triplet self-supervised clustering loss as

T_KL = Σ_i Σ_j p_{ij} log( p_{ij} / ((q_{ij} + q′_{ij} + q″_{ij})/3) ).
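The soft assignment, target distribution, and triplet KL loss can be written compactly as below. The averaging of the three soft assignments inside the KL term follows the reconstruction above, so treat its exact form as an assumption rather than the paper's verbatim formula.

```python
import torch

def soft_assignment(H, centers, v=1.0):
    """Student's t similarity q_ij between embeddings and cluster centres."""
    dist2 = torch.cdist(H, centers).pow(2)
    q = (1.0 + dist2 / v).pow(-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target p_ij, normalized by per-cluster frequency."""
    weight = q.pow(2) / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def triplet_kl_loss(p, q, q1, q2, eps=1e-8):
    """KL divergence between the target p and the averaged soft
    assignments from H*, H_AE, and H_IGAE."""
    q_mix = (q + q1 + q2) / 3.0
    return (p * ((p + eps).log() - (q_mix + eps).log())).sum(dim=1).mean()
```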

Objective function
As shown in Eq. (5), the learning objective function of the DCRELM comprises three main components, which collectively drive the learning process: the reconstruction losses of the AE and IGAE, the DICR module, and the clustering module:

T = (T_AE + T_IGAE) + (T_C + T_F + εT_R) + λT_KL,  (5)

where the DICR module includes the T_C, T_F, and T_R losses, with T_R = JSD(H, AH) aimed at alleviating oversmoothing; JSD(·) refers to the Jensen-Shannon divergence. T_KL is the clustering loss function, and ε and λ are hyperparameters.
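In training code, Eq. (5) amounts to a weighted sum of scalar loss terms optimized in a single backward pass. A minimal sketch follows; the loss values and hyperparameter values here are placeholders, not the paper's settings.

```python
import torch

# Placeholder loss terms for illustration; in practice these come from the
# AE/IGAE reconstruction, the DICR module, and the clustering module.
T_AE, T_IGAE, T_C, T_F, T_R, T_KL = (torch.rand(1, requires_grad=True)
                                     for _ in range(6))
eps, lam = 0.1, 1.0  # hypothetical values for the hyperparameters in Eq. (5)

T = (T_AE + T_IGAE) + (T_C + T_F + eps * T_R) + lam * T_KL  # Eq. (5)
T.backward()  # one backward pass trains all modules jointly
```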

Implementation and parameter settings
We conduct the experiments using PyTorch in a Python 3.8 environment. The number of randomly mapped nodes is selected from {200, 500, 1000, 1500, 2000}. The number of nodes in the first three layers of the AE encoder is selected from {256, 512, 1024, 2048}, and the number of nodes in the last layer is equal to the number of randomly mapped nodes. The number of nodes in the first two layers of the IGAE encoder is selected from {256, 512, 1024, 2048}, and the number of nodes in the last layer is equal to the number of randomly mapped nodes. The DCRELM is trained with Adam for 2000 epochs at a learning rate of 0.0001. All the experiments were conducted on an NVIDIA A40 GPU (48 GB).
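For reference, a minimal sketch of the optimizer settings described above; the linear layer is only a stand-in module, not the full DCRELM network.

```python
import torch

model = torch.nn.Linear(2000, 256)  # placeholder for the full DCRELM network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # settings above

for epoch in range(2000):
    optimizer.zero_grad()
    # loss = total_objective(...)  # Eq. (5), computed as sketched earlier
    # loss.backward()
    optimizer.step()
```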

Evaluation metrics
We use three evaluation metrics, namely, the normalized mutual information (NMI), the adjusted Rand index (ARI), and the F1 score, to measure the clustering performance of the compared methods. The NMI measures the similarity between the clustering results and the true labels, combining the concepts of information entropy and mutual information. The ARI quantifies the agreement between the predicted clusters and the true clusters. The F1 score measures the classification performance of the algorithms.
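All three metrics are available in scikit-learn, as sketched below. The macro averaging for F1 and the cluster-label alignment step are our assumptions, since the paper does not specify how predicted cluster IDs are matched to true labels.

```python
from sklearn.metrics import (adjusted_rand_score, f1_score,
                             normalized_mutual_info_score)

# Toy labels for illustration: annotated cell types vs. predicted clusters.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]

nmi = normalized_mutual_info_score(true_labels, pred_labels)
ari = adjusted_rand_score(true_labels, pred_labels)

# F1 needs cluster IDs matched to true labels first (e.g. via the Hungarian
# algorithm); here the permutation {1->0, 0->1, 2->2} is applied by hand.
aligned = [{1: 0, 0: 1, 2: 2}[c] for c in pred_labels]
f1 = f1_score(true_labels, aligned, average="macro")
print(f"NMI={nmi:.3f}  ARI={ari:.3f}  F1={f1:.3f}")
```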

Comparison of algorithm clustering performance
In this section, we conduct clustering experiments on 12 real scRNA-seq datasets and compare the DCRELM with six state-of-the-art single-cell clustering methods, namely, scDeepCluster 20, GraphSCC 34, scGNN 32, DREAM 31, scDCCA 29, and scDFC 35. Furthermore, we employ three evaluation metrics, namely, the NMI, ARI, and F1, to assess the performance of each method. Tables 3, 4 and 5 show the experimental results of the seven methods on the 12 real scRNA-seq datasets; the best results are highlighted in bold. As shown in Tables 3, 4 and 5, the DCRELM achieves the best clustering performance on most datasets. With the exception of three datasets, the DCRELM has the highest NMI and ranks second in terms of the ARI among all the algorithms. Although the DCRELM is not the highest on the Kolo, WB, or CNIK datasets, it still performs in the top three. In terms of F1, the DCRELM significantly outperforms all the other algorithms. On the Klein and Muraro datasets, the DCRELM exhibits significant improvements in terms of the NMI and ARI compared to scGNN. Overall, the DCRELM outperforms the other methods.
To visualize the clustering results of the seven clustering methods, we choose a smaller-scale real dataset, Lawlor, and a larger-scale real dataset, Klein, and use t-SNE 42 to project the clustering results of each method into two-dimensional space. As shown in Fig. 3, the different cell subtypes predicted by the DCRELM exhibit clearer boundaries and better separation than those predicted by the other six methods.

Parameter analysis
To obtain low-dimensional and dense cell gene expression features, we use the parameter M̃ to control the number of hidden layer nodes. The selection range of M̃ is {100, 200, 500, 1000, 1500, 2000}. Figure 5 shows the effect of M̃ on the NMI, ARI, and F1 of the DCRELM on four datasets: Human, Yeo, Klein, and Muraro. The clustering performance of the DCRELM varies with M̃ across the four datasets; for example, the DCRELM is not very sensitive to M̃ on the Muraro dataset but is sensitive to it on the Human dataset. Therefore, selecting an appropriate value of M̃ plays an important role in the clustering performance of the DCRELM.
To obtain effective attribute and graph structure information of cells, we use the embedding dimension to control the number of nodes in the network layers of the AE and IGAE. The selection range for the embedding dimension is {128, 256, 512, 1024, 2048}. Figure 6 shows the impact of the embedding dimension on the clustering results of the DCRELM on the four datasets. Based on Fig. 6, for datasets with fewer than 1000 samples, the optimal embedding dimension for the AE and IGAE network layers is 256, whereas for datasets with more than 1000 samples, it is 2048. Therefore, the appropriate embedding dimension is related to the sample size of the dataset.

We also conduct ablation experiments to assess the contribution of each module. As shown in Fig. 7, with the dual information correlation reduction module removed, DCRELM-CR cannot effectively remove low-quality cells or genes with low expression; therefore, the NMI, ARI, and F1 of DCRELM-CR are lower than those of the DCRELM. With the dynamic autoencoder fusion removed, DCRELM-DF cannot effectively utilize the fused graph structure information of cells; therefore, the NMI, ARI, and F1 of DCRELM-DF are lower than those of the DCRELM. With feature destruction and edge perturbation removed from the graph distortion module, DCRELM-N and DCRELM-E exhibit lower robustness than the DCRELM.

Conclusion
In this paper, we propose a new deep clustering method, the DCRELM, for scRNA-seq data. The method obtains low-dimensional and dense gene representations through an ELM random mapping space and then uses a graph distortion module to improve the robustness of the model. The dynamic fusion of the dense cell-gene representations with cell attribute information and graph structure information helps establish connections among cells and among genes. We employ dual information correlation reduction to filter out redundant information and noise at both the cell level and the gene level. Additionally, we utilize a triplet self-supervised learning mechanism to further enhance the clustering performance. Extensive experiments demonstrate that the DCRELM outperforms the comparison methods. In the future, we will consider multimodal data clustering, integrating data from different levels to describe the heterogeneity of single cells more comprehensively.

Figure 2. Flowchart of the autoencoder fusion module. The autoencoder fusion module obtains the attribute information H_AE^1 and H_AE^2 of cells and the graph structure information H_IGAE^1 and H_IGAE^2 among cells via the AE and IGAE and fuses these two types of information to obtain more suitable feature representations H* of cells.

The DCRELM consists of five parts: the ELM module, the ELM graph distortion module, the dual information correlation reduction module, the autoencoder module, and the autoencoder fusion module. These five parts have time complexities of O(N * M * M̃), O(N²), O(N² * d), O(N * M̃ * d), and O(N² * d), respectively, where N is the number of cells, M is the number of genes, M̃ is the number of random mapping nodes, and d is the embedding size. Therefore, the total time complexity of the DCRELM is O(N * M * M̃) + O(N² * d) + O(N * M̃ * d), where M̃ and d are much smaller than M. Overall, the DCRELM significantly reduces the dimensionality of the gene representation and can better handle larger-scale scRNA-seq datasets.

Figure 3. Visualization of the prediction results of seven clustering methods on the Klein and Lawlor datasets.

Figure 4. Change in the NMI and ARI for each method on the Klein dataset with 20%, 30%, 40%, and 50% dropout rates.

Figure 5. Impact of the number of random mapping nodes M̃ on the clustering performance of the DCRELM.

Figure 6. Impact of the embedding size on the clustering performance of the DCRELM across four datasets.

Table 1. Characteristics of the experimental datasets.

Table 3. NMI of the DCRELM and six comparison methods on 12 datasets. The optimal values are shown in bold.

Table 4. ARI of the DCRELM and six comparison methods on 12 datasets. The optimal values are shown in bold.

Table 5. F1 of the DCRELM and six comparison methods on 12 datasets. The optimal values are shown in bold.