FSPAM: A Feature Construction Method to Identifying Cell Populations in ScRNA-seq Data

: The emergence of single-cell RNA-sequencing (scRNA-seq) technology has introduced new information about the structure of cells, diseases, and their associated biological factors. One of the main uses of scRNA-seq is identifying cell populations, which sometimes leads to the detection of rare cell populations. However, the new method is still in its infancy and with its advantages comes computational challenges that are just beginning to address. An important tool in the analysis is dimensionality reduction, which transforms high dimensional data into a meaningful reduced subspace. The technique allows noise removal, visualization and compression of high-dimensional data. This paper presents a new dimensionality reduction approach where, during an unsupervised multistage process, a feature set including high valuable markers is created which can facilitate the isolation of cell populations. Our proposed method, called fusion of the Spearman and Pearson affinity matrices (FSPAM), is based on a graph-based Gaussian kernel. Use of the graph theory can be effective to overcome the challenge of the nonlinear relations between cellular markers in scRNA-seq data. Furthermore, with a proper fusion of the Pearson and Spearman correlation coefficient criteria, it extracts a set of the most important features in a new space. In fact, the FSPAM aggregates the various aspects of cell-to-cell similarity derived from the Pearson and Spearman metrics, and reveals new aspects of cell-to-cell similarity, which can be used to extract new features. The results of the identification of cell populations via k-means++ clustering method based on the features extracted from the FSPAM and different datasets of scRNA-seq suggested that the proposed method, regardless of the characteristics that govern each dataset, enjoys greater accuracy and better quality compared to previous methods. feature extraction, fusion.


Introduction
With the advent of the new generation of DNA sequencing method, known as next generation sequencing (NGS), the quantitative and qualitative knowledge of transcriptomes progressed remarkably, through which researchers were able to extract the gene expression of cells completely. The parallel and automatic nature of these new processes such as RNA sequencing (RNA-seq) causes the production of millions of sequences concurrently, resulting in significant amplification of the operational power. Also, sequencing technologies with a high power have considerably reduced the costs of sequencing [Wang, Gerstein and Snyder (2010) ;Nagalakshmi, Waern and Snyder (2010)]. Single-cell RNA-sequencing (scRNA-seq) is known as a novel technology first presented in 2009 [Tang, Barbacioru, Wang et al. (2009)]. This method did not gain popularity until 2014, i.e., when new protocols were developed, and its sequencing costs diminished. ScRNA-seq technology measures the distribution of expression levels for every gene throughout the entire population of cells and allows new biological questions to be studied, where specific cellular changes in transcriptomes are important. For example, one can mention the identification of cell types, heterogeneity of cell responses, confirmation of gene expression, and deduction of gene regulatory networks across the cells. There are several protocols for using scRNA-seq such as SMART-seq2 [Picelli, Björklund, Faridani et al. (2013)], CELL-seq [Hashimshony, Wagner, Sher et al. (2012)], and Drop-seq [Macosko, Basu, Satija et al. (2015)]. In recent years, the use of scRNA-seq has allowed researchers to describe phenotypic heterogeneities observed in certain groups of cells and tissues through cell-by-cell indexing from transcriptome heterogeneity [Pouyan and Nourani (2017)]. One of the important uses of the results of scRNA-seq is identifying cell populations. In some cases, it results in the detection of rare and new cell subsets [Shalek, Satija, Adiconis et al. (2013); Buettner, Natarajan, Casale et al. (2015); Grün, Lyubimova, Kester et al. (2015); Nelson, Mould, Bikoff et al. (2016); Pellegrino, Sciambi, Yates et al. (2016)], which cannot be identified via previously known factors. It can further be employed in various areas such as cancer. For example, single-cell RNA-sequencing has been used for the identification of new cell subsets in the colon [Grün, Lyubimova, Kester et al. (2015)], fetus [Nelson, Mould, Bikoff et al. (2016)], cancer [Patel, Tirosh, Trombetta et al. (2014)], brain [Liu, Nowakowski, Pollen et al. (2016); Tasic, Menon, Nguyen et al. (2016)], pancreas [Segerstolpe, Palasantza, Eliasson et al. (2016); Wang, Schug, Won et al. (2016)], and immune cells [Villani, Satija, Reynolds et al. (2017)]. In spite of the hopes developed in this regard, there are challenges that complicate the analysis of scRNA-seq data. Some of these obstacles include the stochastic nature of the expression of genes, the existence of noise, dropout events, and high dimensions of these data. Nevertheless, in recent years, many attempts have been made to overcome these computational challenges. The proposed method in this paper, which is based on a graph-based Gaussian kernel, extracts a set of high-quality features before clustering through a proper fusion of the Pearson and Spearman criteria in a new nonlinear space. In fact, the FSPAM aggregates the various aspects of cell-to-cell similarity derived from the Pearson and Spearman metrics, and reveals new aspects of cell-to-cell similarity, which can be used to extract valuable features. In summary, it can be said that, the proposed method can be used both for extracting a highquality feature and for identifying accurate cell populations, can be used as a useful tool for analyzing and visualizing scRNA-seq data for bioinformatics researchers. This paper is organized as follows: Section 2 provides a review of related work. Section 3 explains the details of the proposed method. In Section 4, experimental results are reported clearly and, by comparing the results with other state-of-the-art methods, we evaluate the proposed method. Finally, the conclusion and discussion are presented in Section 5.

Related works
In most works related to the identification of cell populations in scRNA-seq data, attempts have been made to perform cell clustering by developing machine learning techniques. Partition clustering methods such as k-means and other distance-based clustering algorithms such as hierarchical clustering have been widely used for identifying cell populations in scRNA-seq datasets. For example, Jaitin et al. combined a hierarchical clustering method and probabilistic hybrid models to classify single-cells of different tissues [Jaitin, Kenigsberg, Keren-Shaul et al. (2014)]. Kiselev et al. [Kiselev, Kirschner, Schaub et al. (2017)] proposed a clustering method called single-cell consensus clustering (SC3), which integrates multiple cluster labels by a consensus approach and can improve cell type identification. SC3 combines all the different clustering outcomes into a consensus matrix that summarizes how often each pair of cells is located in the same cluster. The final result is determined by complete-linkage hierarchical clustering of the consensus matrix into k groups. Žurauskienė et al. [Žurauskiene and Yau (2015)] presented a modified clustering method called pcaReduce for the scRNA-seq data which repeatedly combined the PCA with kmeans to generate a hierarchical tree of cells. This method seeks to establish a connection between the reduced representations given by principal components analysis (PCA) and the number of resolvable cell types (clusters). SINCERA package is another example employing hierarchical clustering, in which the Pearson correlation is used for the similarity criterion, while linkage mean is employed for the linkage method in default settings. This package presents a generally applicable analytic pipeline for processing scRNA-seq data from a whole organ or sorted cells [Guo, Wang, Potter et al. (2015)]. SNN-cliq method presented by Xu et al. [Xu and Su (2015)] uses the shared nearest neighbor (SNN) for defining similarity between data points (cell), which performs clustering according to an algorithm based on the graph theory. This method models data as an SNN graph, with nodes corresponding to data points and weighted edges reflecting the similarities between data points. It then finds the ultimate clustering solution by using graph-theoretic techniques to cluster the sparse SNN graph. Pouyan et al. [Pouyan and Nourani (2016); Pouyan and Kostka (2018)] introduced methods called RAFSIN and RAFSIL, employing a random forest algorithm for clustering cell populations. RAFSIN uses random forests for identifying the dependence of cell markers and modeling cell populations based on the cell network concept. This cellular network helps to discover what types of cells exist in the tissue. RAFSIL method is also an approach based on the random forest for learning cell-to-cell similarities from scRNA-seq data. RAFSIL runs a two-stage method in which the features related to the scRNA-seq data are created after learning the similarities. This method is designed such that it can be adapted and developed, whereby the similarities obtained from RAFSIL can be used in projects associated with data analysis such as dimension reduction, visualization, and clustering.
Li et al. [Li, Zhang, Wong (2019)] introduced clustering methods based on evolutionary multiobjective. They proposed an evolutionary multiobjective ensemble pruning algorithm (EMEP) that addresses those realistic restrictions. The EMEP algorithm first applies the unsupervised dimensionality reduction to project data from the original high dimensions to low-dimensional subspaces; basic clustering algorithms are applied in those new subspaces to generate different clustering results to form cluster ensembles. Also, Li et al. ] provided a multiobjective evolutionary clustering based on adaptive nonnegative matrix factorization (MCANMF) for multiobjective single-cell RNA-seq data clustering. Firstly, adaptive non-negative matrix factorization is proposed to decompose data for feature extraction. After that, a multiobjective clustering algorithm based on learning vector quantization is proposed to analyze single-cell RNA-seq data. The mentioned methods, for precise the identification of cell populations, have used solutions for overcoming some computational challenges associated with scRNA-seq data. One of these challenges is the high dimension of this type of data, i.e., the sheer number of features (genes). The process of reducing the number of features and removing noise from data, which is the outcome of the data dimension reduction process, can significantly improve the ability to separate cell populations [Van Der Maaten, Postma and Van Den Herik (2009)]. To achieve this aim, various methods have been developed trying to visualize scRNA-seq data to identify cell populations through dimension reduction. Most of these methods use well-known tools such as PCA and t-SNE [Van Der Maaten and Hinton (2008)] for this aim. The PCA is considered a linear conversion method of data. Therefore, it may not be practical in many scRNA-seq datasets with a nonlinear nature where one cannot present the gene expression data as a linear combination of interrelationships between two cells. One of the nonlinear techniques that is currently used is t-SNE coupled with Euclidean distance, which can unveil the global structure and obtain many local structures of data with large dimensions. For example, a method called viSNE has been presented for reducing the dimensions of scRNA-seq data, which operates based on t-SNE, and maps high dimension cytometry data to two dimensions while preserving the structure [Amir, Davis, Tadmor et al. (2013)]. Nevertheless, unlike PCA, t-SNE does not learn an explicit map between high and low dimension spaces. This suggests that the points that are close to each other in high dimension spaces will also be close to each other in the low dimensions obtained, while most global relations cannot be interpreted directly [Wagner, Regev and Yosef (2016)]. Note that, according to the developers, t-SNE is a global visualization tool, and not a dimension reduction method, which has not been designed for dimension reduction in the scRNA-seq data. Furthermore, the Euclidean distance criterion which is used as the default distance criterion in most methods functions poorly on the data with high dimensions [Aggarwal, Hinneburg and Keim (2001) ;Beyer, Goldstein, Ramakrishnan et al. (1999)], and may not be useful for scRNA-seq data [Xu and Su (2015)]. Therefore, new and sometimes hybrid machine learning methods and the combination of distance criteria should be used in this type of data so that one can reduce the data dimensions more appropriately to perform a suitable and precise clustering.

Proposed method
One of the ultimate goals of analyzing scRNA-seq data is identifying cell populations. Analyzing scRNA-seq data is a sophisticated procedure faced with issues such as the intrinsic probability of gene expression, existence of noise data, dropout events, and high dimensions. Each of these issues can cause diminished efficiency and accuracy in identifying cell populations. Therefore, they should be prepared for final clustering using preprocessing techniques such as filtering, dimension reduction, and data reconstruction. The input of the process of scRNA-seq data analysis is typically a matrix called the normalized gene expression matrix as × , which has rows and columns. In the mentioned matrix, and represent the number of genes and number of cells, respectively, where the number of genes amounts to tens of thousands of genes, while the number of cells varies between hundreds to millions of cells in some datasets. This sheer number of genes, which is considered as features of the problem, results in excessive dimensions for this type of data. Therefore, the process of their analysis for identifying the cell populations becomes complicated. One of the most important measures taken in identifying cell populations is dimension reduction, where using machine learning techniques, a set of features are extracted which can support separation of cell populations. Considering the intrinsic complexity of scRNA-seq datasets, use of classic machine learning methods may not prove very effective. Therefore, by developing these methods and presenting novel approaches, this complexity should be overcome. The proposed method, called FSPAM here, presents a complete preprocessing step for clustering and identifying cell populations from scRNA-seq data, whose focus and contribution are related to extracting proper features for reducing the dimensions of this type of data. Since typically in scRNA-seq data, one cannot define a linear relationship between the important cellular markers, in FSPAM attempts have been made to overcome this problem through the graph theory. The proposed method, based on a graph-based Gaussian kernel and PCA, extracts a final set of features in three stages with a proper fusion of different correlation criteria. It is indeed a reduced set of markers or highly important genes which can help us in identifying cell populations in the clustering stage. The general procedure employed in the proposed method has been demonstrated in Fig. 1. In the rest of this section, each of the stages of the proposed method is explained thoroughly.

Gene filtering
One of the challenges in scRNA-seq data analysis is the noise propagation which is due to amplification error during the inverse transcription stage in RNA-seq experiments. Noise propagation emerges as excessive growth of zero and close to zero values in the dataset, creating problems in scRNA-seq data analysis. Therefore, typically in the first stage of the analysis of this type of data, a filter is applied to these data so that the features and genes that are most probably noise would be removed from the dataset of interest. The rest of the operations are then performed on the genes with a high degree of importance. Here, we have used frequency filtering (FRQ) [Pouyan and Kostka (2018)] to select the genes, in which we consider only the genes that are expressed in a specific fraction of cells. Specifically, here we experimentally identify and eliminate the genes regarded as noise that have been expressed in less than 5% of all cell samples, and keep the remaining cells as significant features to be used in the subsequent stages.

Computing cell-to-cell distance matrix
In this stage, using the Pearson and Spearman correlation coefficient criteria, the cell-tocell distance matrix is calculated. Each of these matrices shows an aspect of the correlation and cell-to-cell relationship in each dataset. The Pearson correlation coefficient, known as the moment correlation coefficient or zero order correlation coefficient, is used to determine the magnitude, type, and direction of the relationship between two distance or relative variables, or a distance variable and relative variable. It is calculated by the Eq. (1).
where, and are the variables of interest and and are their mean. The closer the absolute value of the correlation coefficient to 1, the stronger the relationship between the two variables is. In contrast, the correlation coefficient close to zero indicates that there is a very weak relationship between and variables. The Pearson correlation coefficient is a parametric method which is typically used for data with a normal distribution or a large amount of data. If we encounter ranked data or abnormally distributed data, the Spearman correlation coefficient is usually used, in which the rank of variables is used to calculate the magnitude of the relationship between two variables. In some way, it can be considered equivalent to the Pearson coefficient nonparametric method. Accordingly, the related equation can be considered as Eq. (1), in which the rank is used instead of the value of a variable [Hauke and Kossowski (2011); Kowalski (1972)].
Since we intend to present a data-driven approach in this paper, which deals with identifying cell populations regardless of any initial assumption, therefore it is assumed that in the proposed method, no previous information is available on the distribution governing the data as well as the number of data, through which one can select the proper correlation coefficient criterion. Thus, in the following, using an efficient method, the affinity matrices resulting from the Pearson and Spearman coefficient are fused and further used for extracting suitable features.

Computing cell-to-cell affinity matrix
PCA is one of the well-known and practical methods for dimension reduction in a linear fashion which tries to represent the covariance structure of a group of variables by a small set of variables. Note that this new set is a linear combination of the initial set. PCA is a method based on analyzing eigenvector decomposition (EVD), which divides the problem into principal components. The main disadvantage of the linear conversion methods such as the PCA is that if data have a nonlinear and more complex structure, this type of methods cannot be useful. One of the solutions for overcoming this problem is applying the kernel function trick. By utilizing kernel functions, one can well calculate the principal components in spaces with high dimensions, where these feature spaces are associated with the input space through a nonlinear mapping.
In kernel-based cases, indeed a linear transformation is learned in Reproducing Kernel Hilbert Space (RKHS) [Mingtao, Zheng and Haixia (2010)]. However, since the kernel reproducing space, such as the Gaussian kernel, has nonlinear statistics from the normal data space, it yields a nonlinear conversion on the initial feature space. Note that in these methods, we do not directly move to kernel reproducing space; rather, we learn transformation as follows through a kernel function which can be applied to any data pair and implicitly in the kernel space (Eq. (2)). ( where, represents the number of initial features, shows the number of samples present in the dataset, denotes the number of final features obtained, and is the kernel function which is defined as Eq. (3). ( where, and show a pair of data samples present in the dataset, and is the kernel function of interest. The graph-based Gaussian kernel function used here receives the cell-to-cell distance matrix and calculates the affinity matrix using the following relation for each data sample based on -nearest neighbors (Eq. (4)). (4) where, ( , ) is the distance between two samples calculated based on one of the criteria such as the Euclidean, Pearson, Spearman, etc. Further, µ is a parameter which is typically adjusted experimentally. Eventually, , refers to a term obtained by the Eq. (5) based on the locality of the -neighbor of each data sample. where, ( ( , )) is the mean distance between the sample and its -neighbor, ( ( , )) represents the mean distance between the sample and its -neighbor, and ( , ) shows the distance between and samples.
Briefly, the introduced graph-based Gaussian kernel, based on the locality of -neighbor of each cell, calculates the affinity matrix from the distance matrix. Through this, the input features' space is transferred to a new space via the nonlinear mapping. Then, using PCA in this new space, the eigenvectors and eigenvalues, which are the principal components, are extracted. Via this technique, one can overcome the linearity of PCA.

Fusion and feature construction
In the previous stages, two aspects of cell-to-cell similarity were obtained using the Pearson and Spearman metrics. In this step, by fusing these criteria, new aspects of this similarity will be discovered, which can lead to the extraction of new features, and help us to identifying cell populations. In this step, the affinity matrices resulting from the Pearson and Spearman coefficients is fused by the similarity network fusion (SNF) as presented by Wang et al. [Wang, Mezlini, Demir et al. (2014)]. Concerning the SNF presented for integrating different types of data on the genome scale, it fuses the affinity networks equivalent to affinity matrices more effectively, such that the resulting network offers a complete view about the essential relationships between the samples (cells). The resulting fused affinity matrix, which is very significant, is used for different purposes. Here, we used it for extracting high-importance features. Generally, the SNF involves two main stages: 1) creating an affinity-sample network for each type of data, and 2) integrating these networks in the form of a single affinity network through a nonlinear and graph-based fusion method. The procedure of SNF involves first calculating the sample-to-sample affinity matrix for each dataset based on an affinity criterion. This matrix is equivalent to an affinity network whose nodes are samples, while the weighted edges represent the extent of similarity of each pair of samples. For the stage of combining networks, the SNF uses a nonlinear approach based on the message sending theory, which frequently updates every network by receiving information from other networks, where each repetition increases the extent of affinity. After several repetitions, the SNF converges to an integrated network. The main advantage of this type of integration method is that the weak affinities, which are the low-weight edges, disappear, thereby supporting noise reduction. On the other hand, the strong affinities or the high-weight edges observed in one or several networks are summed up together, strengthening strong similarities. Also, the low-weight edges supported by all networks are preserved, given the extent of their strong connection to neighbors. Such a nonlinear fusion allows the SNF to use it more completely by integrating the shared and complementary information of a local network structure. In the following, PCA is applied to the affinity matrices, and after achieving the principal components (PC), the best components are extracted as a set of features with high importance (PC i) using the Elbow method [ Thorndike (1953)].This method plots the PCs obtained based on the value and in a descending order on the coordinate axes. The point where the break occurs in the diagram is called the elbow point. The PCs located before the elbow point are kept as the PCs or the best set of features obtained from each stage (Fig. 2). If the Elbow method is not used for choosing more important principal components, we should consider the number of components as a predetermined constant number, which questions the flexibility of the proposed approach. Nevertheless, usage of the Elbow method helps the system to obtain highly important components after the application of PCA given the dataset of interest. This makes our proposed approach flexible and data-driven.
After extracting the PC1, PC2, and PC3 feature sets, the final feature set is obtained by juxtaposing these features and creating the ultimate eigenvector (Eq. (6)).
(6) = 1 ∪ 2 ∪ 3 PCF contains a set of valuable features which can help us in identifying cell populations. Note that this feature set is obtained without any previous knowledge about the distribution governing the data and label of samples. Indeed, it is considered a kind of unsupervised feature extraction approach, which is far more valuable than supervised algorithms. Notably, the set and number of the final features constructed by the FSPAM are different from one dataset to another, where it extracts the best collection of markers in line with the dataset of interest, making the proposed approach flexible and data-oriented.

Clustering
After the feature extraction, in the next step, the clustering process is performed to identify the cell populations in the new space. One of the most popular clustering methods used in the clustering of scRNA-seq data is the k-means method. In this paper, due to the simplicity and high speed, the k-means clustering method has been used. One of the problems of kmeans, however, is its instability, which happens to the random selection of centroids. To overcome this challenge, a method called k-means++ [Arthur and Vassilvitskii (2007)] is presented which partly addresses this problem and provides a more stable clustering algorithm. In this method, first, during an iterative process, by selecting and testing different centroids, the best centroids are identified, after which the standard clustering of k-means is performed using these points. Although it is time-consuming to find these centroids, as it reduces the convergence time of the standard k-means, it will also compensate for that extra time.

Results and discussion
The proposed method, the FSPAM, has been developed by the R language and the experiments were run on an intel Core i7 CPU 2.67 GHz computer with 6 GB RAM. In this section, we first discuss the results of the proposed method from various aspects such as flexibility, accuracy, quality, and stability, and then compare the final results with different and well-known methods in this regard. Before that we briefly review the datasets and evaluation parameters used in this experiment.

Evaluation metrics
To evaluate the quality of the clustering, we used three well-known clustering criteria, i.e., ARI, NMI, and Purity. Each of them is explained briefly further.

Adjusted Rand Index (ARI): assume that we divide cells by clusters, where { } =1
represents the final labels produced by the clustering method. Also, assume that { } =1 reflects the real labels of each cell (correct cell type). Based on the two mentioned definitions, ARI is calculated according to Eq. (7): In this relation, and are the indices referring to clusters. = ∑ ( = ), = ∑ ( = ) , and = ∑ ( = ) , ( = ) . In these relations, ( = ) is the indicator function, whose value is 1 when = ,; otherwise, it is zero. Briefly, if the label of clusters produced by a clustering algorithm fully corresponds to original labels, then the ARI value is 1; otherwise, the ARI value declines in proportion with the inconsistencies that exist.
(8) = ( , )/�ℎ( )ℎ( ) As with ARI, if there is 100% correspondence between and clustering solutions, the NMI value becomes 1. Briefly, the closer the ARI and NMI values related to a clustering method applied to a dataset to 1, the higher the quality of the clustering will be. Purity: the criterion of purity is measured for clusters with a unit class. For its calculation, for every cluster, the number of data points from the typical class is counted in the cluster of interest. Then, all clusters are summed up together and divided by the number of data points (Eq. (9)).

Parameters adjustment
In the feature extraction step, when we convert the distance matrix to a similarity matrix, we use a graph-based Gaussian kernel, which operates according to Eqs. (4) and (5). These relations convert all distance values to similarity values based on a k-nearest neighbor and a parameter µ. Here, we used an empirical method to obtain the best results for which the values were as follows µ = 0.5 and = 40. The results related to the adjustment of these parameters are presented in Fig. 3.

Results
A notable point about the FSPAM is that the proposed method produces a variable and high-quality set of features and principal components in line with each dataset, which helps in more accurate identification of cell populations. This characteristic has been shown in Fig. 4, in which PC1, PC2, and PC3 represent the number of features extracted according to the Pearson, Spearman, and their fusion affinity matrices respectively. Also, PC_F represents the number of final constructed features, which has been extracted as 35, 25, and 67 for Kolod, Buettner, and Usoskin datasets respectively. Also, in order to find the success of the proposed method for the proper fusion of the Spearman and Pearson correlation coefficients, we examine the results of the FSPAM with the results of the Spearman and Pearson correlation coefficients separately. The results show that the use of these coefficients depends on the datasets and their governing distribution, while the FSPAM method obtains the best results, independent of the dominant characteristics of the data (Fig. 5). In the following, we first obtain the clustering results and identify the cell populations using different clustering methods based on the extracted features. Then, we examine the clustering quality of the proposed method by visualizing the data, and finally discuss the stability and robustness of the proposed method.
In the performed implementation, after extracting the set of final features, to identify the cell populations, we tested different methods of clustering including hierarchical (HR), GMM, DBSCAN, and k-means. The obtained results indicated that the k-means clustering offers the best results (Tab. 2). Therefore, for the rest of the work and in the other stages of testing and evaluating the FSPAM, in the clustering stage, a method developed based on k-means called k-means++ has been used. In the next step, we show that the set of features obtained from FSPAM enhances the quality of clustering cells and can visualize cells with a far better quality by the PCA dimension reduction method. For this purpose, we map every dataset twice by the PCA on a two-dimensional space to visualize the cellular space. In the first state, the PCA is directly applied to the original dataset, and all cells are visualized on the resulting two-dimensional space. In this second state, first the FSPAM is applied to the dataset, and then the features of interest are extracted. Next, the PCA is applied to visualize data on these extracted features, with the results summarized in Fig. 6. In every panel of this figure, every point represents a cell, and each color refers to a type of cell. As can be observed, the features extracted by the FSPAM differentiate different types of cells with a far higher quality. Furthermore, to investigate the stability, we replicated the FSPAM 50 times for each of the three mentioned datasets, such that in every replication, 90% of data were chosen randomly.
The results related to the ARI criterion obtained in these replications have been summarized as a box diagram in Fig. 7, in which blue, red, and green boxes represent variations of the results in the three datasets of Kolod, Buettner, and Usoskin, respectively. As can be observed in this diagram, for the dataset Kolod, FSPAM yields the minimum extent of change, where the accuracy of the results fluctuates within the quartile range of zero. In this dataset, only two different values of 0.98 and 0.993 were obtained, while the other results in the other 48 replications were equal to 1. Furthermore, in the other datasets, again favorable results were obtained, such that in Buettner and Usoskin data, the results fluctuated within the quartile range of 0.0522 and 0.0097, respectively. These results suggest that generally the FSPAM enjoys high robustness and stability, and one can rely on the obtained results to a large extent. Finally, for the quantitative assessment related to the clustering quality, we used the kNN classification method in the resulting two-dimensional space, whereby the resulting classification error (the rate of wrongly classified cells or NNE) has been calculated as the overall error of mapping. Since the kNN algorithm is dependent on k parameter, the number of nearest neighbors, we computed the resulting two-dimensional mapping error per different values of k (3,5,7,9), with the results being summarized in Tabs. 3 and 4. As can be observed, the NNE error when using the features extracted by the FSPAM has been far lower than the case when the PCA has been directly applied to the scRNA-seq gene expression matrix. This suggests the success of the proposed method in constructing the valuable features and reducing the dimensions of the problem.

Evaluation of the proposed FSPAM method with other methods
In this section, we compared the FSPAM with three traditional clustering methods including k-means, GMM, and hierarchical clustering method (HCLUST) as well as new clustering datasets including the SINCERA [Guo, Wang, Potter et al. (2015)], SNN-Cliq [Xu and Su (2015)], and pcaReduce [Žurauskiene and Yau (2015)], which have been designed for scRNA-seq data. The FSPAM and other six mentioned methods were applied to three datasets of scRNAseq in Tab. 1, where for each method, the ARI, NMI, and Purity were calculated separately. Fig. 8 summarizes the comparison of the obtained results. Also, its details are provided in Tabs. 5, 6, and 7, representing the high accuracy of the proposed FSPAM method. As shown in Tab With regard to the results obtained in this section and previous sections, it can be seen how the FSPAM method can extract valuable and, of course, proportional features to each scRNA-seq dataset, with a high accuracy and quality to identify cell populations.

Conclusion
In this paper, we dealt with identifying cell populations from scRNA-seq data. The analysis of this type of data has some challenges including the existence of noise, dropout events, and their high dimensions. Our proposed method, which we called, the fusion of the Spearman and Pearson affinity matrices, FSPAM, was an unsupervised method according to the graph-based Gaussian kernel. It extracted a suitable feature set for every scRNA-seq dataset without any previous knowledge about the type of cells and through proper fusion of the affinity matrices resulting from the Spearman and Pearson correlation criteria. They were used as valuable markers in the clustering process to identify cell populations. The results on three different datasets indicated that the set of features obtained from the FSPAM enhanced the quality of clustering cells and could visualize cells with a far better quality by the PCA dimension reduction method. The notable point about the FSPAM was the variable set of features extracted from the feature construction step in line with each dataset, making the FSPAM a data-driven approach. Also, to investigate the stability, we replicated the FSPAM 50 times for each of the three mentioned datasets. The results indicated that generally the FSPAM enjoys great robustness and stability, and one can rely on the obtained results to a large extent. Finally, for quantitative assessment related to the clustering quality, we used the kNN classification method in the resulting two-dimensional space. When using the features extracted by the FSPAM, NNE error was far lower than in the case when the PCA was directly applied to the scRNA-seq gene expression matrix. Also, to evaluate the proposed FSPAM method, we examined it against other well-known methods and from different aspects. To do this, we used three methods of classical clustering (k-means, HCLUST, and GMM) and three state-of-the-art methods presented to identify cell populations in the scRNA-seq data (pcaReduce, SINCERA, and SNN-Cliq). The results revealed that through a valuable feature set tailored to any data, the FSPAM method can be very accurate in clustering and identifying cell populations.
In summary, we can say that in this paper a method called FSPAM was presented which can be used both for extracting a variable set of valuable features and for identifying accurate cell populations. Indeed, the proposed method, which is an accurate, quality, and stable approach, can be used as a useful tool for analyzing scRNA-seq data for bioinformatics researchers in this area.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.