Multi-Objective Optimized Fuzzy Clustering for Detecting Cell Clusters from Single-Cell Expression Profiles

Rapid advance in single-cell RNA sequencing (scRNA-seq) allows measurement of the expression of genes at single-cell resolution in complex disease or tissue. While many methods have been developed to detect cell clusters from the scRNA-seq data, this task currently remains a main challenge. We proposed a multi-objective optimization-based fuzzy clustering approach for detecting cell clusters from scRNA-seq data. First, we conducted initial filtering and SCnorm normalization. We considered various case studies by selecting different cluster numbers (cl = 2 to a user-defined number), and applied fuzzy c-means clustering algorithm individually. From each case, we evaluated the scores of four cluster validity index measures, Partition Entropy (PE), Partition Coefficient (PC), Modified Partition Coefficient (MPC), and Fuzzy Silhouette Index (FSI). Next, we set the first measure as minimization objective (↓) and the remaining three as maximization objectives (↑), and then applied a multi-objective decision-making technique, TOPSIS, to identify the best optimal solution. The best optimal solution (case study) that had the highest TOPSIS score was selected as the final optimal clustering. Finally, we obtained differentially expressed genes (DEGs) using Limma through the comparison of expression of the samples between each resultant cluster and the remaining clusters. We applied our approach to a scRNA-seq dataset for the rare intestinal cell type in mice [GEO ID: GSE62270, 23,630 features (genes) and 288 cells]. The optimal cluster result (TOPSIS optimal score= 0.858) comprised two clusters, one with 115 cells and the other 91 cells. The evaluated scores of the four cluster validity indices, FSI, PE, PC, and MPC for the optimized fuzzy clustering were 0.482, 0.578, 0.607, and 0.215, respectively. The Limma analysis identified 1240 DEGs (cluster 1 vs. cluster 2). The top ten gene markers were Rps21, Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Khk, Rps3a1, Aldob and Rps17. In this list, Khk (encoding ketohexokinase) is a novel marker for the rare intestinal cell type. In summary, this method is useful to detect cell clusters from scRNA-seq data.


Introduction
Rapid technology development in sequencing over the last two decades has made the transcriptomic analysis of cells and tissues more reliable and informative [1]. Cells are basic units of organisms and the building blocks of various complex tissues; they are controlled by many factors that affect their cell status and features (e.g., cell type specific expression, senescence). Quantification of the mRNA transcripts in genome-wide basis is useful to characterize the molecular circuitries as well as cellular states. In general, such datasets are accumulated with higher spatial resolution, whereas the single-cell RNA sequencing (scRNA-seq) permits to conduct the transcriptome-wide analyses of single cells to discover the interesting biomedical insights as well as biological perception [1,2]. As a heterogeneous cell population, scRNA-seq shows the levels of gene expression for each individual cell, while in bulk-tissue RNA sequencing, mean value of the expression signature in the basis of their cell population level has been evaluated. ScRNA-seq needs the isolation as well as lysis of the single cells, the transformation of their corresponding RNA to cDNA, and the amplification of the cDNA to produce the high-throughput sequencing libraries.
Here one important factor is evaluation of sensitivity of the scRNA-seq technique i.e., the probability for capturing and transforming a corresponding mRNA transcript that presents in a single cell to a cDNA biomolecule that present in the library. Another important factor is evaluation of the accuracy, i.e., how good the read quantification correlates with the actual expression of the mRNA. The third factor is the evaluation of precision in which the amplification happens, i.e., the technical variation related to the quantification. Moreover, the integration of precision, sensitivity, and the number of cells that have been used in analysis, evaluates the power for identifying the relative differences in the levels of expression. To choose a proper and efficient method among all the state-of-the-art scRNA-seq techniques, it requires to measure those parameters accurately. Major advantages and limitations of various scRNA-seq techniques have been well discussed in the field. In addition, Smart-seq technique was optimized for the full-length coverage, sensitivity, cost as well as accuracy [3]. Next, the enhanced version of this technique (enhanced Smart-seq2 technique) was developed by Picelli et al. in 2014 [4] that is also useful in various works [5][6][7].
It is well-known that the characterization of entire cell types in any complex tissue needs the processing of at least a couple of thousand single cells [8]. Larger sample size would make the evaluation better. It is true that a larger number of cell type-related transcripts are not identified in current scRNA-seq because of the failure during the stage of amplification and the relative limitation of short read coverage. Consequently, a limited number of cell type-related genes might fail to affect the downstream analysis regime in sufficient way. Current discovery of droplet-based single-cell transcriptomics is helpful to perform the parallel profiling of the tens of thousands of the single cells at significantly low expense per cell. Various studies within the range of the transcriptomes of cells from 20 k and 70 k have already published [9][10][11].
One of the major challenges of single-cell transcriptomics is the identification of the clusters of cells. Most recently, Kiselev et al. provided a comprehensive review of use and challenges of different unsupervised clustering algorithms on scRNA-seq data [12]. Unsupervised learning (clustering) has a key role to analyze scRNA-seq data since it can be applied to detect the putative cell types. In that review article, the advantages and shortcomings associated with the biological interpretation as well as annotation of the evolved clusters in different clustering algorithms were described. Andrews and Hemberg provided a survey of various computational methods to determine cell populations for scRNA-seq data [13]. Furthermore, Zhu et al. developed a novel technique, semisoft clustering which could classify both the pure cell types from the individual cells of the scRNA-seq gene expression profile [14]. Diaz-Mejia et al. compared four benchmarked algorithms, GSVA, CIBERSORT, ORA and GSEA to assign the cell type labels to the cell clusters in the scRNA-seq data [15]. Diaz-Mejia et al. chose the scRNA-seq datasets from various sources such as peripheral blood mononuclear cells, liver as well as retinal neurons for which the corresponding reference cell type-based gene expression signatures would be available. In addition, rare cell populations also play a significant role to detect the pathogenesis of cancer mediating angiogenesis, immune responses in cancer as well as other diseases. Antigen-related T cells are mandatory for forming the immunological memory [16][17][18]. Endothelial progenitor cells are potential biomarkers for the tumor angiogenesis [19,20]. Circulating tumor cells and their usability in management of the cancer as well as clinical data study were represented in Krebs et al. [21]. Stem cells can replace the damaged cells, and can also make treatment on several diseases such as heart diseases, Parkinson's disease, etc. [22]. In general, the number of methods developed for identifying the rare cell transcriptomes is very few. Among them, two algorithms (RaceID [23] and GiniClust [24]) work nicely. The two limitations of these algorithms are high elapsed time and lack of memory efficiency in case of large-scale oversized scRNA-seq profile. Hence, there are many approaches that were developed for satisfying different individual objectives for the analysis of the single-cell sequencing data.
In the literature, there are many articles about the identification of gene signature [25][26][27][28] or, biomarker [29][30][31] or, gene module [26,32,33] or, transcriptome analysis [34,35] for microarray/RNA-seq data and the integration of multi-omics data [36][37][38][39][40]. Among them, there are a few papers using fuzzy clustering for the microarray/RNA-seq data [41,42]. However, there has been no study yet using multi-objective optimization and fuzzy clustering together on single-cell RNA-seq data. Therefore, in this article, we developed a new computational framework to identify cell clusters using multi-objective optimization and fuzzy clustering together on scRNA-seq data. Our method has several advantages in the application of fuzzy clustering and multi-objective optimization strategy. The intuition behind using fuzzy clustering and TOPSIS ("Technique for Order Preference by Similarity to an Ideal Solution") multi-objective optimization strategy is described as follows: (i) Fuzzy C-Means (FCM) is a kind of clustering strategy in which each sample point belonging to the cluster is characterized by its membership function. In general, FCM tries to maintain the membership matrix of the input dataset that has been updated on every iteration by estimating the associated weight of every sample point to evaluate its degree of membership. The summation of every sample point towards all the clusters is unity. The main benefits of this strategy to scRNA-seq data include its capability to form clusters of the overlapped sample points and the results satisfy the property of convergence. The potential limitations of the cluster validity are that the prior necessity of c value is required for the quality clustering outcomes and outliers might be assigned to the similar membership value in every cluster. These limitations make it less desirable for using any kind of gene expression data.
(ii) TOPSIS method is used to identify the set of multi-objective optimized clusters. In the scRNA-seq data, the number of cell clusters varies among data sets. In this study, we attempted to identify the best set of multi-objective optimized clusters as measured by the quality of clustering, i.e., different clustering validity index measures. (iii) We performed a comparative analysis of our proposed method with the existing k-means clustering method. The comparative analysis indicated that our method outperformed over k-means clustering method.
Specifically, cell filtering and gene filtering were first performed for the single-cell data, and then normalized the data using the robust SCnorm normalization, respectively. We then considered various case studies through the selection of different cluster sizes, and used fuzzy c-means clustering algorithm, individually. For each case, we obtained the four cluster validity index measures, Partition Entropy, Partition Coefficient, Modified Partition Coefficient, and Fuzzy Silhouette Index. We set these four measures as objective functions in which first index was treated as minimization objective and rest of these three were assumed to be maximization objectives. Next we applied a multi-objective decision-making technique, TOPSIS with providing equal preference to each objective for detecting the multi-objective optimal (best) solution. From the optimal solution, we obtained the corresponding optimal cluster size along with cell-cluster information. We then performed Limma statistical package using the cell-cluster information to identify differentially expressed genes for each evolved cluster while compared to the rest. Furthermore, the top ten differentially expressed genes were considered to be the potential gene markers for the cluster. Furthermore, we conducted the KEGG pathway and Gene Ontology analyses through DAVID online database. Finally, our framework provided the multi-objective optimized clusters as well as potential gene markers for each cluster that might be useful to any scRNA-seq data.

Materials and Methods
In this article, we provided an extensive analysis to identify the single-cell clusters and gene markers, respectively using multi-objective optimization-based fuzzy clustering for a scRNA-seq gene expression dataset. See Figure 1 for the flowchart.

Initial Filtering
First, we performed some filtering analysis on the initial data. According to the latest literature search [23,43], it had been noticed that the filtering criteria was not fixed anymore. Hence, we chose the standard cutoffs for cell filtering as well as gene filtering. However, first we transformed the count data to a Boolean matrix where the non-zero values were replaced by 1. Next, we counted the summation of all the non-zero values for each cell, and then chose those cells which contained the summation score greater than 2000. After that, we focused on gene filtering. Here we only selected those genes (features) that contained the summation of all the corresponding non-zero values greater than 3 and less than 500. Finally, we computed the feature-wise variance for each gene, and then chose some top highest variant genes for the next step. Thereafter, we identified the count-depth relationship through plot and then applied a recent robust normalization technique, SCnorm [44] made for single-cell sequencing data.
Of note, prior to normalize through SCnorm normalization [44], the relationship between the expression counts and the corresponding sequencing depth (named as the count-depth relationship or slope) for the experimental data should be verified. SCnorm normalizes across the cells for eliminating the effect of the sequencing depth on the counts. The genes were initially partitioned into 10 equally sized groups depending upon their non-zero median expression. In SCnorm, only those genes that had at least 10 non-zero expression values, were selected be default. SCnorm had initiated at the value of the parameter K equal to 1 that helped to normalize the data with the assumption that all the genes had to be normalized in a single group. The sufficiency of the score K = 1 was estimated through determining the normalized count-depth relationship. To do so, all the genes were divided into 10 groups depending on their corresponding non-zero unnormalized median expression scores (considering equal group-size) and evaluated the mode for each corresponding group. Whenever all 10 modes were within 0.1 of zero, the value K = 1 would be sufficient. While any of those modes was less than −0.1 or greater than 0.1, the SCnorm method attempted to normalize considering K = 2 and repeated the group-wise normalizations along with the corresponding estimation. It would continue until all those modes were within 0.1 of zero. Moreover, SCnorm method initially tried to fit the corresponding model for the value K = 1, and then subsequently increases the value of K until an approximate satisfactory stopping point had been reached. As the gene expression generally increased proportionally while increasing the sequencing depth, the count-depth relationships were required to evaluate near 1 for all the genes. In general, for single-cell data, those relationships were somehow variable across the genes. However, after the SCnorm normalization, the count-depth relationship could be evaluated on the normalized data profile where the slopes near zero signified successful normalization.

Fuzzy Clustering for Finding Optimized Cell Clusters
After normalization, we applied a well-known clustering algorithm, fuzzy c-means clustering for the initial number of clusters, cl = 2, 3, ...udc (where udc be a user-defined cluster size), signifying (udc-1) number of case studies, individually, and then computed four cluster validity indices: Partition Entropy, Partition Coefficient, Modified Partition Coefficient, and Fuzzy Silhouette Index from each case study.
Fuzzy c-means [45] is a clustering algorithm based upon fuzzy membership concept in the domain of machine learning. Let Fuzzy c-means clustering algorithm makes partitions of n data features (data points) X = {x 1 , x 2 , x 3 , ..., x n } n×a into cl (1 ≤ cl ≤ n) fuzzy clusters while every feature contains a number of attributes. Suppose Cen = {cen 1 , cen 2 , cen 3 , ..., cen cl } cl×a stands for the set of cluster centers, and U = [u jq ] cl×n is cl × n matrix of the membership degrees where u jq denotes the membership degree of q th feature to cl th cluster center. The above matrix satisfies the following criteria: The fuzzy c-means algorithm applies the following objective function to obtain the optimal solution of the corresponding fuzzy optimal clustering. The objective function J f cm is be defined in the following: where m (1 ≤ m ≤ ∞) is the fuzzification coefficient denoting the fuzzy degree of clustering. In our work, we used m =1. Here || * || be any norm measuring the similarity between the cluster center and any measured data. The objective function J f cm should be minimized.
The minimization of the objective function is performed through Lagrange multiplier technique under the constraint cl ∑ j=1 u jq = 1 (q = 1, 2, 3, ..., n), whereas the membership degree and the cluster centers have been updated by the following equations: The algorithm terminates while the criteria max jq |u i+1 jq − u i jq )| ≤ is satisfied, whereas be a termination constant that lies between 0 and 1, while i denotes the iteration step id. This algorithm converges to either a local minimum or saddle point of the objective function J f cm .

Measuring Cluster Validity Index Measures
Partition Entropy (PE) [46,47] and Partition Coefficient (PC) [47,48] are two cluster validity indices that were developed by James C. Bezdek. PE and PC were defined as follows: and Modified Partition Coefficient (MPC) [49,50] was introduced for correcting the monotonic trend of PC. The values of MPC lie between 0 and 1. MPC was defined as follows: Fuzzy Silhouette Index (FSI) [50,51] is such a measure where it selects the two such clusters in which x q contains the highest membership degrees. FSI was described in the following: where Here, a data feature (point) x q is in the part of the cluster gt j , (gt j ∈ (gt 1 , gt 2 , gt 3 , ..., gt cl )); whereas δ(x q , gt j ) is basically the intra-cluster distance that signifies the mean distance between x q and all other points that belong to the same cluster, gt j . On the other hand, β(x q , gt j ) is an inter-cluster distance that denotes the distance between x q and its neighbor cluster closest to the cluster gt j .
For obtaining the optimal clusters, PC, MPC, and FSI should be maximized, while PE should be minimized.

Identifying Optimal Fuzzy Clustering Solution Using Multi-Objective Decision-Making Model
After producing the values of these four measures for each case study, we applied a multi-criterion decision-making model, TOPSIS ("Technique for Order Preference by Similarity to an Ideal Solution") [52,53].
First the decision matrix was constructed, and the weight of criteria was evaluated. Suppose, Y = y ij is a decision matrix and Wt = (wt 1 , wt 2 , wt 3 , ..., wt n ) is a weight vector, (y ij ∈ , wt j ∈ and wt 1 + wt 2 + wt 3 + ... + wt n = 1). Of note, the criteria of the functions could be one of the two types: (i) benefit functions (maximization is better), or, (ii) cost functions (minimization is better).
Next, the normalized decision matrix was computed. In this step, different attribute dimensions were converted into the non-dimensional attributes that allowed the comparisons toward the criteria. Since the different criteria were generally evaluated in different units, the evaluation scores in the corresponding matrix Y were required to convert into a normalized scale. The normalized values could be performed through one of the popular standardized formulas. The most useful techniques of computing the normalized score n ij were as follows.
After that, the weighted normalized decision matrix v ij was determined through the following manner: v ij = wt j n ij , for i = 1, 2, 3, ..., m and j = 1, 2, 3, ..., n. where wt j denoted the corresponding weight of j-th criterion, and n ∑ j=1 wt j = 1.
Next, the positive ideal alternative and the negative ideal alternative were evaluated. The ideal positive solution was the solution which maximized the benefit portion of the criteria as well as minimized the cost portion of criteria. On the other hand, the negative ideal solution was the solution that maximized the cost section of the criteria along with minimizing the benefit section of the criteria. Positive ideal solution, AB + contained the following form: and negative ideal solution AB − contained as follows: where I and J were related to the benefit portion of the criteria and the cost portion of the criteria, respectively, i = 1, 2, 3, ..., m and j = 1, 2, 3, ..., n.
The separation measures were computed from the positive ideal solution and from the negative ideal solution. Here, in TOPSIS approach, distance metrics were used. The separation of every alternative obtained from the positive ideal solution was demonstrated as below: where i = 1, 2, 3, ..., m, and p ≥ 1. On the other hand, the separation of every alternative obtained from the negative ideal solution was described in the following: where i = 1, 2, 3, ..., m, and p ≥ 1. Now in the case of p = 2, the most useful traditional n-dimensional Euclidean metrics were produced as below: and where i = 1, 2, 3, ..., m.
After that, we computed the relative closeness (RC) of i-th alternative AB j with respect to AB + that was described as below: where 0 ≤ RC i ≤ 1, and i = 1, 2, 3, ..., m. Finally, we ranked a set of alternatives in descending order of the corresponding score of RC i . In the TOPSIS algorithm, we provided the scores of the four cluster validity index measures for the (udc-1) number case studies (cl = 2, 3, ..., udc), where Partition Coefficient, Modified Partition Coefficient, and Fuzzy Silhouette Index were three objectives to be maximized, and Partition Entropy was the objective to be minimized. Using this multi-objective optimization technique with providing equal weight to each objective, we determined the TOPSIS optimal score for each study. Higher score signified here better rank. The top ranked case study denoted here the final optimal solution. Hence, we only considered the final optimal solution among those case studies, and remaining solutions were discarded. The four cluster validity indices for the optimal solution that signified the quality of the underlying clustering, were represented in the final clustering result. From the optimal clustering, we also obtained the number of best optimal cluster size used for the clustering, and also cluster information of the cells that would be used for the next step, differentially expressed gene identification.

Identification of Differentially Expressed Genes through Statistical Test
In this step, we applied Limma statistical tool [40,54] that used Empirical Bayes test using the cluster information of the cells for each resultant cluster in compared to the rest of the clusters, and generated the p-value for each gene (feature). The p-values were then adjusted by the most stringent p-value correction strategy, Bonferroni method [55]. Those genes that contained adjusted p-value less than 0.05, were considered to be differentially expressed genes. The top genes were then searched in the literature to know whether there was any connection of them with any disease or biological function or not.

Gene Set Enrichment Analysis
We conducted KEGG pathway and Gene Ontology (GO) enrichment analysis for the differentially expressed genes for each resultant cluster vs. the rest clusters using DAVID online tool. After this analysis, we considered those pathways or GO-terms were significant if the Bonferroni corrected p-values were less than 0.05.

Identification of Novel Gene Markers
The top genes that were found to be associated with neither any disease or biological function in any literature, nor significantly enriched in any KEGG pathway or GO-term, were chosen as novel gene markers.

Source Dataset
In this study, we first proposed single-cell cluster identification method based on multi-objective optimization for scRNA-seq gene expression data. The R code is available at https://github.com/ sauravmtech2/MOO-FUZZ-for-scRNAseq-. Then, we evaluated the method by using the count data of "Whole Organoid Replicate 1" within a single-cell mRNA sequencing dataset (GEO ID: GSE62270). This dataset has initially contained 23,630 features (genes) and 288 cells in mice. Here, we provided our evaluate results.

Filtering Analysis
In the cell filtering, we selected a total of 206 cells in this step. In the gene filtering, the number of the selected filtered genes became 11,466. After computing the gene-wise variance, we chose top 3000 highest variant genes for the later step. We illustrated the count-depth relationship plot in Figure 2A. Next, we then normalized the data using the SCnorm normalization technique. In our study, the value of K in count-depth relationship plot was evaluated 4 (the four groups of slopes colored as sky, blue, magenta and red sequenced from lowest to highest expression median score) while all those 10 slope densities had the absolute slope mode < 0.1 (default cutoff). However, the normalization of the highly expressed genes would be good, whereas the lowly expressed (and moderately expressed) genes might be over-normalized and they generated negative slopes.

Fuzzy Clustering of Cells
For clustering the cells, we applied fuzzy c-means clustering for different initial number of clusters, cl = 2, 3, ..., 10 (nine case studies as udc = 10 here), individually, and produced the values of the four cluster validity indices: FSI, PE, PC, and MPC. See Table 1 for details.

Gene Set Enrichment Analysis
Furthermore, we conducted gene set enrichment analysis using online tool DAVID and obtained significant KEGG pathway and Gene Ontology (GO) terms (three terms: Biological Process: BP, Cellular Component: CC and Molecular Function: MF). After Bonferroni correction, we had 18 significant KEGG pathways, 37 significant GO:BP terms, 55 significant GO:CC terms and 23 significant GO:MF terms. The top three significant KEGG pathways were Ribosome (Bonferroni corrected p-value = 1.08 × 10 −59 and 94 participating genes), Spliceosome (Bonferroni corrected p-value = 1.02 × 10 −19 and 54 participating genes), Biosynthesis of antibiotics (Bonferroni corrected p-value = 1.61 × 10 −8 and 52 participating genes). Details were summarized in Table 4.

Discussion
In addition, we conducted literature search on the top ten gene markers. Among then ten genes, we obtained nine genes (Rps21, Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Rps3a1, Aldob and Rps17) that had an involvement with some diseases or biological functions, while only one gene (Khk) was found to be unknown (novel). For example, the gene Rps21 was associated with some related biological functions such as artificial nucleic acid molecules [56], exosome and human ribosome biogenesis [57], arterial vasculature [58], a KEGG pathway, Ribosome, and several Gene-Ontologies such as GO:BP: GO:0006412 translation, GO:CC: GO:0005622 intracellular, GO:0022627 cytosolic small ribosomal subunit, GO:MF: GO:0003735 structural constituent of ribosome, GO:0044822 poly(A) RNA binding. Hence, the status of the gene Rps21 is known. Similarly, the other eight genes, (Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Rps3a1, Aldob were connected with either some biological functions or KEGG pathway or Gene Ontology or both. For details, see Table 6. However, in case of the gene Khk, there was no connection found with any disease or biological function through literature search. Also, there is no single KEGG pathway or GO-term for the gene. Hence, Khk gene was a novel maker for the rare intestinal cell types for "Whole Organoid Replicate 1". Of note, our study was conducted on a publicly available dataset GEO ID: GSE62270 that was published by Grun et al. [23]. In the dataset, Grun et al. proposed RaceID method and performed an initial analysis with the published dataset. They had selected multiple cell types, such as organoid, Lgr5, Reg4, etc. and then performed analysis on them, whereas we chose only an organoid, "Whole Organoid Replicate 1" that composed of several cell types, and then conducted our analysis using our proposed framework. Our main objective behind our study is to propose a new method to identify cell clusters using multi-objective fuzzy clustering technique for single-cell data. To verify the performance of our method, we used this dataset. The method developed by Grun cell et al. was for rare cell type detection, whereas our method was developed only for cell-cluster identification through multi-objective optimization and fuzzy clustering. As the comparative study, our proposed method and RaceID both used some initial pre-filtering strategy such as cell filtering and gene (feature) filtering. During pre-filtering, we applied a recent robust normalization technique, SCnorm made for single-cell sequencing data, but in RaceID, the total transcript count across each cell was normalized to the median transcript number toward cells. Thereafter, RaceID used k-means clustering algorithm while the number of cluster was determined from "gap statistics" [23], whereas our proposed method used the stronger clustering algorithm, fuzzy c-means in which the number of cluster was determined by a multi-objective optimization technique, TOPSIS on the four cluster validity index measures, PE, PC, MPC, and FSI of which first one was minimization index and remaining three were maximization indices. Since our method focused on optimizing the cell clusters depending upon the quality of clusters, our technique of identifying cell clusters was obviously stronger than the cell-cluster identification approach used in RaceID. On the other hand, RaceID can detect outlier cells (rare cells) from the resultant clusters, whereas we did not do anything in our study to identify outliers (rare cells). In future, we will extend our work to detect rare cells for single-cell RNA-seq data. In case of our analysis, the number of cells was 206 after initial filtering. Here through integrated use of fuzzy clustering and TOPSIS multi-objective optimization technique, we obtained optimal cluster result (TOPSIS optimal score = 0.8584) that comprised of two optimal cell clusters, one containing 115 cells and another with 91 cells. The evaluated scores of the four cluster validity indices, FSI, PE, PC, and MPC for the optimized fuzzy clustering were 0.482, 0.578, 0.607 and 0.215, respectively. In case of the analysis by Grun et al, six cell clusters had been identified through k-means clustering [23]. Table 6. Evaluation of the top ten gene markers through literature evidence, KEGG pathway and Gene Ontology analyses.

--Novel
In technical point of view, our algorithm used fuzzy c-means clustering that is better and more flexible in statistical point of view rather than traditional k-means clustering. In addition, we checked some network evaluating factors such as average scaled connectivity, average maximum adjacency ratio (MAR), density and average Pearson's correlation. To do so, for our proposed method, we first picked up the set of cells for each evolved cluster, individually and then computed average Pearson's correlation in cell-pair wise manner for each cluster. We chose the cluster that had highest average correlation. Using the cluster, we determined the above four network evaluating measures, average scaled connectivity = 0.752, average maximum adjacency ratio (MAR) = 0.404, density = 0.295 and average Pearson's correlation = 0.786. Similarly, for comparative study, we first applied k-means clustering on the data providing the same number of cluster size (=2) as input that was obtained in our proposed method. We then selected the cluster that had highest average correlation, and computed the above four measures, average scaled connectivity = 0.746, average maximum adjacency ratio (MAR) = 0.402, density = 0.294 and average Pearson's correlation = 0.784. In all these cases, our method provided better scores than those for traditional k-means clustering. Hence, our proposed method provided stronger clustering performance than the existing one.
In addition, we conducted extensive analysis on the two resultant clusters obtained by our method and compared with the 16 clusters (comprising of different rare cell types: goblet, tuft, Paneth and enteroendocrine cells) generated from the random organoid cells of Grun et al. [23] (Supplementary Table S1 of Grun et al. named as "nature14966-s1.xlsx"). To do so, we first chosen 1240 differentially expressed genes obtained by Limma using the class-label information (cluster #1 vs. cluster #2) of the underlying cells. Thereafter, we checked the fold change of each gene using cell-cluster class labels (each cluster vs. rest), and set the fold change cut off 2 to obtain up-regulated gene markers for the cluster. For cluster #1, we identified 344 up-regulated markers, while for cluster #2, the number of up-regulated markers were 653. After that, we performed intersection of these cluster-markers with the cluster-markers of Grun et al. for the organoid. In the case of our cluster #1 vs. each of the 16 clusters of Grun et al., the number of overlapped gene markers were 13,61,45,9,11,4,67,18,49,25,79,29,18,52,63, and 113, respectively. Similarly, for the case of our cluster #2 vs. each of the 16 clusters of Grun et al., the number of overlapped gene markers were 5, 81, 12, 0, 1, 0, 107, 8,35,42,95,38,20,13,36, and 66, respectively. For detailed information, three Additional files, Tables S5-S7 were provided.
Of note, Limma was generally made for microarray data and later it was extended the model to use it for the RNA-seq data. In details, we first used different pre-filtering techniques and then normalization technique (SCnorm) prior to use Limma for the specific reason. There is some literature where the comparative study among different tools including Limma for scRNA-seq data had been demonstrated [63]. However, currently there are many new tools developed for only scRNA-seq data whose performance is found to be better than Limma to some extent while using scRNA-seq data [63].

Conclusions
Analysis on the single-cell messenger RNA sequencing data is always a challenging task. In this article, we conducted a comprehensive analysis of detecting the cell clusters and potential gene markers, respectively through a multi-objective optimization-based fuzzy clustering framework for the scRNA-seq gene expression data. In this regard, we initially performed cell filtering as well as gene filtering, and then applied SCnorm normalization. Next, we conducted nine case studies through choosing various cluster sizes (=2, 3, ..., 10), and performed fuzzy c-means clustering algorithm individually. From each case study, we measured the scores of the four cluster validity index measures, Partition Entropy, Partition Coefficient, Modified Partition Coefficient, and Fuzzy Silhouette Index. Meanwhile, we fixed these four measures as different objective functions in which first measure was treated as minimization objective, while the rest of these three measures were considered to be maximization objectives, and then applied TOPSIS multi-objective decision-making technique for determining the optimal (best) solution. The case study which contained highest TOPSIS optimal score (top optimal rank), was elected as the best optimal solution. The cluster information of the cells had been detected. After that, we used Limma statistical tool and identified the differentially expressed genes for each resultant cluster with compared to the rest. We applied the count data of "Whole Organoid Replicate 1" scRNA-seq dataset for the rare intestinal cell type of mus musculus. Using the proposed method, we produced the best optimal solution with TOPSIS optimal score 0.8584 and corresponding optimal clustering result that contained two clusters of which one consisted of 115 cells while the other cluster consisted of 91 cells. The scores of these four cluster validity indices, FSI, PE, PC, and MPC for the optimal fuzzy clustering technique were 0.4818, 0.5784, 0.6073 and 0.2145, respectively. Next, by using Limma, we identified 1240 differentially expressed genes (cluster 1 vs. cluster 2). The top ten gene markers were Rps21, Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Khk, Rps3a1, Aldob and Rps17, respectively among which Khk is novel marker. Gene set enrichment analyses (KEGG pathway and Gene Ontology analyses) had been performed through DAVID online database. Finally, it can say that our proposed framework generated the multi-objective optimized clusters and potential gene markers, respectively that are highly useful for any kind of scRNA-seq data. Of note, the major objective of our proposed method could detect Pareto-optimal cell clusters using multi-objective fuzzy clustering technique for scRNA-seq data. In our method, we had not included any rare cell (outlier) detection strategy. In future work, we will extend our current work to detect rare cells (outliers).