Cell-type composition analysis of scRNA-seq data with a deep convolutional neural network

Background: With the rapid development of single-cell RNA sequencing (scRNA-seq), more and more large-scale single-cell sequencing data are being generated, and the analysis of cell-type composition from single-cell transcriptomics faces growing challenges. Since the emergence of scRNA-seq technology, the size of sequencing datasets has grown more than a million-fold in just over a decade. Meanwhile, as more gene markers are discovered, the dimensionality of single-cell sequencing data becomes higher. All of this places more stringent requirements on dimensionality reduction and clustering algorithms. Under practical constraints such as noise, dropouts, and limited computational budget, an effective and efficient method is needed that can produce accurate analysis results in a very short time with competitive stability. Results: We present scCAE, an effective and efficient method based on a convolutional autoencoder that can accurately and rapidly analyze cell-type composition from single-cell transcriptomic datasets. Among existing methods, ours achieved the best results on the datasets simulating the cell differentiation process, with ARIs of 69.64% and 68.83% on the 10-cluster and 25-cluster tasks. Our method also performs well under different dropout levels: when the sparsity of the data matrix is 71%, scCAE achieves an ARI of 45.29%, the highest among existing methods. In terms of algorithmic overhead, our method also compares favorably with several existing methods: it takes less time than most and far less memory than other neural-network-based algorithms. Conclusions: Our method, scCAE, yields more accurate and reasonable results in the analysis of cell-type composition, and, owing to the design of its imputer, it can handle a large number of dropouts in the data matrix.
Because of its convolutional network structure, scCAE incurs less time and space overhead than other deep-learning-based methods. We therefore demonstrate that scCAE is a competitive method for analyzing cell-type composition from scRNA-seq data, and we expect this study to be a stepping stone toward further progress in single-cell transcriptomics analysis.


Background
Single-cell RNA sequencing (scRNA-seq) is a cell transcriptomics research method based on second-generation gene sequencing technology. It allows high-throughput expression detection of genes at the single-cell level, in-depth analysis of complex cell populations, and characterization of single-cell expression profiles, avoiding the masking of individual-cell heterogeneity that occurs when cell populations are treated as homogeneous [1]. The advancement of microfluidic chips makes it possible to isolate a large number of cells.
With the improvement of RNA isolation and amplification methods, it is possible to analyze the transcriptional composition of individual cells using scRNA-seq technology [2][3][4][5].
The first scRNA-seq study was published in 2009, analyzing only eight cells [6]. Seven years later, 10X Genomics obtained a dataset of more than 1.3 million cells. Such large datasets make it possible to discover and analyze cell types: they ensure high analysis accuracy and improve the ability to detect rare cell types. However, they also bring problems such as high dimensionality, high dropout rates, and excessive computational overhead.
Linear transformations, e.g., principal component analysis (PCA) [7][8][9][10], are unable to accurately capture the relationships between cells because of the high dropout rates and noise in the data. Non-linear approaches are more flexible, providing results that are more intuitive and easier to interpret through visual observation. Currently, the most commonly used non-linear dimensionality reduction methods are t-SNE [11] and UMAP [12]. t-SNE computes a transformation from a high-dimensional space (Gaussian distribution) to a two-dimensional space (t distribution) and minimizes the difference between the two spatial distributions over all points. However, t-SNE still suffers from long tails, a tendency to preserve only local features, and slow computation. UMAP uses manifold learning and projection to achieve dimensionality reduction of high-dimensional data.
Methods for dimensionality reduction and clustering of single-cell transcriptomic data mainly include graph-based PCA analysis methods represented by Scanpy [10], Seurat [13], PhenoGraph [14], etc.; data-driven K-means analysis methods represented by SIMLR [15]; neural-network-based analysis methods represented by SAUCIE [16], DCA [17], AE [18][19][20][21], scScope [22], etc.; and other mathematical methods represented by ZINB-WaVE [23], MAGIC [24], etc. In practice, PCA is highly scalable and suitable for many types of data analysis, but on smaller datasets its accuracy decreases significantly. SIMLR can effectively improve sensitivity on noisy datasets through parallel training based on distance metrics, but because the algorithm can adjust the distance metrics to make cells fit clusters, it may artificially inflate the results and obscure the algorithm's real effect [25]. ZINB-WaVE provides a more stable and accurate low-dimensional representation of the data without preliminary standardization steps; however, it has a major flaw in that it cannot guarantee that the extracted low-dimensional signals are biologically relevant. Neural-network-based analysis methods have developed in recent years with the rise of deep learning and have achieved good results. For example, scScope uses a recurrent network layer to iteratively impute the dropouts in the input scRNA-seq data, performing dimensionality reduction and clustering more efficiently and accurately, especially on large datasets.
As a hot topic in machine learning in recent years, deep learning is well suited to the life sciences and medical fields. Convolutional neural networks (CNNs) improve training performance by using spatial relationships to reduce parameters and were the first truly multi-layer structure learning algorithms [26]. On this basis, fully convolutional networks (FCNs) [27] were proposed, and the further developed U-net is widely used in medical image segmentation [28]. In bioinformatics, deep learning algorithms are also gradually coming into wide use. Convolutional neural networks can predict molecular properties and thereby further infer interactions between drug molecules [29]. Through reinforcement learning and generative adversarial networks, it is possible to find drug structures that meet specified goals, thereby guiding the development of new drugs [30]. In genetic analysis, by converting genetic data into images and using neural-network-based image recognition models, genomic differences can be identified [31]. In summary, the adoption of deep learning methods has brought new possibilities to the analysis of biological information.
Based on the idea of deep learning, we propose a convolutional-neural-network-based method named scCAE. It uses symmetrical convolutional layer structures at both ends and an autoencoding structure in the middle; we call this structure a convolutional autoencoder. In addition, our network uses skip layers similar to U-net, and an imputer that handles the dropouts present in the data. In experiments, we found that scCAE with the imputer significantly improves the analysis of cell-type composition compared with the structure without the recurrent imputer, especially when there are more dropouts. Finally, a standard scCAE pipeline also incorporates quality control, batch correction, and Louvain clustering. The main contributions of this work are as follows.
(1) Increased accuracy and more complex inference. Accuracy is the primary criterion for evaluating such methods. Compared with existing algorithms, scCAE significantly improves the accuracy of cell-type analysis, especially on more complicated scRNA-seq data containing differentiating groups of cells. This is mainly reflected in higher Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Silhouette Width (SW) scores than other methods. In addition, our method has relatively stable accuracy across experiments and is not easily affected by initial randomness.
(2) Better handling of a large number of dropouts. Single-cell RNA sequencing usually contains more dropouts and noise than bulk sequencing of cell populations. In cell expression matrices, sparsity greater than 50% is common. Our method effectively suppresses these interferences by using the imputer and convolutional layers; even when the matrix sparsity exceeds 70%, scCAE still works well.
(3) Large-scale datasets and lower overhead. As single-cell RNA sequencing data continue to accumulate, datasets are growing, and the number and variety of cells to be analyzed at one time keep increasing. scCAE uses a convolutional structure with shared convolution kernels, which greatly reduces time and space overhead. It supports GPU parallel computing and occupies less memory, so our method is more effective and efficient on big data.
The rest of this paper is organized as follows. Section 2 introduces the structure of scCAE and the simulated data prepared for the subsequent experiments. Section 3 compares scCAE with other methods through experiments; the datasets are of two types, one generated by simulation in this paper and the other existing real-world data from the Internet and our laboratory. Section 4 concludes the paper.

scCAE model
The main structure of the scCAE network is similar to U-net, as shown in Fig. 1. Under this structure, the network inherits U-net's suitability for large-scale data processing, which is reflected in its ability to handle high-dimensional single-cell RNA sequencing data. This is mainly due to the convolutional and pooling layers: the convolutional layer helps identify more local features in the dataset, and the pooling layer reduces the number of output values by shrinking the size of the input.
Single-cell sequencing data form a two-dimensional matrix, but adjacency between different cells in this matrix carries no meaningful feature information. Therefore, the data cannot be input as a two-dimensional matrix; instead, the matrix is split by cell barcode to obtain the expression vector of each single cell. In addition, to retain as much of the original expression-level information as possible, scCAE uses inputs of size [1, 2048]. Since the input is a one-dimensional vector, the convolutional layers of scCAE use one-dimensional convolutions instead of the two-dimensional convolutions common in image processing, and the pooling layers are likewise one-dimensional.
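As a sketch of this preprocessing step (the function name and the zero-padding/truncation scheme are our illustrative assumptions, not the authors' released code), each row of the cells-by-genes matrix can be turned into a [1, 2048] input vector:

```python
import numpy as np

# Hypothetical sketch: split a cells-by-genes matrix into per-cell
# expression vectors of fixed length 2048, the input size scCAE expects.
def to_cell_vectors(expr, target_len=2048):
    n_cells, n_genes = expr.shape
    out = np.zeros((n_cells, 1, target_len), dtype=float)
    keep = min(n_genes, target_len)
    out[:, 0, :keep] = expr[:, :keep]   # truncate or zero-pad each cell
    return out

expr = np.random.default_rng(0).random((5, 3000))  # toy expression matrix
vecs = to_cell_vectors(expr)                       # shape (5, 1, 2048)
```

In practice, one would instead select 2048 highly variable genes during quality control rather than truncating by column order.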
Because the main task of the network is dimensionality reduction and cell-type composition analysis rather than image segmentation, U-net's heatmap output is not used in the middle hidden layers; instead, we insert an autoencoder structure there. This structure compresses the information obtained after convolution. As the convolution and pooling layers have already effectively extracted the features of the data, the coding layer does not require many neural nodes, which greatly reduces overhead. For the upsampling part, scCAE does not adopt U-net's channel fusion strategy, because in a dimensionality reduction task this strategy would make the upsampling output rely excessively on the output of the previous convolutional layer rather than on the autoencoder, which is not conducive to training. For this reason, the upsampling part of scCAE omits channel fusion but retains the idea of a skip layer: we take a weighted sum of the two inputs with weights summing to 1, which both preserves training efficiency and keeps the network symmetrical.
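The weighted skip connection just described reduces to a convex combination of the downsampling output and the autoencoder output. A minimal sketch (the function name and the default weight are ours, chosen for illustration):

```python
import numpy as np

def skip_combine(x_down, x_auto, a=0.8):
    # weighted addition with weights summing to 1 (b = 1 - a), so the
    # upsampling input mixes the skip path and the autoencoder path
    b = 1.0 - a
    return a * x_down + b * x_auto

mixed = skip_combine(np.full(4, 2.0), np.zeros(4))  # 0.8*2 + 0.2*0 = 1.6
```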
To address the dropouts present in real-world datasets, an imputer is added, which mitigates the adverse impact of dropouts and noise in the sequencing data. In scCAE, the imputer adopts a three-layer neural network with sizes [8×128, 128, 4×128], connected to the output of the convolutional layer and the input of the autoencoder layer by means of a recurrent neural network.
Before the data are input, they also need preprocessing, such as quality control and elimination of batch effects. After obtaining the dimensionality reduction results from the scCAE network, a clustering method must be chosen. The output of scCAE is a one-dimensional feature vector, which can easily be combined with any clustering method. For general-scale data, we use the Louvain method [32] for the final cluster analysis, as it is superior to other methods in computation time and has higher accuracy. For large-scale analysis tasks, we follow the practice of scScope [22]: because of the extremely high computational cost and memory requirements of graph construction, PhenoGraph [14] cannot process millions of cells. Combining K-means with PhenoGraph solves this problem well. Following a divide-and-conquer strategy, we first use K-means for an initial clustering and then apply PhenoGraph for the analysis.
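The divide-and-conquer step can be sketched with a plain Lloyd-style K-means for the coarse partition (the subsequent PhenoGraph refinement within each partition is omitted; this is our illustration, not the released pipeline):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # plain Lloyd's algorithm for the coarse divide-and-conquer split
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

# two well-separated toy blobs; PhenoGraph would then refine each partition
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(10, 0.1, (10, 2))])
labels = kmeans(X, 2)
```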

Training function
The parameters in the scCAE network are obtained through end-to-end training. Owing to the adaptability of its parameters, scCAE is insensitive to whether the input data have been normalized. Nonetheless, we recommend normalizing the data in practice, because it may improve the convergence rate of training. Considering that experimental data generally contain batch effects, scCAE provides an optional batch correction layer, given by

x̃_c = f_Batch(x_c) = lr(x_c − B u_c),

where f_Batch is the batch correction layer, x_c ∈ R^N denotes the original input single-cell expression vector, lr is the activation function (leaky ReLU), B is the batch correction matrix updated by learning, and u_c is a binary indicator vector of the experimental batch.
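Under our reading of this layer (the subtraction form is an assumption, consistent with similar correction layers such as scScope's), a numpy sketch would be:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def batch_correct(x_c, B, u_c):
    # remove the learned batch effect B @ u_c, then apply leaky ReLU (lr)
    return leaky_relu(x_c - B @ u_c)

x_c = np.array([5.0, 0.5, 2.0])       # toy expression vector (N = 3)
B = np.array([[1.0], [0.0], [0.5]])   # learned batch correction matrix
u_c = np.array([1.0])                 # binary batch indicator (one batch)
corrected = batch_correct(x_c, B, u_c)
```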
When downsampling through the convolution and pooling layers, the downsampling layer f_Conv is given by

h_c1 = f_Conv(x̃_c) = P_Max(W_Conv ∗ x̃_c + b_1),

where P_Max denotes the max pooling layer, W_Conv is a learnable convolution kernel, ∗ denotes convolution, and b_1 ∈ R^M is a bias unit. After this layer, the single-cell expression profile is compressed from N dimensions to M dimensions, M < N. Note that the convolution and pooling operations here are one-dimensional.
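A minimal one-dimensional convolution-plus-max-pooling forward pass (valid padding, stride 1; a toy sketch of the shape behavior of f_Conv, not the trained layer):

```python
import numpy as np

def conv1d(x, w, b=0.0):
    # valid 1-D convolution (cross-correlation form), stride 1
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)]) + b

def max_pool1d(x, size=2):
    # non-overlapping max pooling halves the length when size = 2
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.arange(8.0)                                    # toy expression vector
h = max_pool1d(conv1d(x, np.array([1.0, 0.0, 0.0])))  # identity-like kernel
```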
Then the data are compressed again by the encoder layer f_En:

h_c2 = f_En(h_c1).

The decoder layer f_De decompresses the latent representation h_c2 and restores it to the M-dimensional space, consistent with the dimension produced by the downsampling:

ĥ_c1 = f_De(h_c2).

The decoded latent representation is then decompressed again by the upsampling layer f_Deconv:

x̂_c = f_Deconv(a h_c1 + b ĥ_c1) = W^T_Deconv ∗ (a h_c1 + b ĥ_c1),

where W^T_Deconv is a learnable transposed convolution kernel in the deconvolution operation, and a and b are constants. For the imputer, we propose two ideas, corresponding to two different functions. The first idea connects the imputer directly as a neural network. Because convolution can effectively reduce the impact of dropouts and compress the expression vector, the imputer's main role in this case is to deepen the network and give it stronger inference ability:

v_c = lr(W_I x̂_c + b_I),

where W_I ∈ R^{M×N} are learnable weights and b_I ∈ R^M is a bias. With this idea, v_c recurs directly to the front of the network.
The second idea treats the imputer as a self-correction mechanism and applies a step-like function f_Step to obtain v_c:

v_c = f_Step(W_I x̂_c + b_I).

After obtaining the vector v_c from the imputer, the latent representation is updated as h_c1 ← h_c1 + v_c, and the result is fed back into the encoder, decoder, and upsampling layers to learn the latent expression vector. Considering the recurrent structure of scCAE, the functions above can be written as

h_c1^(t) = f_Conv(x̃_c) + v_c^(t−1),  v_c^(0) = 0,

where t = 1, …, T; in our design of scCAE, T = 2. If T = 1, there is no imputer structure in scCAE; we call this variant scCAE-nonImputer. This paper uses the MSE loss (L2 loss) as the loss function:

L = (1/N) Σ_{i=1}^{N} (x_{c,i} − x̂_{c,i})²,

where i = 1, …, N indexes each gene expression in the single-cell profile, and we set N = 2048. For the optimizer, we chose Adam with learning rate 1e-3 and betas = (0.5, 0.999).
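The recurrence with T = 2 and the MSE objective can be sketched with toy stand-ins for the real layers (the callables below are placeholders, not scCAE's trained modules):

```python
import numpy as np

def mse_loss(x, x_hat):
    # L2 reconstruction loss averaged over the gene positions
    return np.mean((x - x_hat) ** 2)

def recurrent_forward(x, forward, imputer, T=2):
    # v_c^(0) = 0; each pass feeds the imputer correction back into the
    # network; T = 1 corresponds to scCAE-nonImputer (no correction applied)
    v = np.zeros_like(x)
    for _ in range(T):
        x_hat = forward(x, v)
        v = imputer(x_hat)
    return x_hat

forward = lambda x, v: x + v                 # placeholder network pass
imputer = lambda xh: np.ones_like(xh) * 0.5  # placeholder correction
out2 = recurrent_forward(np.zeros(4), forward, imputer, T=2)
out1 = recurrent_forward(np.zeros(4), forward, imputer, T=1)
```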

Data generation
To evaluate the scCAE method, three types of datasets are used in the experiments of this paper: simulated datasets generated by the Splatter [33] package in R; a dataset published by 10x Genomics; and a dataset of circulating tumor cells (CTCs) obtained by scRNA-seq in our laboratory. Splatter is an R package for the simple simulation of single-cell RNA sequencing data. Since existing scRNA-seq datasets generally contain large numbers of cells that cannot be visually inspected for labeling, it is difficult to use existing real datasets for a comprehensive evaluation of our method. Therefore, in this paper, most of the algorithm evaluation datasets are generated by Splatter together with classification labels.
As shown in Fig. 2, Fig. 2(a) is an example of a simple clustering dataset: the cells are divided into two types, the boundaries in the PC1-PC2 plane are clear, and the data points follow a roughly normal distribution. Fig. 2(b) is an example of a dataset simulating the differentiation process: cells are divided into ten categories, and the direction of differentiation in the PC1-PC2 plane is clear, but the boundaries are not. Fig. 2(c) is an example of a dataset with expression dropouts: the boundaries in the PC1-PC2 plane are not obvious, and the distribution of each type of data point is uneven and dispersed.
For simple clustering datasets, one is generally interested in studying mixed cell populations and understanding the cell types present or the differences between them. We set the parameters nGenes = 2048 and method = "groups", the number of types is set to 5/10/20/30/50, and the number of cells in each type is generated randomly.
For cell clustering datasets that simulate the differentiation process (complex clustering datasets), one type of cell is generally being transformed into another. Splatter approximates this process by simulating a series of steps between two groups and randomly assigning each cell to a step. We set the parameters nGenes = 2048, method = "paths", and de.prob = 0.2, the number of types is set to 10/25, and the number of cells in each type is generated randomly.
For datasets with dropouts in cellular RNA expression, part of the gene expression in a cell is generally not detected, which is reflected in the sparseness of the expression matrix. We set the parameters nGenes = 2048, dropout.type = "experiment", dropout.shape = -1, and dropout.mid = 0/1/2/3, the number of types is set to 10, and the number of cells in each type is generated randomly.
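Splatter's experiment-level dropout follows a logistic curve on log expression controlled by dropout.shape and dropout.mid. A simplified numpy re-implementation (our approximation of that model, for intuition about how dropout.mid drives matrix sparsity):

```python
import numpy as np

def simulate_dropout(counts, mid=1.0, shape=-1.0, seed=0):
    # dropout probability = logistic(shape * (log(expr) - mid)); with
    # shape = -1, raising `mid` zeroes out more of the matrix
    rng = np.random.default_rng(seed)
    p_drop = 1.0 / (1.0 + np.exp(-shape * (np.log(counts + 1.0) - mid)))
    return np.where(rng.random(counts.shape) < p_drop, 0.0, counts)

def sparsity(m):
    # fraction of zero entries in the expression matrix
    return float(np.mean(m == 0))

counts = np.random.default_rng(1).poisson(5.0, (100, 200)).astype(float)
low = sparsity(simulate_dropout(counts, mid=0.0))
high = sparsity(simulate_dropout(counts, mid=3.0))
```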
In addition, we used two existing datasets to evaluate scCAE. One is a dataset from 10x Genomics (http://10xgenomics.com) containing 33K PBMCs from a healthy donor, which we use to evaluate the time cost of our method. The other is scRNA-seq data from the precise isolation of circulating tumor cells (CTCs) from blood samples in our lab, in which a total of 3,243 cells were sequenced with a median of 3,226 genes per cell; there were approximately 17 CTCs in total. By analyzing these existing datasets, we aim to better assess the practicality of scCAE.

Results
Before analyzing cell-type composition, appropriate parameters for the upsampling layer of scCAE and scCAE-nonImputer must be determined. We adjusted the weights and conducted experiments on the cell clustering datasets that simulate the differentiation process (complex clustering datasets), as shown in Fig. 3. Under the constraint a + b = 1, weight a was increased continuously from 0.5 to 0.99. A higher value of a means the upsampling input is influenced more by the downsampling output of the convolutional network and less by the output of the autoencoder. The experiments show that in the scCAE-nonImputer model, ARI increases with a for a in [0.5, 0.8]; for a in [0.8, 0.99], ARI drops sharply with increasing a, indicating that the model's dimensionality reduction capability is sharply reduced by these parameter values, relying more on the direct connection between the downsampling and upsampling layers than on the autoencoder structure. With the scCAE model, a double peak appeared: local optima of ARI were obtained at a = 0.6 and a = 0.8, with the global optimum at a = 0.8. We conclude that with a = 0.8 and b = 0.2, i.e., X_3(s) = 0.8 X_1(s) + 0.2 X_2(s), scCAE and scCAE-nonImputer perform best. We therefore suggest setting a = 0.8 and b = 0.2 in the scCAE and scCAE-nonImputer models; subsequent experiments in this paper also use these parameters.
After the parameters were determined, scCAE, scCAE-nonImputer, and other existing methods were tested and compared on the simple clustering datasets, yielding the box plot in Fig. 4 based on the ARI results.
To ensure the reliability of the experimental results, we used several methods to perform cluster analysis on simulated simple clustering datasets generated with 5, 10, 20, 30, and 50 clusters, respectively. According to Fig. 4, on the simple datasets, when the actual number of clusters is 5, every algorithm achieves satisfactory results: scCAE, scCAE-nonImputer, PCA, and scScope differ little, while AE performs poorly. When the number of clusters is 10, all of the above methods perform well, and AE does even better than at 5 clusters. When the number of clusters is 20, the results of scCAE-nonImputer worsen, but scCAE still performs well. When the number of clusters is 30, the results of DCA drop sharply, and every method begins to show a significant decrease in stability across repeated experiments. When the number of clusters is 50, the performance of scCAE deteriorates sharply; PCA and scScope also perform poorly, though the trend is slightly less pronounced than for scCAE. We conclude that on the simple datasets, scCAE-nonImputer achieves satisfactory results for 5 to 20 clusters, but as the number of clusters continues to increase it is no longer applicable, possibly because its network depth is shallow. For scCAE this problem is significantly alleviated, probably because the imputer adds an RNN structure to the original network and thereby deepens the hidden layers.
The evaluation statistics such as ARI and NMI obtained from the above experiments are tabulated in Table 1. From the table, when the number of clusters is 5 to 20, the evaluation results of the methods are similar. When the number is 30, the ARI and NMI of scCAE-nonImputer are significantly lower than those of the other algorithms, its SW approaches 0, and its DBI (Davies-Bouldin Index) is higher, meaning its results are not ideal at this point. When the number is 50, the ARI of scCAE differs from that of PCA and scScope by only 1%, and the NMI by about 3%; its DBI is even better than PCA's. From the ARI, NMI, SW, and DBI analysis of the simple-dataset experiments, scCAE-nonImputer performs well when the number of clusters is small. The scCAE with imputer performs better, similarly to scScope, and still lags slightly behind PCA. However, since the cluster boundaries of the simple datasets are obvious, the inference requirements are simple and PCA has an advantage, so the results of scCAE and scCAE-nonImputer should not be dismissed.
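Since ARI is the headline metric in these tables, a compact stdlib-only implementation of the adjusted Rand index (the standard contingency-table formula) may be useful for reproducing such scores:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    # ARI = (Index - Expected) / (Max - Expected) over pair counts;
    # undefined (0/0) when both labelings put everything in one cluster
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

ARI is invariant to label permutation: a perfect clustering scores 1 even when the cluster IDs are swapped.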
However, the simple clustering datasets remain far from real datasets: they simulate neither the relationships between cell sets or subsets nor the trajectory of cell differentiation, their boundaries are too clear, and the distribution of cells is not sufficiently realistic. It is therefore necessary to evaluate on complex clustering datasets that simulate the differentiation process. In this paper, experiments were conducted with 10 and 25 clusters, and the results of multiple runs were recorded for each cluster number. Fig. 5 presents the ARI obtained from each group of experiments.
As shown in the figure, when the number of clusters is 10, scCAE performs best: its optimal ARI exceeds 0.7, and its overall performance is around 0.7. It is followed by scScope, whose overall ARI is slightly below 0.7. Although the ARI of AE can exceed 0.7, its results are too scattered in the figure and therefore very unstable. The ARIs of scCAE-nonImputer and PCA are generally around 0.65, with very stable experimental results. The performance of SAUCIE and DCA is far from impressive. When the number of clusters is 25, which is close to typical large scRNA-seq datasets, scCAE still performs best. scCAE-nonImputer performs similarly to the 10-cluster case, but the ARI of scScope drops significantly, with poor and extremely unstable clustering results; PCA decreases slightly, and its stability weakens slightly.
The ARI, NMI, SW, and DBI statistics obtained from these experiments are summarized in Table 2. When the number of clusters is 10, scCAE's ARI, NMI, and SW all show that it produces the best analysis; scScope performs slightly worse than scCAE-nonImputer, and PCA is similar to scCAE-nonImputer on these measures. When the number of clusters is 25, the datasets are very complex with many cell types, which further tests these methods. Here our method, scCAE, still performs best, with an ARI 7 percentage points higher than scScope and 6 percentage points higher than PCA. Moreover, its SW is even better, reaching 0.2665, the highest among these methods. Similarly, scCAE-nonImputer achieves very good results. In summary, scCAE achieves outstanding results on the complex clustering task of simulated single-cell RNA sequencing datasets, outperforming existing methods.
Single-cell RNA sequencing data usually contain dropouts and noise. To evaluate our method's ability to adapt to and compensate for missing values, we simulated expression matrices of various sparsities with Splatter and ran the different methods on these datasets, as shown in Fig. 6(a). When the sparsity of the input expression matrix is 23%, the ARI of most methods is above 0.9, and the ARI of scCAE and scCAE-nonImputer can even reach one. At 34% sparsity, DCA, AE, and PCA are affected by the increased sparsity, and their stability suffers more strongly, whereas scCAE and scCAE-nonImputer are hardly affected. At 45% sparsity, the performance of scCAE-nonImputer drops significantly but scCAE still performs well; we attribute this to scCAE-nonImputer's lack of an imputer. As the sparsity of the matrix continues to increase, i.e., as dropouts become more numerous, scCAE remains the best-performing of the methods tested: in terms of ARI, scScope and scCAE are similar, but the other methods are greatly affected. We conclude that when there are few missing values in the scRNA-seq data, both scCAE and existing methods analyze the data well; however, when missing values are abundant, scCAE is far superior to the other methods and slightly better than scScope.
To visualize the clustering results under high sparsity, this paper plots the results of several methods at sparsities of 58% and 72% with 10 clusters, as shown in Fig. 6(b). When the sparsity of the expression matrix is 58%, only scCAE infers 10 clusters, consistent with the ground truth, while PCA, AE, and scScope infer 9, 13, and 11, respectively. scCAE performs well, although a small number of cells are misclassified. When the sparsity of the expression matrix is 72%, scCAE infers 11 clusters, close to the ground truth. Visually, the ground-truth classification resembles vertical layering, and the result of our method is closest to it, whereas the basic cluster outlines can no longer be discerned in the visualizations of AE and scScope. From these visualization results, we conclude that scCAE performs outstandingly in analyzing cell-type composition from single-cell transcriptomic datasets with dropouts.
This paper compares the time needed to analyze single-cell RNA sequencing datasets of multiple sizes using scCAE and several existing algorithms, as shown in Tab. 3. The experiment extracts 10K, 50K, 100K, 500K, and 1M cells from the 10X Genomics dataset; we ran the methods on a server with an Intel Xeon E5 CPU and a GTX 1080ti GPU. The table shows that scCAE runs very fast, slightly slower than PCA but faster than scScope and the other methods. In addition, we analyzed the memory cost of four neural-network algorithms, scCAE, scCAE-nonImputer, AE, and scScope, as shown in Tab. 4. By feeding [1, 2048] inputs to each network, we obtain the number of parameters used and the size of each network. scCAE-nonImputer has the fewest parameters, only 677,824; since scCAE adds the imputer, its parameter count reaches 2,110,336. Even so, this is still less than half the number of scScope's network parameters, 5,353,664. In terms of memory size, scCAE occupies very little, about half of scScope. We attribute this speed increase and memory decrease to the convolutional layer structure of scCAE.
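The overhead argument comes down to parameter counts: a shared 1-D convolution kernel scales with kernel width and channel count rather than input length, unlike a dense layer. Illustrative arithmetic (the layer sizes here are our assumptions, not scCAE's exact configuration):

```python
# dense layer mapping a length-2048 input to 128 units: weights + biases
n_in, n_hidden = 2048, 128
dense_params = n_in * n_hidden + n_hidden        # 262,272 parameters

# 1-D conv layer: 64 kernels of width 8 over 1 input channel, plus biases;
# the kernels are shared across all 2048 input positions
kernels, width, in_ch = 64, 8, 1
conv_params = kernels * width * in_ch + kernels  # 576 parameters
```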
To verify the clustering accuracy of the scCAE method on real datasets, we applied scCAE to analyze cell-type composition from a single-cell transcriptomic dataset containing CTCs, collected in our laboratory. We have developed a microfluidic device with an innovative hydrodynamic structure to separate circulating tumor cells (CTCs) from the blood samples of cancer patients. The device demonstrated excellent cell enrichment performance: the isolated cell solution contained high-purity CTCs and some white blood cells while maintaining very high viability (>98%), which is highly beneficial for single-cell sequencing. The isolated cell solution was split into two tubes, one used for immunostaining and the other for single-cell RNA sequencing. The immunostaining results confirmed that 17 CTCs existed among approximately 3,200 cells. (Currently, the method and dataset are being submitted.) Even so, such a small proportion of CTCs is not easily analyzed in cell samples; we tried many other solutions, none of which handled it well. We hoped that scCAE could distinguish the CTC cell type well, maintaining the integrity and accuracy of the CTCs while preserving the integrity and rationality of the other cell populations. The visualization results in Fig. 7(a) show that our method accurately identifies the CTCs (Class 12, in the red box). In Class 12, the expression of KRT8, KRT18, PPBP, and other gene markers was significantly up-regulated, as shown in Fig. 7(b). Epithelial markers such as Keratin 8 (KRT8) and Keratin 18 (KRT18) are commonly used to identify cancer cells. Taking into account the number of cells, we firmly believe that these cells are CTCs. In addition, according to Fig. 7(b), scCAE reasonably distinguishes the other cell types.
The good performance of scCAE on the CTC dataset again demonstrates that the method has excellent adaptability and is particularly suitable for discovering small cell sets or subsets based on expression differences between cells.