VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder

Single cell RNA sequencing (scRNA-seq) is a powerful technique to analyze the transcriptomic heterogeneities in single cell level. It is an important step for studying cell sub-populations and lineages based on scRNA-seq data by finding an effective low-dimensional representation and visualization of the original data. The scRNA-seq data are much noiser than traditional bulk RNA-Seq: in the single cell level, the transcriptional fluctuations are much larger than the average of a cell population and the low amount of RNA transcripts will increase the rate of technical dropout events. In this study, we proposed VASC (deep Variational Autoencoder for scRNA-seq data), a deep multi-layer generative model, for the unsupervised dimension reduction and visualization of scRNA-seq data. It can explicitly model the dropout events and find the nonlinear hierarchical feature representations of the original data. Tested on twenty datasets, VASC shows superior performances in most cases and broader dataset compatibility compared with four state-of-the-art dimension reduction methods. Then, for a case study of pre-implantation embryos, VASC successfully re-establishes the cell dynamics and identifies several candidate marker genes associated with the early embryo development.

of cells were formed, but at least four clusters were composed of more than one cell types.
(The points of different cell types were stacked together for possible misleading visualization) This result was undesirable because the cells from different types should not compactly cluster together. Instead, VASC formed eight compact clusters of cells from the same cell type.
The other three cell types, GW16, GW21 and GW21+3 (originally sampled from the germinal zone of human cortex at gestational week 16, 21 and cultured for another three weeks respectively), were distributed more decentralized than the others. These cells, along with NPCs (neural progenitor cells), are all neural cells, and were more closely represented by VASC, which should be more reasonable. Then, the performances of these methods were quantitatively assessed by comparing the cell sub-populations in the reduced subspaces (the sub-populations were identified by k-means clustering [24]) with the true cell type labels annotated by the original references.
Four different indexes -NMI (normalized mutual information score) [25], ARI (adjusted rand index) [26], HOM (homogeneity) and COM (completeness) [27] were used to quantitatively assess the clustering performances. (Please see the Methods section for details) These comparisons showed that VASC outperformed the compared methods in terms of NMIs and ARIs in most cases (best performances on 15 and 17 datasets, respectively) ( Figure 3). And, VASC always ranked in the top two methods on all the tested datasets, which suggested that it has broad compatibility for various kinds of scRNA-seq datasets. Please see the detailed results in the supplementary materials.

Analysis of the model stability and parameter setting
In this section, we analyzed the stability and several parameter settings of VASC. Firstly, we analyzed the model fitting processes of VASC on two datasets, the Pollen and Biase datasets (with 301 and 56 cells, respectively). The loss function decreased sharply during the first a few epochs, and simultaneously, NMI/ARI values increased sharply (Figures 4a & 4b). After the first 100 epochs, the loss curves quickly converged to a lower limit and the loss fluctuations of the dataset with more samples (Pollen) were smaller than the one with fewer samples (Biase). Based on these observations, the stopping criterion for VASC was set as that the loss function had no obvious decrease within 50 epochs (see details in Methods).
Next, the stability of VASC was analyzed. Due to the randomness of the stochastic gradient descent method, the model initialization and the of k-means clustering, different runs generated slight different results. The four datasets with the fewest sample sizes, Biase (56 samples), Goolam (124), Pollen (301) and Yan (90) were used to test the stability of VASC by 20 repeated runs. As expected, the two dataset sets with more samples (Goolam & Pollen) showed much higher consistent results (Figure 4c). The following down-sampling analysis of the Pollen dataset also suggested the same trend ( Figure 4d). The NMIs of the Biase dataset were almost distributed between two values. Considering the relatively small number of cells (only 56 samples), this distribution may be caused by the turnover of just one or two cells at the boundary. The similar result was also observed for the Yan dataset. However, the Goolam and Pollen datasets with more cells did not show this pattern. Then, the down-sampling experiment based on the Pollen dataset was implemented to further test the method's stability.
The dataset was bootstrapped with 10%, 30%, 50%, 70%, 90% and 100% cells also with 20 repeated runs. It is observed that the NMIs and ARIs dropped if the number of samples were too small and their variations increased accordingly. But, VASC get comparable results if the percentage of sampled cells was above 50%.
Then, the effectiveness of the ZI layer, which modeled the dropout event, was assessed.
Results showed that it improved both the stability and the average performances ( Figure 4c).
The data projection to a 2D subspace is suitable for the visualization, but the subspace with higher dimension may explain more variations. In the model, the dimensions of the latent variables were changed from 2 to 20 for the Pollen dataset. Results showed that the increase of the dimensions did not improve the identification of known cell populations and the subspaces with too high dimensions even caused worse results ( Figure 4e).

Case study: human pre-implantation embryos
The scRNA-seq is very useful for studying the cell dynamics during embryo pre-implantation development. We applied VASC on a recently published dataset of human pre-implantation embryos (the Petropoulus dataset), including 1,529 cells with detailed annotations of developmental stages, inferred lineage and inferred pseudo-time information [28]. The 2D visualizations showed that VASC and t-SNE recovered the known developmental stages (form E3 to E7) more precisely, with the exception that the E3 cells were out of the trajectory by t-SNE. Both PCA and ZIFA generally recovered the stage trajectory but the E6 and E7 cells were largely overlapped. SIMLR, which emphasized the modularity of cell populations, did not re-establish the basic shapes. (Figures 5a-e) Next, we investigated other annotations based on the visualizations. A sharper split was found in the E5 cells by VASC, compared with the t-SNE result. We re-annotated the cells with their lineages instead of the developmental stages. It is found that the sharp split learned by VASC was a good separation of the pre-lineage cells from the others (Figure 5f). The inner cell mass (ICM), including the PE (primitive endoderm) and EPI (epiblast), were split from TE (trophectoderm), and the boundary was almost perpendicular to the direction of developmental stage (Figure 5f). It was also found the two sub-populations of the TE cells, mural and polar cells, can be separated in the visualization (Figure 5g). Finally, the trajectory recovered by VASC was strongly coincided with the inferred pseudo time (Figure 5h).
The candidate genes associated with the embryo pre-implantation development were identified by calculating the Spearman's correlations between the gene expressions and the two features in the reduced subspace. Many known regulators and markers were found in the top correlated genes, such as PGF, GCM1, CYP19A1, MUC15, CD24, CCR7, GREM2, CGA, GATA2, TDGF1, ESRG, GDF3 and DNMT3L mentioned in the dataset article (rank <= 100 for either feature). Interestingly, the top-ranked genes were significantly enriched in metabolic processes, such as "carbohydrate derivative metabolic process" (37 genes, q-value 5.63e-05 by DAVID 6.8 [29]), "oxidation-reduction process" (32 genes, 4.87e-05) and "lipid metabolic process" (32 genes, q-value 4.94e-03). Recent technique advances found several metabolic pathways play as essential factors for regulating stem cells' stemness and differentiation [30]. These analyses identified several interesting candidate genes involved in different metabolic processes: for example, CYP11A1 (a member of the cytochrome P450 superfamily of enzymes, the sample superfamily of CYP19A1), NR2F2 (a member of the steroid thyroid hormone superfamily of nuclear receptors), PKM (a pyruvate kinase that catalyzes the transfer of a phosphoryl group from phosphoenolpyruvate to ADP, generating ATP and pyruvate, a key kinase for glycolysis), PPARG (a member of the peroxisome proliferator-activated receptor subfamily of nuclear receptors) and IDH1 (an isocitrate dehydrogenase, key enzyme for cytoplasmic NADPH production). There are two parameters (the mean vector and the co-variance matrix) in the variational distribution Q(z|X). When the sample size is small, it is better to fix the co-variance matrix.

Discussion
But, when the size is large enough (above 1,000 according to our preliminary experiments), a co-variance matrix learnt from the data can get better results. It is expected that more complex variational distribution families can be tested in near future, as the sample size of scRNA-seq dataset is quickly increasing.
It is found that the inclusion of zero-inflation (ZI) layer improves the representation of VASC in terms of recovering the known cell types. Compared with ZIFA, the Gumbel distribution used by the ZI layer didn't generate strictly zeroes, which may additionally model the near-zero dropout events, which is proposed as a limitation of ZIFA [6].

Conclusions
In this study, a dimension reduction method VASC was developed for scRNA-seq data visualization and analysis. We systematically compared VASC with four state-of-the-art dimension reduction methods on twenty datasets. Results show that VASC achieves superior performances in most cases and is broadly suitable for different datasets with different data structures in the original space. The application on a dataset of the human pre-implantation embryo development shows that VASC can re-establish the cell dynamics in the reduced 2D-subspace and identify the associated marker genes.

Variational autoencoder
VASC is a deep variational autoencoder(VAE) based generative model and is designed for the visualization and low-dimensional representation analysis of scRNA-seq data. The variational autoencoder (VAE) aims to model the distribution P(X) of data points in a high-dimensional space ߯ , with the aid of low dimensional latent variables z. The whole model could be divided into two procedures, the first of which is to generate the samples of z in the latent low-dimensional subspace, and the second is to map them to the original space ߯ . The critical point is to generate z having the high probability to recover the observed data matrix X. It means that the generated z may possibly capture the intrinsic information of the original data.
Theoretically, the best choice to generate z is the posterior distribution P(z|X), which however, is usually too complicated and intractable. VAE tries to use a varitional probability Q(z|X) to approximate it, which is optimized to minimize the Kullback-Leibler(KL) divergence between Q(z|X) and P(z|X): By applying the Bayes rule and rearranging the order, it can be re-written as:

Performance assessment
To measure the quality of low-dimensional representation, k-means clustering was applied to the 2D representations of all the methods and compared the clustering results with known cell types provided by their original references. The number of clusters, k, was set to number of known cell types. Four measure indices were used to assess the performances.

Normalized mutual information (NMI).[25]
Suppose P is the predicted clustering results, and T is the known cell types (the same below). Denote the entropy of P and T as H(P) and H(T), respectively, and the mutual information between them as MI(P,T). NMI is computed as:

Adjusted rand index (ARI).[26]
Suppose n is the total number of samples, For each dataset, we considered four state-of-the-art dimension reduction methods -PCA [3], t-SNE [4], ZIFA [6] and SIMLR [7]. For all the methods, no gene filtering was used and the same log-2 transformation was applied. For PCA and t-SNE, we used the built-in python sklearn package functions. For the datasets more than 500 cells, we firstly applied a PCA transformation with 500 dimensions before t-SNE. The key parameter of t-SNE -perplexity was set to 0.2 times the number of cells as the paper [28] suggested. For ZIFA, we downloaded their package and used their block_ZIFA module because of the large number of genes. For SIMLR, we used their R package. For benchmarking the dimension reduction performance, k-means was used to get the predicted cell types based on their 2D representations.

Additional files
Supplementary materials include more details about VASC and supplementary figures.

Decoder network
Candidate recovered profileỸ