Computational assignment of cell-cycle stage from single-cell transcriptome data

The transcriptome of single cells can reveal important information about cellular states and heterogeneity within populations of cells. Recently, single-cell RNA-sequencing has facilitated expression proﬁling of large numbers of single cells in parallel. To fully exploit these data, it is critical that suitable computational approaches are developed. One key challenge, especially pertinent when considering

dividing populations of cells, is to understand the cell-cycle stage of each captured cell. Here we describe and compare five established supervised machine learning methods and a custom-built predictor for allocating cells to their cell-cycle stage on the basis of their transcriptome. In particular, we assess the impact of different normalization strategies and the usage of prior knowledge on the predictive power of the classifiers. We tested the methods on previously published datasets and found that a PCA-based approach and the custom predictor performed best. Moreover, our analysis shows that the performance depends strongly on normalization and the usage of prior knowledge. Only by leveraging prior knowledge in form of cell-cycle annotated genes and by preprocessing the data using a rank-based normalization, is it possible to robustly capture the transcriptional cell-cycle signature across different cell types, organisms and experimental protocols.

Introduction
to fully understand a number of different biological problems.
So far, information about cell cycle stage has largely been obtained by experimental approaches. For instance, cells can be treated with chemicals to induce cell-cycle arrest in a specific phase [10]. Alternatively, cell sorting methods can be used to stratify cells by size (counterflow centrifugation elutriation [11]) or DNA content (e.g., Hoechst staining [12]), which facilitates enrichment of cells in different stages of the cell cycle. Alternatively, strategies based on genetic manipulation through insertion of fluorescent probes in genes that are differentially expressed in different cell-cycle stages (e.g., FUCCI technique [13]) can be employed. However, these approaches have major drawbacks as they can be very labour extensive and, due to their invasive nature, have the potential to disturb the biological system substantially (e.g., cell-cycle arrest can have a large impact on differentiation potential).
In the context of scRNA-Seq experiments, the transcriptome data itself provides informative cues about the cell cycle stage of individual cells [14,15]. In particular, genome-wide transcriptome data provides information on the expression levels of informative cell-cycle marker genes, which have been carefully annotated in several systems and cell types (e.g., in human, yeast and Arabidopsis [16]). Consequently, we reasoned that these genes can be used to infer the cell cycle phase directly from the transcriptome. Such an approach would be complementary to experimental sorting procedures and could help reduce biases that might arise from more invasive experimental techniques. Moreover, the cell-cycle structure of unsorted populations of cells profiled by scRNA-seq could be investigated.
While strategies have been developed to remove cell cycle variation from scRNAseq data without inferring the cell-cycle stage in order to improve the detection of sub-populations of cells [14], and computational analyses have been used to distinguish cycling from quiescent cells [17], the possibility of explicitly predicting the cell-cycle stage of cells from their transcriptome has not yet been explored. In this paper we analyze six supervised computational methods to predict G1, S or G2M phase given the transcriptome of a cell. We train each algorithm on a recently published scRNA-seq dataset where cell-cycle information is available from experimental annotation [14], and we assess their performance on scRNA-seq datasets generated from a variety of cell types and organisms.

Material and Methods
We use a supervised machine learning approach to evaluate the ability of six algorithms to predict the unobserved cell cycle stage of a cell from its transcriptome profile. These include five established supervise machine learning approaches as well as a custom-built predictor. Each algorithm was trained on the same scRNA-seq dataset where the cell-cycle stage of each cell was known. Additionally, different sets of cell-cycle annotated genes were used to build each classifier. A schematic overview of our approach is shown in Figure 1. The six prediction algorithms' performance was measured using 10-fold cross-validation on the training dataset and a variety of independent datasets. The predictive power of all classifiers was quantified by calculating the F1-score (harmonic mean of recall and precision), which has been shown to be an effective summary statistic for multi-class classification [18].
In order to quantify how well the predictors perform across all cell-cycle phases, we also calculated the macro-averaged F1 score by taking the average of precision and recall over all cell cycle phases before computing the harmonic mean, so as to make it independent of the number of cells in each phase in the testing dataset.

Prediction algorithms
We compared a total of six classifiers including linear and non-linear predictors as well as one custom method specifically designed for in-silico cell-cycle allocation.
Below, we provide brief details about the methods employed and their implementation.

Random forest
We used the scikit-learn implementation of random forests (ExtraTreesClassifier) [19] and trained 500 trees by minimizing the entropy in the leaves of the individual randomized trees, constructed using a subset of all N features ( √ N ).

Logistic regression and Lasso
Logistic regression was used, both without regularization and with an L1 penalty (lasso) [20]. The lasso penalty was determined using an internal 5-fold cross validation, maximizing the F1 score.

Support vector machines
We used support vector machines with an rbf kernel with feature selection [21]. Kernel parameters were determined using a cross-validated grid search. Due to the large number of variables, feature selection was performed based on a univariate feature ranking [22]. First, for each gene an ANOVA was performed and genes were ranked according to their F-statistic. Next, the best number of features was determined in an integrated cross-validated grid search.
Multi-class classification was performed via a one-vs-one scheme (scikit-learn implementation), allowing for multi-class classification based on standard binary SVMs.

PCA-based classification
Recently, we showed that the first principal component (PC) of a set of annotated cell cycle marker genes is sufficient for constructing a cell-cell covariance matrix, reflecting the cell cycle induced correlation among cells [14]. We therefore evaluated a Gaussian Naive Bayes classifier based on the first PC derived from the set of cell cycle markers. Furthermore, we explored the additional predictive power of higher order PCs (see supplementary figure B.6). Naive Bayes classifiers assign a probability to each data instance based on a set of features by assuming conditional independence of the features. The Gaussian Naive Bayesian classifier assumes a Gaussian distribution of each continuous feature (here PC1 and higher order PCs if applicable) with mean and variance specific to each class (here: cell-cycle phase). Mean and variance parameters are estimated using maximum likelihood as implemented in the scikit-learn framework [19].

Pairs
We developed a classification algorithm based on the idea of the relative expression of "marker pairs", which is also exploited in top scoring pairs classifier, developed for classifying cancer types based on microarray data [23,24,25]. The algorithm selects pairs of genes whose relative expression has a sign that changes with the cell-cycle phase. These pairs can then be used to quantify the evidence that a given cell is in G1, S or G2M phase, as described in the Appendix A. Since only the sign of the relative expression of gene pairs in the same cell is used, this method does not require any normalization for sequencing depth.

Selection of cell-cycle marker genes
To establish a set of cell cycle annotated genes, we combine all genes annotated to cell cycle in the Gene Ontology database (GO:0007049) [26] along with the 600 topranked genes from CycleBase [27,16]. Furthermore, we construct an informative set of cell cycle marker genes, by excluding those genes whose variation was below the technical noise in the training dataset (see section 2.3). In addition, we demonstrate the benefits of using prior knowledge by evaluating the performance of a classifier based on the complete, unbiased set of expressed genes.

Data post-processing
To obtain gene expression values that are comparable across a wide range of protocols, we used a rank normalisation approach: for each cell we ranked the expression values (FPKM, RPKM or normalized with size factors as in the respective primary publication) of the set of genes used for training from lowest to highest. We then used these rank-normalized gene-expression values as input for all algorithms. In addition, we explored an alternative normalisation strategy where the data from each cell was normalized with the total number of reads mapped to the gene set used for prediction.

Training data set
We trained all classifiers on a recently published single-cell RNA-seq dataset comprised of 182 mouse embryonic stem cells (mESCs) with known cell-cycle phase [14]. In brief, Rex1-GFP-expressing mESCs (Rex1-GFP mESCs) were cultured using serum-free NDiff 227 medium (Stem Cells Inc.) supplemented with 2i inhibitors.
Hoechst staining (Hoechst 33342; Invitrogen) was optimized for Rex1-GFP mESC, and cells were sorted using FACS for three different cell-cycle phases (G1, S and G2M phase). Next, single-cell RNA-seq was performed using the C1 Single Cell To establish a set of informative cell-cycle marker genes, we determined the intersection of annotated cell cycle marker genes with the set of variable genes in the Hoechst-stained mESCs. We further reduced the set of informative cell cycle genes by determining the set of genes with variation above the technical background level for an additional single-cell RNA-seq dataset [12]. This resulted in a smaller set of 405 genes. The rank-normalised expression of these informative cell cycle markers for 182 cells constitutes the training data.

Datasets with cell cycle information
We tested the performance of all predictors on a variety of independent data-sets with known ground truth.

Mouse mESCs data (Quartz-Seq protocol)
We used the normalized gene expression data from the primary publication [12]. In brief, mESCs were FACS sorted into G1, S and G2M phases based on their Hoechst 33342-stained cell area. Next, seven S, eight G2M and 20 G1 cells were sequenced using the Quartz-seq protocol and gene expression was normalized to FPKM values.
Due to the lack of spike-ins, we estimated the amount of technical (null) noise expected for genes with variable levels of expression using a log-linear fit between the expression mean and the squared coefficient of variation between cells [14,28].
This approach resulted in a total of 5,546 highly variable genes.

Human leukemia cells (bulk)
We analysed data from bulk human myeloid leukemia cells [11]. Cells were assigned to cell-cycle stages (G1, S and G2M) using centrifugal elutriation and mRNA expression was quantified using RNA-seq.

Bulk mESCs
mESCs were stained with Hoechst 33342 and FACS sorted for cell cycle stages (G1, S and G2M). Approx 150,000-300,000 cells from an asynchronous population and from each cell cycle fractions (G1, S and G2M) were used for bulk mRNA sequencing, with librariues being generated using the Illumina TruSeq Stranded RNA Sample preparation kit. All libraries were prepared and sequenced using the Wellcome Trust Sanger Institute sample preparation pipeline. Sequencing quality control and data quality checks were performed by the Sanger Sequencing facility. Downstream data analysis (Alignment, Mapping and counting reads) was performed as described [14].

Datasets without cell cycle information 2.5.1 Liver cells
We tested the algorithms on two independently generated sets of individually sequenced liver cells, one previously published [29], one generated for this study (see Appendix A). Since most liver cells do not proliferate (see, e.g., [30]), they are expected to be in G1 phase.
Smart-Seq protocol. We used normalized gene expression data of five liver cells from the primary publication [29]. All cells were sequenced using the Smart-Seq protocol.
C1 protocol. In addition we generated the individual transcriptomes of 70 liver cells using the Fluidigm C1 platform. In brief, a suspension of cells was prepared from the liver of a 14-week old B6CastF1 (C57Bl/6J mother x CAST/Ei father) female mouse and loaded onto a 10-17 µm C1 Single-Cell Auto Prep IFC (Fluidigm), and cell capture was performed according to the manufacturer's instructions (see Appendix A for the detailed protocol).
Paired-end reads were mapped simultaneously to the M. musculus genome (Ensembl version 38.75) and the ERCC sequences using GSNAP (version 2014-05-15) with default parameters. Htseq-count [31] was used to count the number of reads mapped to each gene (default options).

Blastomeres
We applied our algorithm to the transcriptomes of a total of 30 individual cells dissociated from early, mid and late 2-cell stage mouse embryos and sequenced using the Smart-seq protocol [29]. Most of these cells are expected to be in G2 phase, as blastomeres from 2-cell stage embryos have a very short G1 phase and spend more than half of their cell-cycle in G2 phase [32,33].

T-cells
Single-cell RNAseq data -Finally, we applied our approach to 81 T-cells [34].
We used normalized gene expression data as in [14]. We evaluated our algorithm by comparing the fraction of cells assigned to individual cell-cycle phases in silico to the respective fractions obtained from flow cytometry analysis of Ruby-stained T-cells. First, we assessed the performance of the different prediction algorithms as well as all sets of marker genes using a cross-validation approach. In this "holdout" experiment a fraction of the training data is removed when fitting the model, which is then applied to the withheld data to assess its performance. This resulted in high precision and recall for all cell cycle phases ( Fig. 2.a-c) for all classifiers and gene sets, indicating that all models fit the data well and are able to generalize well when applied to the same type of data (i.e., mESCs cultured in 2i+LIF).
Poor generalizability to independent test data for many methods but PCA and pairs method To assess how well the different approaches generalize to independent data sets, we tested the six predictors derived on an independent test set of 35 mESCs sequenced using a different protocol (Quartz-seq) and cultured in a different medium (serum; see Fig. 2.d-f). Two general features are shared by all predictors: all of them perform worst on S phase prediction (see also below), and their overall predictive power substantially increases when they are trained on cell-cycle annotated genes ( Fig. 2.e-f), which indicates the importance of the inclusion of prior information. The best performance on the independent mESC test set was achieved by the PCA-based Naive Bayes Classifier and the custom predictor (the "Pairs" method) which had similar predictive power. As all methods yielded good performance on the cross-validation, this indicates that all methods but PCA and Pairs overfit to the training data without being able to generalize to cells from different conditions. For the large set of all variable genes, this over-fitting-effect is particularly strong and occurs for all methods, again reflecting the importance of using prior knowledge.
Alternative normalization strategy results in poor generalizability We also assessed the influence of the normalization step by training the predictors on the total read count normalised gene expression data. This resulted in a notable decrease in performance and highlights the importance of robust normalisation strategies which hold for different experimental protocols ( Supplementary Fig. B.3). While the majority of the most relevant genes are well known markers for specific cell-cycle phases (e.g., Plk1, Aurka, etc.), we could identify several genes that were not previously annotated to any particular stage of the cell cycle but were among the most important for classification. For example, Tmem2 and Tex14, which have the highest negative loadings on PC1 (i.e., the two strongest G1 markers) were not annotated with a specific peak time or phenotype in Cyclebase.

Bulk data
We also applied each approach to predict cell- Consequently, we concluded that of the six predictors evaluated and the gene sets used, the PCA-based and the pairs approaches trained on cell-cycle annotated genes had the best overall performance and, importantly, had the strongest ability to distinguish S-phase cells .

Application to datasets without ground truth
We next applied the PCA-based approach to a variety of datasets including liver cells, T-cells and blastomeres to predict their cell-cycle phases (see Appendix B for the application of the pairs method to the same datasets). For all data sets, scatter plots with G1 and G2M score of individual cells with decision boundaries are shown in Figure 4.

Liver cells
As expected, given the non-proliferative state of most liver cells [30], all profiled liver cells were allocated with a high degree of confidence to G1 phase ( Figure 4; blue colour). This applied to cells profiled in two independent laboratories, suggesting that the PCA-based predictor is relatively robust to technical biases that may arise during sample preparation.

Blastomeres
The cell cycle of blastomeres from 2-cell stage embryos takes about ∼ 20h to complete, with the S-phase starting ∼ 1h after the first mitosis [32] and lasting approximately ∼ 6h. G2 phase is very long and its length varies between ∼ 12 and ∼ 16 hours [33]. Apart from a few cells allocated to G1, most of the blastomeres analysed were predicted to be in S phase by the PCA-based method (see figure 4), and in G2M phase by the pairs method (see figure B.1.A), which agrees better with our prior expectations. Despite the difference in allocation probabilities, the G2M scores from the two methods have a high rank correlation ( Figure B.8)), suggesting that in the PCA-based approach, due to the weak signal for the S phase, the probability for a cell being in S is less reliable than the G1/G2M probability and can also result in a badly calibrated score, in particular for cell types other than those in the training set. The pairs method is less affected by this issue, possibly due to the higher robustness of the signal captured by the relative rankings of pairs of genes [23,24].

Conclusions
We evaluated six computational methods for predicting the cell-cycle stage of single cells from their transcriptome. To train the algorithms, we used a scRNA-seq dataset where cells had been previously sorted by their cell-cycle phase, and exploited available databases of genes known to be involved in cell cycle progression (GO and cyclebase [16]) to optimise the set of predictors. We quantified the predictive power of all methods for a wide range of single cells as well as bulk data from different organisms (mouse and human) and show that a parameter-free PCA-based approach and the custom predictor (the "Pairs" method) performed best and correctly allocated cells from all data sets to their cell-cycle phase.
In order for our method to be broadly applicable to a wide range of experimental protocols, we used a rank-based approach to normalize our data, resulting in a good performance on a large variety of data-set. We also explored total count normalisation where each cell was normalised by the total number of reads mapped to the genes used as input for the predictors. This resulted in a notable decrease in performance for all methods, which may be explained by two factors. First, the cell-cycle signature is considerably weaker without rank normalisation (Supplementary Figure B.3c and B.2). Second, rank-based normalization robustly preserves the cellcycle related information across different experimental protocols and, particularly in combination with PCA, results in a highly regularized cell-cycle allocator with good generalizability across data sets. In contrast, more sophisticated approaches such as random forest or SVM in combination with total count normalised data, easily overfit to a specific data-set.
Similarly, the strong regularization enforced by the PCA explains the good performance of the PCA-based predictor on all datasets. All predictors achieve very high F1-scores in the cross-validation, which indicates that the predictors do not overfit within the training set and would generalize well to similar data-sets generated using the same cell type and experimental protocol. However, in order to be useful in practice, it is crucial that a predictor will also generalize well to different cell types, experimental conditions and sequencing techniques. Here we show that only the PCA-based predictor and the pairs method achieve a strong enough regularization to robustly capture a generalizable cell-cycle signature in the transcriptome.
Interestingly, the signal captured by the pairs method based on the relative rankings of pairs of genes is probably the most robust and generalizable (see also [23,24]), as it is shown by the analysis of the 2-cell stage blastomeres.
Between the three phases, the S phase proved to be the most challenging to identify. This can be explained by the least specific transcriptional signature of S-phase markers at the single-cell level ( Supplementary Fig. B.2) along with the poor resolution of S phase in flow cytometry data affecting both training and testing datasets.
While the dataset we used for training only provided information on cell-cycle phase, without being able to monitor the progress within a given stage, the methods we tested assign a continuous score to each cell that can potentially provide

A.2 The pairs method
We describe below the selection of marker pairs and the computation of the score for the G1 phase with the pairs method. An analogous procedure can be followed for the other two phases.
Selection of marker pairs -The training dataset was used to select pairs of genes whose relative expression levels differ in G1 phase compared to cells in S and G2M. In other words, we found pairs of genes, g 1 and g 2 , such that: Computing the score -Once the marker pairs for G1 have been selected from the training dataset, they can be used to compute a "G1 score" for a given single cell, which is calculated as follows: 1. The number N G1 of "hits" in the list of G1 marker pairs is counted, i.e., the number of marker pairs where the first gene is expressed at a level higher than the second.
2. The probability distribution R(N ) of the number of "hits" with randomised lists of marker pairs is obtained.
3. The G1 score is defined as the probability of getting a number of hits lower than N G1 with a randomised list of markers: Cell-cycle phase allocation -The scores S G1 , S S , S G2M for each of the three phases can be calculated following the procedure described above. Cells can be allocated to the phase that corresponds to the highest score.
However, we found that the method performs better when the allocation is carried out only on the basis of the G1 and G2M scores (see also appendix B.2). Therefore, if either S G1 or S G2M is greater than 0.5, cells are allocated to the phase with the highest score among G1 and G2M. Conversely, if both S G1 , S G2M < 0.5, the cell is allocated to the S phase.  Interestingly, the difference between early (yellow circles) and late (green circles) is also detected, as the average G2M score is higher for the late blastomeres compared to early and mid blastomeres.
Liver cells -Most of them predicted to be in G1, consistently with the expectation ( fig. B.1.B).

B.2 The average expression levels of cell-cycle markers in different phases
We considered the cell-cycle markers listed in Cyclebase and checked their average expression levels in the cells in the different phases in our training dataset. A quantile normalization was carried out before calculating the averages.

B.3 Feature importance
We used the loadings of PC1 as a measure for feature relevance in the PCA-based method. In figure B.4, the top 40 genes with the largest loadings are shown.
In the pairs method, we evaluated the relevance of each gene pair with a score com-  puted as follows: for each given pair of genes g i and g j , we calculated the quantities p ij (C) = P rob(g i > g j ), i.e., the fraction of samples in the training data set annotated to cell-cycle phase C = {G1, S, G2M } where the expression level of gene g i is higher than that of gene g j . The score of a marker pair (g i , g j ) for a phase C 1 is defined as ∆ ij = |p ij (C 1 ) − M ean(p ij (C 2 ), p ij (C 3 ))|. Figure B.5 shows the ten disjoint pairs (i.e., pairs involving different genes, see [24]) with the highest scores in each cell-cycle phase. the cross-validated F1-score on the training data as well as the F1-score on the independent test data for up to five PCs. While the cross-validated F1-score varied little with the inclusion of more PCs, the performance on the test-set degraded in comparison to using only one PC, indicating that while the first PC captures generalizable cell-cycle effects, higher PCs capture more data-set specific properties.
Similarly, we assessed how the performance of the other classifiers (random forest, SVM and logistic regression/lasso) would be affected when including the first 5 PCs as features and found no improvement in the predictive power (data not shown).

B.5 Training on the combined C1 and Quartz-Seq data
We reasoned that by increasing the size of the training data set we would be able to train a classifier with better generalizability. However, the drawback of using both data-sets for training is that no independent single-cell data with known cell-cycle stage set is available for external validation. Therefore, we evaluated the classifiers based on the combined C1 and Quartz-Seq data using 10-fold cross validation and observed similar performance as for using the C1 data for training only (fig. B.7 A).
We then used the classifier trained on the combined mESC data to predict the cellcycle stage for the same single-cell and bulk data-sets as described in the main text.
This again yielded a similar performance as after training on the C1 data only (fig.