Assessment of Single Cell RNA-Seq Normalization Methods

We have assessed the performance of seven normalization methods for single cell RNA-seq using data generated from dilution of RNA samples. Our analyses showed that methods considering spike-in External RNA Control Consortium (ERCC) RNA molecules significantly outperformed those not considering ERCCs. This work provides a guidance of selecting normalization methods to remove technical noise in single cell RNA-seq data.

methods are implemented in the edgeR Bioconductor package. The DESeq (Anders and Huber 2010) method is implemented in the Bioconductor package. For all methods, we ran the R function using default parameters. Remove unwanted variants (RUV) (Risso et al. 2014) is included in the RUVnormalize Bioconductor packages. The model sets up a generalized linear regression model n   Figure S1 in File S1. A, aRNA; N, Nugen Ovation RNASeq V2; S, C1 SMARTer.
between observed RNA-seq read counts and known covariates of interest, along with unknown unwanted variation factors. Different options of RUV, RUVr, and RUVg have different ways to estimate the unwanted variant factors. RUVr uses residuals from a first-pass regression of read counts (Risso et al. 2014), and considers the least differentially expressed genes across samples. RUVg considers negative controls such as External RNA Control Consortium (ERCC) spike-ins, and assumes that they are not differentially expressed across samples. All the R functions were run using default parameters. For empirical undifferentially expressed genes in running RUVr, we chose genes not in the top 6000 differentially expressed genes. GRM (Ding et al. 2015) fits a gamma regression model between FPKM values of reads and the concentration of ERCC spike-ins, and then make estimates of the molecular concentration of the genes from the reads.

Assessment of clustering performance
After normalization, we clustered the normalized gene reads using hierarchical clustering with the Ward method, and the metric was Pearson correlation between normalized gene reads. We assessed the clustering results using the following statistical indices: Rand index: The Rand index (Rand 1971) evaluates the correctness of clustering using prior labels. Given a set of n elements S = fs 1 ; s 2 ; ⋯; s n g; and two partitions needed to be compared, a partition of S into M clusters X ¼ fX 1 ; X 2 ; ⋯; X M g; and a partition of S into N clusters Y ¼ fY 1 ; Y 2 ; ⋯; Y N g; for certain 1 # i; j # nði 6 ¼ jÞ; 1 # k; k 1 ; k 2 # Mðk 1 6 ¼ k 2 Þ; 1 # l; l 1 ; l 2 # Nðl 1 6 ¼ l 2 Þ; a; b; c; and d are defined as follows: Then Rand index can be computed as follows: is the number of agreements between X and Y, while (c + d) is the number of disagreements between X and Y. The higher the Rand index, the more similar the two partitions. As we compared the clustering partitions to the prior known labels, the higher Rand index indicates the better the clustering is.
Dunn index: The Dunn index (Dunn 1974) evaluates the compactness and separation of the clustering. Specifically, given a set of n elements  Figure S2 in File S1). S = fs 1 ; s 2 ; ⋯; s n g; and the clustering partition as z ¼ fC 1 ; C 2 ; ⋯; C K g; the Dunn index is computed as follows: By definition, the higher the Dunn index, the better the clustering, considering enough separation compared to diameter of individual clusters. Here, the distance between two samples is defined as 1 2 the Pearson correlation coefficient between two samples.
Jaccard index: The Jaccard index evaluates the stability of clustering that measures the similarity between two finite subsets (Jaccard 1912a;Tan et al. 2006). Given two subsets, A and B, the Jaccard index is computed as: We used the Flexible Procedures for Clustering (fpc) package in R to calculate the Jaccard index, with random subsetting without replacement of samples. Each time a subset of samples was drawn, we clustered the samples, and then computed the maximum Jaccard index between the new cluster and the prior known cluster. Repeating B times for drawing B random subsets (here we chose B = 100), we then computed the mean of all the maximum Jaccard index (Hennig 2007). The higher the Jaccard index, the more robust the clustering.

Data availability
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

RESULTS AND DISCUSSION
We selected seven methods to normalize the data of the 120 RNA-seq samples. These methods fall into two categories, depending on whether or not spike-in ERCC molecules were taken into consideration in normalization. Methods for the details of setup for running each method). A total of 13,375 genes was selected with log(fpkm) .2 in at least two of the 114 nonbulk samples (100 or 10 pg). Using Pearson correlation as the similarity metric, we clustered all 120 samples (114 nonbulk and six bulk samples) using hierarchical clustering (Figure 1 and Supplemental Material, Figure S1 in File S1). For all methods, the UHR and HBR samples were well separated as expected, and we thus focused on evaluating how well the HBR and UHR samples were further clustered. In either the UHR or HBR group, samples normalized by FPKM, UQ, TMM, and DESeq tended to cluster by sequencing protocol rather than RNA amount, which indicates that the differences introduced by the protocols were not completely removed by normalization. In contrast, the HBR samples normalized by RUVr were largely clustered by RNA amount rather than sequencing protocol, although a significant portion of 10 pg samples sequenced using aRNA were clustered separately (Figure 1 and Figure  S1 in File S1). Qualitatively, RUVr normalization alleviates the difference between scRNA-seq protocols and the clustering results are closer to the ground truth than the other methods, i.e., the samples are clustered based on the source (HBR) and the RNA amount (bulk, 100 or 10 pg). However, the UHR samples normalized with RUVr were still clustered according to protocols rather than RNA amount. The other four methods showed worse clustering results than RUVr because 10 and 100 pg samples were mixed in each protocol group (Figure 1 and Figure S1 in File S1).

Evaluation of the methods not considering ERCC
To quantify the similarity between the clusters generated from different normalization methods and the ground truth, we cutthe clusters at different hierarchy levels, and calculated the Rand index between the clusters and the ground truth, which reflects the correctness of clustering. The Rand index is the ratio between the sample pairs that are correctly clustered and the total sample pairs (Rand 1971). Because all the methods successfully separated HBR and UHR samples, as discussed above, we calculated the Rand index on HBR and UHR samples separately. For either the HBR or UHR samples, we varied the cutoffs to cut the hierarchical trees into two to seven clusters, and calculated Rand index for each cutoff (Figure 2A and Figure S2 in File S1). RUVr clearly outperformed the other methods.
Next, we evaluated the compactness and separation of clusters using the Dunn index, which is the ratio of the smallest distance between clusters to the largest intracluster distance, and the distance between the two samples is defined as 1 2 the Pearson correlation between the samples (Dunn 1973). We computed the Dunn index on HBR and UHR samples separately. RUVr had a much higher Dunn index than the other methods when all the samples were clustered into two groups. For more clusters, all the methods had comparable Dunn index. This observation suggests that the separation between clusters is insensitive to normalization methods when the 120 samples were clustered into more than three clusters.
It is important to examine whether the clustering results are robust to the selection of samples and genes. We first divided HBR or UHR samples into three clusters: bulk, 100 and 10 pg. Then, we randomly selected 50% of the total samples, clustered them into three groups using the 13,375 genes selected above. We computed the maximum Jaccard index (Jaccard 1912b) (see definition in Materials and Methods) between the new cluster and the ground truth. The Jaccard index reflects the correctly clustered samples using the subsets of samples, and thus the robustness of the clustering. All methods except RUVr showed similar Jaccard index values. Next, to evaluate the robustness of clustering to genes used to calculate Pearson correlation coefficient between samples, we ranked the 13,375 genes using their coefficients of variance (CV) (Brennecke et al. 2013), which is the standard deviation of gene expression divided by the mean, reflecting the variation of a gene's expression across samples. We selected 10 sets of genes: top 10%, top 30%, top 50%, top 70%, top 90%, and bottom 10%, bottom 30%, bottom 50%, bottom 70%, and bottom 90%. In general, using the most variable genes achieved the most correct clustering, i.e., high Rand index, and using a sufficient number of the most variable genes (top 10% for RUVr and FPKM, top 30% for UQ and TMM, and top 50% for DeSeq) gave the highest Dunn index, which represents the best separation of the clusters (Figure 2, D-G). Using the least variable genes (most invariable genes), such as the bottom 10% variable genes, normally showed lower Rand and Dunn index than using more variable genes, such as the bottom 90% variable genes (Figure 2, D-G).

Evaluation of methods considering ERCC
We next evaluated the performance of two methods considering ERCCs: RUVg (RUV model considering ERCC) (Risso et al. 2014) and GRM (Ding et al. 2015) (see Materials and Methods for the details of the setup for running each method). Among the 120 RNA-seq runs, 45 samples containing spike-in ERCCs were normalized using the two methods, and then clustered with six bulk samples together ( Figure  3). The clustering results were similar to those obtained using the 13,375 selected genes. UHR and HBR samples were clearly separated, and the samples with the same amount of RNA were largely clustered together. One outlier in the RUVg cluster is a HBR 10 pg sample sequenced using the aRNA protocol that was clustered with HBR 100 pg samples.
The GRM showed a slightly higher Rand index than RUVg when the number of clusters was the same as the ground truth (6) or more ( Figure  4A). GRM also had a better Dunn index and comparable Jaccard index ( Figure 4, B and C), which suggests that GRM achieves better separation of clusters (Dunn index). The robustness to sample selection is comparable for the two methods, as indicated by their similar Jaccard index. Furthermore, we assessed how sensitive the two methods are to gene selection. For the Dunn index, GRM outperformed RUVg, regardless of whether the most variable, or the least variable, genes were selected (Figure 4,F and G). For the Rand index, using the most variable genes, GRM consistently showed higher values than RUVg; using the least variable genes, GRM achieved higher Rand index using the bottom 90 and 70% genes, but lower values using the bottom 50, 30 or 10% genes than RUVg and RUVr (six clusters of bulk, 100 and 10 pg samples for HBR and UHR were considered as the ground truth) (Figure 4, D and E). We examined the clusters generated using the bottom 30% variable genes. GRM still correctly clustered all the samples, but RUVg incorrectly clustered some HBR samples with UHR, despite a higher Rand index ( Figure 4E and Figure 5). We then used the bottom 10% variable genes (the most invariable genes) for a further comparison. Both methods misclustered UHR bulk with the HBR samples, but RUVg had additional HBR samples mistakenly clustered with UHR samples (Figure 5). For the Jaccard index, GRM performed better than RUVg, and RUVr with genes selected from top 10% through top 90% and bottom 90%, bottom 70% and bottom 50%. When the least variable genes were selected (bottom 10%), RUVg and RUVr significantly outperformed GRM ( Figure S3 in File S1).
Notably, both methods considering ERCC outperformed the methods not considering ERCC on the 51 samples (45 10 or 100 pg and six bulk samples) containing spike-in ERCCs ( Figure 4). RUVr, as a representative method, largely outperforms the other methods not considering ERCCs. Obviously, both GRM and RUVg showed higher Rand, Dunn and Jaccard index than RUVr, which indicates the value of considering ERCC in normalization and removing noise from scRNA-seq. Normalization of bulk RNA-seq using ERCCs can be useful, but may be less critical because the technical noise is smaller in bulk RNA-seq than in scRNA-seq.

Conclusions
Here, we evaluated the performance of seven normalization methods on 120 RNA-seq runs in terms of correctness (Rand index), compactness (Dunn index), and robustness (Jaccard index, robustness analysis using different sets of samples and genes) of clustering. The results showed that, for methods not considering spike-in ERCCs, RUVr showed higher Rand index and lower Jaccard index than FPKM, UQ, DESeq, and TMM; all methods showed similar Dunn index values. Considering ERCC, such as in the models of RUVg and GRM, significantly improved performance. Between RUVg and GRM, GRM is more robust in terms of selecting different sets of genes that generate similar clusters. Spike-in ERCCs would reduce the sequencing depth of mRNAs of interest, and there is also a concern about whether the synthesized ERCC molecules behave the same as mRNAs from the cell. Our analyses suggest that calibration of scRNA-seq to the spike-in ERCC is a powerful means of removing technical noise when the ERCCs are correctly modeled.

ACKNOWLEDGMENTS
We are grateful to the SCAP-T Consortium for providing access to the dilution data. This work is partially supported by the National Institutes of Health (U01MH098977).