Pooling across cells to normalize single-cell RNA sequencing data with many zero counts

Normalization of single-cell RNA sequencing data is necessary to eliminate cell-specific biases prior to downstream analyses. However, this is not straightforward for noisy single-cell data where many counts are zero. We present a novel approach where expression values are summed across pools of cells, and the summed values are used for normalization. Pool-based size factors are then deconvolved to yield cell-based factors. Our deconvolution approach outperforms existing methods for accurate normalization of cell-specific biases in simulated data. Similar behavior is observed in real data, where deconvolution improves the relevance of results of downstream analyses. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0947-7) contains supplementary material, which is available to authorized users.


Justifying the choice of pooling strategy
The performance of a ring arrangement and sliding window can be compared to selection of random pools of cells. Count data were simulated as described in the main text, with some modifications. Specifically, the true values of θ_j for all cells were sampled from a Uniform(0.1, 1) distribution instead, to provide a larger range of size factors; counts were only generated for a single subpopulation of 200 cells; and no DE genes were introduced. To estimate the size factors, deconvolution was performed using pools of 20 cells, assembled either with the ring arrangement in Figure S3 or after randomly permuting the order of cells in the ring. (For demonstration purposes, only one pool size was used to construct the linear system. This ensures that the estimation errors are large enough to reliably observe any changes in precision.) Precision was quantified as the median absolute deviation (MAD) of the log-fold differences between the estimated and true θ_j. A completely precise method should have a MAD of zero, as the estimated and true θ_j would be proportionally identical for all j. In this simulation, using the ring arrangement yields a MAD of 0.051 while using random pools yields a MAD of 0.079 (standard errors of 0.002 and 0.003, respectively, across 10 simulation iterations). This suggests that the ring arrangement provides a modest improvement in estimation precision.

Resolving linear dependencies in the constructed system
Consider the application of the deconvolution method on a data set with four cells, using a sliding window of size 2. Assuming cells j = 1 to 4 were placed consecutively on the ring, this would yield the linear system
\[
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{pmatrix}
=
\begin{pmatrix} \theta_A \\ \theta_B \\ \theta_C \\ \theta_D \end{pmatrix}
\]
for pool-based size factors θ_A to θ_D. We set all t_j to unity for simplicity, though the reasoning below applies when solving for both θ_j and θ_j t_j^{-1}. Assume that the pool-based factors are estimated accurately and precisely, such that the minimum value of the residual sum of squares for this system can be obtained near the set of true values for the cell-based factors. This system has no unique solution: for the true factors (θ_1, θ_2, θ_3, θ_4)^T, an equally good fit can be obtained with (θ_1 + x, θ_2 − x, θ_3 + x, θ_4 − x)^T for any real x.
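The lack of a unique solution can be verified numerically. In this sketch (pure Python, with made-up values for the θ_j), shifting alternate cells by ±x leaves every pool sum, and hence the fit to the pool-based factors, unchanged:

```python
# Pools from a sliding window of size 2 on a 4-cell ring (0-based indices):
# A = {1,2}, B = {2,3}, C = {3,4}, D = {4,1}.
POOLS = [(0, 1), (1, 2), (2, 3), (3, 0)]

def pool_sums(theta):
    """Sum of cell-based factors within each pool."""
    return [theta[a] + theta[b] for a, b in POOLS]

theta = [0.5, 0.75, 1.25, 1.0]  # hypothetical true cell-based factors
x = 0.25
perturbed = [theta[0] + x, theta[1] - x, theta[2] + x, theta[3] - x]

# Both solutions reproduce the pool-based factors exactly,
# so the residual sum of squares cannot distinguish them.
assert pool_sums(perturbed) == pool_sums(theta)
```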
The addition of equations relating each size factor to its direct estimate θ̂_j ensures identifiability, as a value of x will be chosen that minimizes the residual sum of squares of (θ_1 + x, θ_2 − x, θ_3 + x, θ_4 − x)^T from the direct estimates. In this example, the residual sum of squares for the possible solutions is written as
\[
\sum_{j \in J_1} (\theta_j + x - \hat{\theta}_j)^2 + \sum_{j \in J_2} (\theta_j - x - \hat{\theta}_j)^2
\]
where J_1 = {1, 3} and J_2 = {2, 4}. For these 4 cells, the above expression is minimized in terms of x at
\[
x = \frac{1}{4} \left[ \sum_{j \in J_2} (\theta_j - \hat{\theta}_j) - \sum_{j \in J_1} (\theta_j - \hat{\theta}_j) \right].
\]
This is true regardless of the relative weighting applied to the set of additional equations. The direct estimator of each size factor is also unbiased, as stochastic zeroes are not removed, i.e., E(Θ̂_j) = θ_j for θ̂_j ∼ Θ̂_j. As the number of cells in the system increases, the sizes of J_1 and J_2 will increase. For example, deconvolution is typically applied in data sets involving at least 200 cells and a sliding window of size 20, such that both J_1 and J_2 will consist of at least 10 cells. With more cells, the sum of the errors θ_j − θ̂_j will approach the sum of their expected values, i.e., zero. This means that the minimum residual sum of squares will be obtained at x = 0, i.e., solving the linear system will yield the true values of the size factors. Thus, the additional equations will not affect accurate estimation of the size factors in the deconvolution method.
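The closed-form minimizer can be checked numerically. This sketch (pure Python, with hypothetical true factors and direct estimates) evaluates the residual sum of squares at the claimed optimum and confirms that nearby values of x do worse:

```python
J1 = [0, 2]  # 0-based indices of J1 = {1, 3}, shifted by +x
J2 = [1, 3]  # 0-based indices of J2 = {2, 4}, shifted by -x

def rss(theta, est, x):
    """Residual sum of squares of the shifted solution from the direct estimates."""
    return (sum((theta[j] + x - est[j]) ** 2 for j in J1)
            + sum((theta[j] - x - est[j]) ** 2 for j in J2))

theta = [0.5, 0.75, 1.25, 1.0]  # hypothetical true factors
est = [0.55, 0.7, 1.3, 1.05]    # hypothetical direct estimates

# Closed-form optimum: a quarter of the difference between the summed errors.
x_opt = (sum(theta[j] - est[j] for j in J2)
         - sum(theta[j] - est[j] for j in J1)) / 4

# The RSS is strictly convex in x, so the optimum beats any nearby value.
for dx in (-0.1, -0.01, 0.01, 0.1):
    assert rss(theta, est, x_opt) < rss(theta, est, x_opt + dx)
```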
As the sequencing coverage increases, there will be a concomitant increase in the mean and variability of the counts. This is represented in the simulation by a larger dispersion value, especially at low abundances where dropouts are more likely.
The count for each gene in each cell was sampled from a NB distribution with a mean of θ_j λ_is and dispersion φ_i, using the same experimental design as described in the main text, i.e., three subpopulations of 250 cells with varying levels of DE. Size factors were then computed from the counts using the available methods. For the existing methods, similar performance was observed compared to the original simulations. DESeq and TMM normalization were consistently biased (Figures S4a and S4b), suggesting that zero counts are still problematic in high-coverage data sets. Library size normalization continued to fail in the presence of DE genes (Figure S4c). In contrast, deconvolution performed well in all simulation scenarios (Figure S5). Greater divergence from the true size factors is observed in Figure S5c, but deconvolution is still more accurate than the existing methods in the equivalent scenario (Figure S4, third row). These results suggest that the increase in normalization accuracy from deconvolution is still present in higher-coverage experiments.

Computational complexity of the deconvolution method
The computational time and memory required by deconvolution depend on the number of cells. An equation is constructed for each cell, so the size of the linear system (in terms of the number of equations and coefficients) will increase quadratically with the number of cells. Furthermore, common implementations of the QR decomposition have O(n^3) time complexity for n cells. This is roughly consistent with our empirical timings (Figure S6). In practice, the cubic time complexity is mitigated by the use of clustering to break up the linear system. As deconvolution is performed within each cluster, it is cubic only with respect to the size of each cluster, and is linear with respect to the number of clusters. The overall time complexity lies between the linear and cubic extremes, depending on the number of clusters and the size of each cluster: a few large clusters will result in cubic complexity, while many small clusters will result in linear complexity.
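The saving from clustering can be illustrated with some back-of-the-envelope arithmetic (a sketch only; constants and the exact cost of the QR decomposition are ignored). Splitting n cells into k equal clusters replaces one n^3 solve with k solves of (n/k)^3, cutting the cost by a factor of k^2:

```python
def qr_cost(n_cells, cluster_sizes=None):
    """Rough operation count: cubic within each cluster, summed over clusters.

    Without clustering, the whole system is solved in one cubic-cost step.
    """
    if cluster_sizes is None:
        cluster_sizes = [n_cells]
    assert sum(cluster_sizes) == n_cells  # every cell belongs to one cluster
    return sum(c ** 3 for c in cluster_sizes)

n = 1000
single = qr_cost(n)                 # one big solve: n^3 operations
clustered = qr_cost(n, [100] * 10)  # 10 clusters of 100 cells each
assert clustered == single // 100   # k equal clusters cut the cost by k^2
```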
Memory usage of the deconvolution method is harder to measure due to the unpredictability of garbage collection in R. We expect that usage will increase quadratically with the number of cells, due to increases in the size of the linear system. Again, this can be mitigated by clustering for large data sets. Further savings may be possible by using sparse matrices, given that most of the entries in the system are equal to zero.
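The sparsity is easy to quantify: in the pool-based part of the system, each row has one coefficient per cell in the window, so the density is roughly w/n for window size w. A small sketch (pure Python, covering the ring pools only and ignoring the extra per-cell equations):

```python
def ring_pool_matrix(n_cells, window):
    """0/1 coefficient matrix for pools from a sliding window on the ring."""
    rows = []
    for start in range(n_cells):
        row = [0] * n_cells
        for offset in range(window):
            row[(start + offset) % n_cells] = 1  # cell is in this pool
        rows.append(row)
    return rows

n, w = 200, 20
mat = ring_pool_matrix(n, w)
nonzero = sum(sum(row) for row in mat)
assert nonzero == n * w            # 4000 nonzero entries...
assert nonzero / (n * n) == w / n  # ...i.e., only 10% of the matrix is nonzero
```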
We note that most clustering algorithms have quadratic time complexity with respect to the number of cells, e.g., when building the distance matrix. However, we expect that clustering would be performed anyway as a routine part of the scRNA-seq data analysis. Thus, no additional time is added by normalization.
Comparing normalization accuracy on real data

Overview of the assessment framework

Consider two estimators Θ̂_j and Θ̃_j of the size factor in cell j, where each estimator represents a different normalization method. We assume that Θ̂_j is an unbiased estimator of the true size factor θ_j, i.e., E(Θ̂_j) = θ_j. In contrast, Θ̃_j is biased such that a power relationship exists between the estimates and true values. This is not unreasonable given the simulation results (Figure S4), and can be written as
\[
E(\tilde{\Theta}_j) = \theta_j^{\,b}
\]
for gradient b. Any bias will manifest as a non-unity gradient indicating divergence between the true and estimated values. (For simplicity, we do not include any intercept term in the above expression. This does not affect the generality of the framework, as the absolute scale of the size factors is irrelevant to their interpretation.) We define the true mean for gene i in cell j as μ_ij = θ_j λ_i0, where λ_i0 represents the expected number of transcripts. We assume that no DE is present, so a constant λ_i0 can be used for all cells. Now, consider fitting a log-link GLM to the counts using log[E(Θ̂_j)] as an offset and log[E(Θ̃_j)] as the only covariate in a model with an intercept term. The systematic component of the GLM is written as
\[
\log \mu_{ij} = \beta_{i0} + \beta_{i1} \log[E(\tilde{\Theta}_j)] + \log[E(\hat{\Theta}_j)]
\]
where β_i0 and β_i1 are the values of the coefficients for the intercept and covariate terms, respectively. Given the relationships between E(Θ̂_j), E(Θ̃_j) and θ_j, the above expression can be rewritten to obtain
\[
\log \lambda_{i0} = \beta_{i0} + \beta_{i1}\, b \log \theta_j .
\]
This relationship must hold for all cells, as the systematic component of the GLM describes all contributions to μ_ij for all j. This means that, if b ≠ 0, β_i0 must be equal to log(λ_i0) and β_i1 must be equal to zero. We then perform a likelihood ratio test (LRT) against the null hypothesis of β_i1 = 0 for the alternative hypothesis of β_i1 ≠ 0. We should obtain only a small number of rejections as the null hypothesis is true for each gene.
(Similarly, if b = 0, few rejections should be observed as the null and alternative models are identical for an all-zero covariate.) In contrast, using log[E(Θ̃_j)] as the offset and log[E(Θ̂_j)] as the covariate leads to
\[
\log \lambda_{i0} = \beta'_{i0} + (\beta'_{i1} + b - 1) \log \theta_j .
\]
The above relationship only holds for all cells when β′_i0 = log(λ_i0) and β′_i1 = 1 − b, where β′_i0 and β′_i1 are the values of the coefficients in this "switched" configuration. Repeating the LRT to test for a non-zero covariate term should yield more rejections than in the original configuration, as β′_i1 ≠ 0 when b ≠ 1. These rejections can also be treated as DE genes where the DE log-fold change represents the normalization bias.
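The expected coefficients in the two configurations can be verified with a noise-free sketch. Here the counts are replaced by their exact means, so an ordinary least-squares slope on the log-scale (pure Python, with hypothetical values for b, the θ_j and λ_i0) recovers a zero covariate coefficient in the original configuration and 1 − b after switching:

```python
import math

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b = 0.7                            # hypothetical bias gradient
theta = [0.2, 0.5, 1.0, 1.5, 2.0]  # hypothetical true size factors
lam = 10.0                         # hypothetical lambda_i0

log_mu = [math.log(t * lam) for t in theta]    # noise-free log-means
unbiased = [math.log(t) for t in theta]        # log E(unbiased) = log(theta)
biased = [b * math.log(t) for t in theta]      # log E(biased) = b * log(theta)

# Original configuration: unbiased offset, biased covariate -> slope of 0.
resid = [m - o for m, o in zip(log_mu, unbiased)]
assert abs(slope(biased, resid)) < 1e-9

# Switched configuration: biased offset, unbiased covariate -> slope of 1 - b.
resid = [m - o for m, o in zip(log_mu, biased)]
assert abs(slope(unbiased, resid) - (1 - b)) < 1e-9
```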
In practice, E(Θ̂_j) and E(Θ̃_j) are unknown, so we replace them with the observed size factor estimates θ̂_j and θ̃_j, respectively, in the GLM fit. This assumes that the variances of Θ̂_j and Θ̃_j are low, such that any observed estimate will be close to the expected value. The substitution is justified by the nature of the trends in Figures 1, 4, S4 and S5. For most scenarios, the estimates lie on a well-defined curve with little scatter, which suggests that the estimation error is low. We also repeated the no-DE simulation in Figures 1 and 4 and computed the MAD of the log-estimates for each cell across 10 iterations. These MADs ranged from 0.007 to 0.021 for TMM normalization (using the interquartile range across all cells); 0.005 to 0.011 for DESeq normalization; 0.007 to 0.013 for library size normalization; and 0.025 to 0.043 for deconvolution. In contrast, the MAD of the log-transformed true size factors across cells was 0.39. This indicates that the estimation error of each size factor is small relative to the variability of the size factors across cells.

Deconvolution outperforms existing methods on real data
The above framework was applied to assess the performance of each existing method compared to the deconvolution method on real data. For each data set, cells were partitioned into pre-defined groups in which no DE was expected. The deconvolution size factors were log-transformed and used as offsets, while the log-size factors from an existing method were used as the covariates. NB dispersion estimation and GLM fitting were performed on the counts for each group separately, using methods from edgeR. The LRT was used to test for DE, and the number of DE genes detected at a FDR of 5% was counted. This was repeated after switching the size factors, i.e., using the deconvolution size factors as the covariates and the existing size factors as the offsets. To ensure a valid comparison, the dispersion estimates from the original fit were re-used to fit the GLM after switching. Similarly, the p-value corresponding to an FDR threshold of 5% in the initial set of DE genes was used to define the set of DE genes in the switched fit. This means that the differences between β_i1 and β′_i1 are the sole cause of any difference in the number of DE genes. The entire procedure was then repeated for each existing method and for each group of cells. Examination of Figure S7 indicates that fewer DE genes are consistently detected in each group when the deconvolution size factors are used as the offsets, relative to the number detected when they are used as covariates. This suggests that the deconvolution size factors are equivalent to the unbiased estimates θ̂_j and are more accurate than the existing methods.
For these results, there are a number of caveats that should be mentioned. Firstly, we assume that one set of size factor estimates is unbiased while the other set is not. The performance of deconvolution in the simulations suggests that it can be treated as the unbiased method. However, this is unlikely to be completely true, as both sets will exhibit some bias. The above evaluation will be less effective when large biases are present in both the offsets and covariates. Secondly, we assume that the biased size factors follow a power relationship with respect to the true values. The effect of other relationships is less predictable, though we would still expect them to yield non-zero values for β′_i1. Large var(Θ̂_j) and var(Θ̃_j) may also increase the noise in the expected relationships of β_i1 = 0 and β′_i1 = 1 − b. Finally, we assume that there is no DE within the group of cells, or at least no DE that is associated with the size factors. We select cells in homogeneous groups to weaken this assumption, but there may be additional DE within each group that we have not considered. In summary, a large number of assumptions are used here, but this is likely to be inevitable given that the true size factors (which would facilitate a direct assessment of estimation accuracy) are unknown in real data.

Figure S2: Performance of DESeq normalization after addition of a pseudo-count. A pseudo-count of unity was added to (a) all counts directly, or (b) after scaling the pseudo-count by the relative library size, i.e., the pseudo-count added to each cell was equal to the ratio of its library size to the mean library size. Simulations were performed with no DE (first row) and varying magnitudes of DE (second row). Axes are shown on a log-scale. For comparison, each set of size factors was scaled such that the grand mean across cells was the same as that for the true values. The red line represents equality between the rescaled estimates and true factors.
Cells in the first, second and third subpopulations are shown in black, blue and orange, respectively.

Figure S3: Ring arrangement of cells ordered by library size. Each circle represents a cell, where the size of the circle corresponds to the library size of that cell. Even- and odd-ranking cells lie on opposite sides, with the largest and smallest libraries at the top and bottom, respectively. Cells lying in a sliding window of length 4 are highlighted in red. Different instances of the window are obtained by sliding the window across the ring.