Accounting for technical noise in single-cell RNA sequencing analysis

Recent technological breakthroughs have made it possible to measure RNA expression at the single-cell level, thus paving the way for exploring expression heterogeneity among individual cells. Current single-cell RNA sequencing (scRNA-seq) protocols are complex and introduce technical biases that vary across cells, which can bias downstream analysis without proper adjustment. To account for cell-to-cell technical differences, we propose a statistical framework, TASC (Toolkit for Analysis of Single Cell RNA-seq), an empirical Bayes approach to reliably model the cell-specific dropout rates and amplification bias by use of external RNA spike-ins. TASC incorporates the technical parameters, which reflect cell-to-cell batch effects, into a hierarchical mixture model to estimate the biological variance of a gene and detect differentially expressed genes. More importantly, TASC is able to adjust for covariates to further eliminate confounding that may originate from cell size and cell cycle differences. In simulation and real scRNA-seq data, TASC achieves accurate Type I error control and displays competitive sensitivity and improved robustness to batch effects in differential expression analysis, compared to existing methods. TASC is programmed to be computationally efficient, taking advantage of multi-threaded parallelization. We believe that TASC will provide a robust platform for researchers to leverage the power of scRNA-seq.


INTRODUCTION
Recent technological breakthroughs have made it possible to measure RNA expression at the single-cell level, thus paving the way for exploring gene expression heterogeneity among individual cells (1-4). The collection of abundances of all RNA species in a cell forms its "molecular fingerprint", enabling the investigation of many fundamental biological questions beyond those possible by traditional bulk RNA sequencing experiments (5). With scRNA-seq data, one can better characterize the phenotypic state of a cell and more accurately describe its lineage and type.
Current scRNA-seq protocols are complex, often introducing technical biases that vary across cells (http://biorxiv.org/content/early/2015/08/25/025528), which, if not properly removed, can lead to severe type I error inflation in differential expression analysis. Compared to bulk RNA sequencing, in scRNA-seq the reverse transcription and preamplification steps lead to dropout events and amplification bias, the former describing the scenario in which a transcript expressed in the cell is lost during library preparation and is thus undetectable at any sequencing depth. In particular, due to the high prevalence of dropout events in scRNA-seq, it is crucial to account for them in data analysis, especially if conclusions involving low to moderately expressed genes are being drawn (7). In handling dropout events, existing studies take varying approaches: some ignore dropouts by focusing only on highly expressed genes (8,9), some model dropouts in a cell-specific manner (10-13), while others use a global zero-inflation parameter to account for dropouts (7).
Since each cell is processed individually within its own compartment during the key initial steps of library preparation, technical parameters that describe amplification bias and dropout rates should be cell-specific in order to adjust for the possible presence of systematic differences across cells. For example, a recent article by Leng et al. found significantly increased gene expression in cells captured from sites with small or large plate output IDs for data generated by the Fluidigm C1 platform (14). One way to quantify these biases, adopted by existing noise models (10-13), is to make use of spike-in molecules that comprise a set of external RNA sequences such as the commonly used external RNA Controls Consortium (ERCC) spike-ins (15), which are added to the cell lysis buffer at known concentrations (4,16). However, a challenge that cannot be ignored in the single-cell setting is that the wide range of concentrations of ERCC spike-ins makes it difficult to measure spike-ins with low concentrations, leading to the lack of reliable spike-in data for estimation of the dropout rates.
For this reason, existing methods that model cell-specific dropout rates using spike-ins do not produce reliable estimates.
We propose here a new statistical framework that allows a more robust utilization of spike-ins to account for cell-specific technical noise. To obtain reliable estimates of cell-specific dropout parameters, we develop an empirical Bayes procedure that borrows information across cells. This is motivated by the observation that, although each cell has its own set of parameters for characterizing its technical noise, these parameters share a common distribution across cells which can be used to make the cell-specific estimates more stable. We demonstrate an application of this general framework by a likelihood-based test for differential expression. An advantage of the proposed framework over the existing approaches is that it can flexibly and efficiently adjust for cell-specific covariates, such as cell cycle stage or cell size, which may confound differential expression analysis.

Data sets and pre-processing
Zeisel et al. data: scRNA-seq data from murine brain cells are acquired from Zeisel et al. (5).
This data set, which employs UMIs, contains counts of 19,972 endogenous genes and 57 ERCC spike-ins of 3,005 cells from various regions of mouse brain. The cells are categorized into nine level-1 classes and 48 level-2 classes, with the level-2 classes considered relatively homogenous. In this paper, we focus our analyses on two level-2 classes, CA1Pyr1 and CA1Pyr2, which respectively contain 447 and 380 cells. The counts are preprocessed by selecting the top 25% of genes in total read account across the 827 cells, resulting in 6,405 genes in real data two-group comparison analysis. For studies involving class CA1Pyr2 only, selection of the top 25% of genes in the 447 cells yield 5,018 genes in the data set. SCAP-T data: scRNA-seq data from murine brain cells are acquired from the SCAP-T study (dbGaP Study Accession phs000835.v4.p1). This data set, which does not have UMIs, contains counts of 46,422 endogenous genes and 87 ERCC spike-ins of 198 neurons and 26 astrocytes from mouse brain. The counts are preprocessed by two filtering procedures: Filter 1 keeps the top 25% of genes in total read account across all the cells. Filter 2 keeps all the genes with nonzero counts in 5 cells or more. Since neurons and astrocytes are processed on different days, this allows us to evaluate whether our model is able to capture and control batch effect.

Modeling of technical variation
In scRNA-seq data, we have observed that the relationship between the mapped read count for a gene and its true expression level in a cell can be characterized using two functions, shown in   (2) Figure 1 shows examples of the relationships depicted in (1) and (2) (1) and (2) characterize the technical noise specific to each cell.

Modeling of biological variation
The above observations have motivated the model shown in Figure 2 ߢ and ߬ are estimated using an empirical Bayes approach that can be summarized briefly into the following steps. • Step 1: ߢ and ߬ , are estimated using logistic regression, is the true expression level of spike-in g and it is assumed to be the same across all cells. • Step 2: A bivariate normal distribution is fit to the estimated  The presence of substantial variation in the magnitude of technical noise across cells underscores the need to account for such variation in downstream analyses.

RESULTS
In this section, we evaluate the performance of TASC on both simulated and two real scRNAseq data sets and compare it with four existing methods, including SCDE (10), MAST (11), and DESeq2 (23), and SCRAN (24). As SCRAN only provides normalized read counts, we perform differential expression analysis using DESeq2 with SCRAN normalized read counts. We include two versions of SCRAN in our evaluation, the original SCRAN, and SCRAN.SP that utilizes ERCC spike-ins in normalization. These methods are rated in terms of type I error rate and power in detecting DE genes, and their results on a real data set with genuine gene expression difference.

Type I error rates in the absence of batch effects
To assess the accuracy of type I error control of TASC and other existing methods, 447 cells from the level-2 class "CA1Pyr2" from the Zeisel et al. data, which is the largest level-2 class, are randomly split into two groups of roughly equal size. value is plotted on heatmaps to reflect the type I error rates under varying severity of batch effects. Figure 5 shows that TASC has well controlled type I error rates across a wide range of batch effect severity, whereas SCDE appears to be conservative overall, and MAST, DESeq2, SCRAN and SCRAN.SP are anti-conservative and susceptible to batch effects.
To compare the methods with regards to their type I error rate under a real data scenario, we analyzed the SCAP-T data, which includes astrocytes and neurons that were processed on different days. This data set provides a perfect example to illustrate the impact of batch effect.
As shown in Supplementary Figure S23  Even SCDE, which tends to be conservative when there are no batch effects, suffer from type I inflation in this real data scenario that contains a possible batch effect. The patterns are similar when we consider all genes in the evaluations.

Evaluation of power
To investigate the power of the methods under realistic scenarios, we continue to utilize the 5,018 genes from the "CA1Pyr2" class in Zeisel et al. data set. Among them, 4,018 genes are designated as "true non-DE", whose counts are directly extracted from the Zeisel et al. data set after group membership randomization. The remaining 1,000 are designated as "true DE", whose counts are simulated from parameters estimated with real data, with an induced between-group fold change that is randomly sampled from a distribution that generates more genes with weak to moderate expression difference than strong difference (details are given in Section 2.3 of Supplementary Data). DE analyses are performed with all methods, and raw pvalues are used to estimate the power of each method at various significance levels. The average power curves in Figure 6A are obtained by smoothing the estimated power across genes with similar fold change. Our results demonstrate that TASC has the highest power, followed by SCRAN.SP, SCRAN, DESeq2, MAST, and SCDE. Figure 6B shows that the higher sensitivity of TASC is more pronounced when fold change is moderate; for example, when fold change is 1.75, at the 0.0001 significance level, the average power of TASC is 8%, 20%, 25%, 37%, and 428% higher than SCRAN.SP, SCRAN, DESeq2, MAST, and SCDE, respectively.
The analysis also indicates the importance of sample size in DE analysis in scRNA-seq (Supplementary Figures S21-22).

Differential Expression analysis on real data
We continue to use the Zeisel et al. data set to evaluate methods performance when true differential expression is present. All five methods are applied to detect DE genes between the two level-2 classes "CA1Pyr2" (݊ ൌ 3 8 0 randomly chosen from 447 cells) and "CA1Pyr1" . Since these two level-2 classes represent different cell type groups, we expect genuine gene expression differences between them. To evaluate the impact of sample size, the two groups are subsampled to

DISCUSSION
The advent of scRNA-seq has made it possible to explore cellular heterogeneity with unprecedented resolution. For the first time, we are able to measure the cell-to-cell variation of RNA expression of all genes in the genome across hundreds to thousands of cells. However, due to the limitations of current technology, scRNA-seq data are often noisy. Failure to account 1 for technical noise can lead to biased downstream analyses and misleading results. To take full advantage of scRNA-seq, it is crucial to account for technical noise so as to better quantify biological variation. Here we have described a statistical framework, TASC, that accurately estimates cell-specific technical biases, adjusts for them in differential expression analysis, and consequently produces results that are more robust to batch effects that exhibit as systematic differences between cells.
TASC utilizes information in spike-ins to account for technical noise in a cell-specific manner.
Compared to the traditional bulk RNA sequencing, in scRNA-seq the reverse transcription and preamplification steps can lead to pervasive dropout events and amplification bias. While amplification bias can be alleviated by the use of UMIs, dropout events are harder to control. To reliably estimate cell-specific dropout parameters under the paucity of reliable spike-ins at low concentrations, we have developed an empirical Bayes procedure that borrows information across cells. The accuracy of this empirical Bayes procedure has been examined in simulations based on real scRNA-seq data.
Our evaluations show that TASC is always slightly conservative. However, we are willing to accept this slight conservativeness, since the data used in our evaluations are generated under an ideal null distribution, and we believe it is more meaningful to examine each method's performance in noisier data where strong deviations from the null can be observed. This is the motivation for our analysis of the data set from SCAP-T involving the comparisons of two groups of neurons with batch effects. Our results show that TASC achieves accurate type I error control under this noisy setting, whereas other methods have substantially inflated type I errors.
Since the performance of TASC relies on spike-ins, it is important to determine if the spike-in data are reliable so that the algorithm can give reasonable results. Before applying TASC, we would always recommend a visual examination of the spike-in data using plots similar to  Figures S24 and S25). If this ܴ ଶ is too low, the spike-in data for that cell may give poor estimates of technical parameters. One could throw away cells with ܴ ଶ values below a threshold, on the basis of unreliable spike-ins, but we believe that is not necessary due to the shrinkage imposed by our empirical Bayes model. For those cells where the observed spike-in counts are too noisy, estimates of the technical parameters are shrunk more aggressively to the cross-cell means, and thus for these less-than-ideal cases one could still have usable technical parameters for normalization.
In scRNA-seq data analysis, it is critical to filter out cells with low quality. A commonly used strategy for filtering is to remove cells in which more than 20% reads come from ERCC spikeins, on the basis that these cells may have been compromised during the experiment. We are not against this filtering strategy, however, we believe that this pre-filtering is not strictly necessary if the adjustment is made for technical dropout and cell size in downstream analyses.
Since the cells with >20% ERCC reads would have small estimated size, and since most genes in a small cell would have zero count or small count prone to dropout, the zeros in such a cell would no longer be outliers in a differential expression analysis. Hence, by careful modeling of technical bias and by allowing for cell size as a covariate, TASC reduces the contribution of these low quality cells to the analysis and avoids the use of an arbitrary cutoff in eliminating cells.
An important feature of TASC is the ability to adjust for covariates such as cell size and cell cycle. Supplementary Figures S26 and S27  The current implementation of TASC assumes that the true expression levels of a gene follow a logNormal distribution in cells. We recognize that logNormal does not entirely reflect the true distribution as transcriptional bursting could lead to zeros in the expression. A more realistic distribution is zero-inflated logNormal, which accounts for true zeros in gene expression. We 2.60GHz) with Laplacian approximation using the binary we provided. Better performance can be achieved when using binaries compiled on the user's hardware. We believe that TASC will provide a robust platform for researchers to leverage the power of scRNA-seq.

SUPPLEMENATARY DATA
Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENT
The authors thank Drs. James Eberwine and Arjun Raj for helpful discussions.

ACCESSION NUMBER
The SCAP-T sequencing data have been deposited in dbGap (accession number phs000835.v4.p1).  u  d  e  n  a  a  r  d  e  n  ,  A  .  (  2  0  1  4  )  V  a  l  i  d  a  t  i  o  n  o  f  n  o  i  s  e  m  o  d  e  l  s  f  o  r  s  i  n  g  l  e  -c  e  l  l  t  r  a  n  s  c  r  i  p  t  o  m  i  c  s  .   N  a  t  u  r  e  m  e  t  h  o  d  s   ,  1  1  ,  6  3  7  -6  4  0  .  2  0  .  P  a  d  o  v  a  n  -M  e  r  h  a  r  ,  O  .  ,  N  a  i  r  ,  G  .  P  .  ,  B  i  a  e  s  c  h  ,  A  .  G  .  ,  M  a  y  e  r  ,  A  .  ,  S  c  a  r  f  o  n  e  ,  S  .  ,  F  o  l  e  y  ,  S  .  W  .  ,  W  u  ,  A  .  R  .  ,  C  h  u  r  c  h  m  a  n  ,  L  .  S  .  ,  S  i  n  g  h  ,  A  .  a  n  d  R  a  j  ,  A  .  (  2  0  1  5  )  S  i  n  g  l  e  m  a  m  m  a  l  i  a  n  c  e  l  l  s  c  o  m  p  e  n  s  a  t  e  f  o  r  d  i  f  f  e  r  e  n  c  e  s  i  n  c  e  l          MAST 4.89e-2 1.12e-2 1.81e-3 7.75e-4 2.58e-4 7.42e-2 2.62e-2 1.21e-2 8.45e-3 6.25e-3 DESeq2 1.70e-1 8.44e-2 3.52e-2 1.69e-2 8.27e-3 1.25e-1 5.91e-2 2.32e-2 1.01e-2 4.79e-3 SCRAN.SP 2.50e-1 1.39e-1 6.63e-2 3.67e-2 2.11e-2 1.69e-1 9.13e-2 4.19e-2 2.21e-2 1.21e-2 Table 1: Proportion of DE genes identified by each method in SCAP-T data at varying significance levels. Filter 1 keeps the top 25% of genes in total read account across all the cells. Filter 2 keeps all the genes with non-zero counts in 5 cells or more. Naïve SCRAN without the use of spike-ins is not included in this comparison, for the package fails to run due to there being "not enough cells in each cluster for specified 'sizes'".