Low-cost, low-input RNA-seq protocols perform nearly as well as high-input protocols

Recently, a number of protocols extending RNA-sequencing to the single-cell regime have been published. However, we were concerned that the additional steps required to handle such minute quantities of input sample would introduce serious biases, making analysis of the data with existing approaches invalid. In this study, we performed a critical evaluation of several of these low-volume RNA-seq protocols, and found that they performed only slightly worse in per-gene linearity of response while requiring at least two orders of magnitude less input sample. We also explored a simple modification to one of these protocols that, for many samples, reduced the cost of library preparation to approximately $20/sample.

The presence of a PCR amplification step in most RNA-seq protocols has the potential to introduce a significant number of duplicated reads that arise from the same cDNA fragment (Benjamini and Speed 2012; Xu et al. 2012). Although there are experimental approaches to mitigate this effect (Mamanova et al. 2010), another computational option is to simply remove reads that map to identical locations (Li et al. 2009; Xu et al. 2012). However, this has the potential to remove reads that only appear to be duplicates: those that came from different cDNA fragments but, perhaps due to the high expression level of a gene or deep sequencing of a library, happen to start at the same location along a transcript. Moreover, we have found that libraries with high levels of PCR duplication tend to be less reliable when compared to replicates, and so prefer to simply discard those libraries, regenerating them from the initial RNA when possible.
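The position-based deduplication described above is, in essence, a collapse of all reads sharing a mapping position down to a single representative. A minimal sketch of that logic, operating on hypothetical (chromosome, start, strand) tuples rather than real BAM records (which tools such as samtools or Picard would handle in practice):

```python
def collapse_position_duplicates(reads):
    """Keep one read per mapping position.

    `reads` is an iterable of (chrom, start, strand) tuples; this is an
    illustrative stand-in for the per-alignment records a real pipeline
    would read from a BAM file. Reads sharing a position collapse to one,
    which also discards independent fragments that happen to coincide.
    """
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept

reads = [("3R", 100, "+"), ("3R", 100, "+"), ("3R", 250, "+"), ("3R", 100, "-")]
# The two identical ("3R", 100, "+") entries collapse to one; three reads survive.
print(collapse_position_duplicates(reads))
```

Note that the collapse is blind to whether coinciding reads are PCR copies or independent fragments, which is exactly the limitation discussed above.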
However, there is not, to our knowledge, an accepted method to determine whether a sample has a high level of duplication. While examining GBrowse tracks for "blockiness" can qualitatively identify problems (as in Figure 1), it is both time-consuming and not particularly rigorous. Noting the fraction of reads that map to unique locations is imperfect, as it is sensitive to the depth of sequencing, and therefore potentially difficult to compare across libraries.
Comparing the number of unique read start sites to the total number of bases in a gene offers one potential way to quantify the amount of PCR duplication actually present in a library. We simulated drawing unique, independent positions from a 1.5kb transcript (the average Drosophila transcript size, according to Daines et al. (2011)), and noted the fraction of unique start sites. We recognize that this simulation is highly idealized: there is a position-dependent bias, most often favoring the 5′ end of the read; not all read start sites are equally likely, as fragmentation followed by size selection can lose the reads closest to either end; and in a stranded protocol (which we have not tested here) the "forward" read must, necessarily, come before the "reverse" read. Nevertheless, this captures the essence of the problem, and we will show that it closely matches real data.
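The simulation amounts to drawing read start positions uniformly at random (with replacement) from a transcript and counting how many distinct positions are hit. A minimal sketch, with the 1.5kb transcript length and the read counts chosen here purely for illustration:

```python
import random

def simulate_unique_fraction(transcript_len=1500, n_reads=500, seed=0):
    """Draw n_reads start positions uniformly (with replacement) from a
    transcript of transcript_len bases, and return the fraction of
    positions that received at least one read start."""
    rng = random.Random(seed)
    starts = {rng.randrange(transcript_len) for _ in range(n_reads)}
    return len(starts) / transcript_len

# At coverage well below 1x nearly every read lands on a fresh start site,
# so the unique fraction tracks coverage; near and above 1x the available
# start sites saturate and the fraction flattens toward 1.
for n_reads in (15, 150, 1500, 15000):
    coverage = n_reads / 1500
    print(coverage, simulate_unique_fraction(n_reads=n_reads))
```

Repeating the draw over a range of read counts traces out the saturation curve that the per-gene data is compared against below.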
Figure 1: GBrowse views of the same 2kb region of Chromosome 3R in two libraries with different levels of duplication. Despite the similar number of reads in each sample, the lower library has much less information, due to a relatively high level of PCR duplication.

For each gene, we calculated f, the fraction of bases in that gene that had a read start at that position. In Figure 2, we plotted this quantity against the number of reads mapping to that gene divided by the gene length (that is, the total coverage, c). As expected, as the average coverage for each gene increased, the fraction of unique start sites increased as well, until the coverage approached 1x, at which point the available start sites became saturated. Crucially, when we plotted these on a log-log plot, the region below about 10% of occupied start sites was approximately linear for at least three decades below that. The fit equation was

log10(f) = m · log10(c) + b

We expect the slope m to be slightly less than 1: increasing the coverage should increase the fraction of start sites occupied, but there is some chance of multiple, independent reads coming from the same location, even in the absence of duplication. This leads to an easy interpretation of the score B = 10^(b_sim − b) as the average level of PCR duplication. A lower intercept corresponds to a shift to the right on the plot, which in turn means that a higher coverage is needed to yield a given number of unique start sites.
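The B-score calculation reduces to fitting a line to (log10 c, log10 f) pairs in the linear regime and comparing the fitted intercept to the intercept from the idealized simulation. A dependency-free sketch (the 10% occupancy cutoff follows the text; numpy.polyfit would perform the same least-squares fit):

```python
import math

def b_score(gene_stats, b_sim, f_max=0.10):
    """Estimate the average PCR duplication level B = 10**(b_sim - b).

    `gene_stats` is a list of (f, c) pairs, one per gene: f is the
    fraction of bases with a read start, c the coverage. Only genes in
    the approximately linear regime (f below f_max) are used. `b_sim`
    is the intercept of the same fit on idealized simulated data.
    The line log10(f) = m*log10(c) + b is fit by ordinary least squares.
    """
    pts = [(math.log10(c), math.log10(f))
           for f, c in gene_stats if 0 < f < f_max and c > 0]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return 10 ** (b_sim - b)
```

As a sanity check on the interpretation: if every gene's unique-start fraction is exactly half what the simulation predicts at the same coverage, the intercept drops by log10(2) and the score comes out as B = 2, i.e. two reads per independent fragment on average.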
When we calculated the B-score for samples from previously published datasets, both from our lab (Lott et al. 2011) and the modENCODE consortium (modENCODE Consortium et al. 2010), we found that the scores were all less than 3 (Figure 3A). By contrast, previous libraries from our lab that were left unpublished because we suspected the data were highly enriched for duplicates (including the lower panel of Figure 1) had B-scores in excess of 5 and up to 30 (Figure 3B). RNA-seq data used in the main text of this study had B-scores as shown in Table 1 in this document. Although in many cases they are higher than ideal, they are comparable to previously published data, and only one is greater than 4.3.

We also simulated the fits for various transcript sizes between 100bp and 10kb. Although the fit parameters did show a clear, increasing trend in response to increasing the simulated transcript size (Figure 4), the increase was small compared to the actual values. A variation of 0.005 in the intercept, which is used to calculate the B-score, corresponds to an actual difference of about 1%. We are thus not concerned about the choice of 1.5kb to simulate the ideal scenario.