LiBiNorm: an htseq-count analogue with improved normalisation of Smart-seq2 data and library preparation diagnostics

Protocols for preparing RNA sequencing (RNA-seq) libraries, most prominently “Smart-seq” variations, introduce global biases that can have a significant impact on the quantification of gene expression levels. This global bias can lead to drastic over- or under-representation of RNA in non-linear length-dependent fashion due to enzymatic reactions during cDNA production. It is currently not corrected by any RNA-seq software, which mostly focus on local bias in coverage along RNAs. This paper describes LiBiNorm, a simple command line program that mimics the popular htseq-count software and allows diagnostics, quantification, and global bias removal. LiBiNorm outputs gene expression data that has been normalized to correct for global bias introduced by the Smart-seq2 protocol. In addition, it produces data and several plots that allow insights into the experimental history underlying library preparation. The LiBiNorm package includes an R script that allows visualization of the main results. LiBiNorm is the first software application to correct for the global bias that is introduced by the Smart-seq2 protocol. It is freely downloadable at http://www2.warwick.ac.uk/fac/sci/lifesci/research/libinorm.

There are a small number of bias corrections that do recognise that there may be a length related global bias in RNA-seq data. Sailfish [14,15] and its successor Salmon [16], include a single length parameter in its multi-bias correction model, assuming a simple linear length associated bias. This therefore fails to recognise the multi-parameter complexity of the global bias described by N Archer, MD Walsh, V Shahrezaei and D Hebenstreit [17]. The flux simulator [18] recognises the significance of global bias but only allows the effects of this to be simulated and does not provide a method for inferring parameters from a dataset or correcting for the bias in the dataset, aside from some shortcomings of its bias models [17]. Maxcounts [19] recognises the errors introduced by global bias but then uses the maximum number of overlapping reads in a transcript as a bias-independent means of measuring expression levels.
The resulting values are thus a function of a very small proportion of the reads, increasing the uncertainty associated with the expression levels and making them vulnerable to local sequence related bias.

Bias models
Bias correction in LiBiNorm is based on fitting functions to the coverage by sequencing reads along transcripts. Different functions are available that correspond to different models of how the coverages arise. The functions/models in general depend on transcript length and describe how the coverage shapes change with length, which we call the 'global bias' (in contrast to local coverage variation within transcripts).
This global bias is mainly introduced by cDNA conversion, which can be done in different ways and which is the main feature that discriminates different experimental protocols for RNA-seq library preparation. Our various models are designed to cater to those differences. Note that the main target of LiBiNorm is bias correction for Smart-seq2 datasets, requiring model BD (see below). The others are included for completeness and can be useful for testing purposes.
Common variations of cDNA production include the starting points of reverse transcription/1 st strand synthesis (e.g. at the 3' ends of transcripts by primers targeting the poly-A tail or internally by random priming etc) or 2 nd strand synthesis (e.g. at the end of 1 st strands by poly-A tagging etc). A central feature of our models is the description of the cDNA conversion as a stochastic process, where synthesis endpoints are probability distributions that depend on the enzymes' processivities.
Our different models predict characteristic overall shapes of the length dependent coverages, but specific aspects, such as the slopes of the coverage curves, are subject to parameters. It is the estimation of these parameters through fitting of our models that allows the global bias to be corrected; this is because the two main parameters, t1 and t2, correspond to the 1 st and 2 nd strand processivities, a major determinant for the varying effectiveness to produce cDNA along transcripts, thus introducing global bias.
A summary of the models LiBiNorm includes, and the coverage functions, library preparation characteristics and typical coverage curves associated with these, are shown below. Derivation of the models and more detailed descriptions can be found in our previous work [17]. Models A and C are included for completeness only and are based on unrealistic assumptions (perfect enzymatic conversions, i.e. infinite processivities), while model E applies to random-priming based datasets, which is hardly in use anymore. Most relevant is model BD, which builds on the models B and D (see below).
For simplicity, we did not include here the additional parameters of d and h, which describe a reduced coverage (by factor d+1) at the ends of transcripts (over h bases) due to reduced fragmentation efficiency. l corresponds to the transcript length, x is the position within the transcript, t1 and t2 correspond to the 1 st and 2 nd strand processivities, respectively. a corresponds to the proportion of PCR enrichment of full-length 2 nd strands for SMART protocols. Processivity parameters for model E (t1', t2') have slightly different interpretations and include further scaling factors a1 and a2, please see [17] for details.

Evaluation of bias Removal -Methods
The evaluation of bias removal was done using the Drosophila RNA-seq data generated using the Smart-seq2 and TruSeq protocols as part of an investigation into low cost RNA-seq protocols [20]. As part of this study, the authors introduced up to 20% D. virilise RNA in the D. Melanogaster RNA as a natural RNA spike-in in order to assess expression quantification accuracy. They also performed the Smart-seq2 protocol with 2.5-fold and 5-fold dilution of the Nextera reagents as well as at the recommended concentration in order to gauge potential cost savings. These dilutions provide variation in the associated conditions for both the TruSeq and the Smart-seq2 protocol and allow the global bias normalisation to be tested across this range of conditions.
A simple R 2 measure of correlation was used to evaluate the reduction in global bias associated with Smart-seq2 data using LiBiNorm and this same measure was used to evaluate the performance of other expression quantification packages which contain some degree of bias removal.
The reference expression levels were the expression levels in Transcripts per Kilobase Million (TPM) [5] for the four TruSeq samples (SRR1743167 to SRR1743170). The R 2 statistic was then used to quantify the correlation between the log2 of these reference expression levels and those of the 14 Smart-seq2 samples (SRR1743153 to SRR1743166). There will be a number of factors that result in the correlation being less than perfect. As well as the biological and technical noise associated with such measurements, the effect of global bias, which is particularly pronounced in the Smart-seq data, will decrease the correlation. Any reduction in the global bias upon correction efforts will improve the correlation, which would be seen as an increase in the R 2 statistic towards 1.0, indicating a perfect correlation.
These R 2 statistics were calculated before and after LiBiNorm was used to reduce the global bias present in the Smart-seq2 data. The improvement was expressed as a percentage where an R 2 value of 1.0 would be a 100% improvement using the following formula: , where I(%) is the percentage improvement, is the R 2 value for the reference linear TPM expression levels and R 2 is the R 2 value with the global bias reduced Smart-seq2 data.
We repeated this with four other expression quantification packages; Cufflinks [1], Salmon [16], Mix 2 [4] and MaxCount [19]. In the case of Cufflinks and Salmon, the performance was assessed with and without the optional additional bias compensation that is available within the packages.
In all cases, the same gene definitions were used as defined in Drosophila_melanogaster. BDGP6.91.gtf, release 91 of the Ensemble Drosophila melanogaster gene annotation, which is based on the BDGP6 reference genome [21]. The details of how the gene annotation information is used is dependent on the software package being assessed.
When the RNA-seq data is analysed by LiBiNorm, the <fileroot>_counts.txt file produced by LiBiNorm shows the length and the read count for each gene and this is used to calculate the standard TPM expression values for each gene. Only those genes where the read counts for both the TruSeq and the Smart-seq2 data was greater than nine were used to quantify the correlation between the two sets of expression values.
The quantification of correlation values for the other packages for each Smart-seq2/TruSeq combination was restricted to the same set of genes in order help ensure the comparisons were as equivalent as possible.

LiBiNorm
Read alignment and expression quantification were performed as described in the main body of the paper.
The <filename>_counts.txt provides the bias corrected TPM values as well as the transcript lengths and raw counts for each gene. An excel spreadsheet was used to calculate the standard linear TPM values, find the log2 expression values, calculate the R 2 value for those genes where the count exceeded 9 for both samples and plot the correlation graph.

Cufflinks
The following was used to evaluate the gene expression using Cufflinks: The -G option ensures that Cufflinks aligns and counts the reads just to the same transcript definitions as was used for the LiBiNorm evaluation.
Bias correction in Cufflinks is optional; the following was used to determine gene expression with this feature enabled: The FKPM column from the genes.fpkm_tracking file was used to calculate the correlation between the TruSeq and Smart-seq2 results. The correlation is identical to that which would be found using TPM values because FPKM and TPM values for any RNA-seq dataset differ only by a single scaling factor.

Salmon
Salmon is not designed to work with reads that are aligned to a full genome but instead works with reads that have been aligned to a reference transcript set.
LiBiNorm includes a feature to create a fasta file that corresponds to the transcripts that it is using to determine gene expression. Such a fasta file, corresponding to the transcripts used for the LiBiNorm evaluation, was generated using: LiBiNorm refSeq -i gene_id Drosophila_melanogaster.BDGP6.91.gtf bdgp6_tran\genome_tran > refseq.fa where bdgp6_tran\genome_tran is the root filename of the HISAT2 reference genome previously used to align the RNA-seq data.
Salmon was then used to convert the fasta file into a suitable index using: We calculated the R 2 statistic using the TPM values in the quant.sf files that were generated.

Mix 2
The same aligned reads used for the LiBiNorm analysis were used to evaluate Mix 2 . Gene expression was then quantified using: We calculated the R 2 statistic using the FPKM_CHN column of the genes_summary_<fileroot>.dat files.

MaxCount
The same bamfiles of aligned reads were used as for the LiBiNorm analysis. The count of the maximum number of overlapping reads within each exon was then calculated using bedtools coverage -max -abam <fileroot>.bam -b genome.bed > <fileroot>_counts.txt This produces a text file listing the maximum count for each exon in the bed file. An Excel pivot table was then used to find the maximum count across all of the exons associated with a specific gene.

Evaluation of bias removal -results
The reference for the analysis of bias removal was the R 2 correlation statistic of the comparison of the log2 of the linear TPM expression values for all 56 combinations of the Smart-seq2 and TruSeq data (Supplementary Figure 1 and Table 1).
These were then compared with the R 2 correlation statistics for the same 56 combinations when LiBiNorm was used to correct for the global bias in the Smart-seq2 data (Supplementary Figure 2 and Table 2). For each of these combinations the improvement in R 2 was calculated as a percentage and the mean improvement used as a measure of the effectiveness of the bias correction.
For comparison, a similar process was performed with Cufflinks, with and without the '-b' bias correction option (Supplementary Figures 3 & 4 and Tables 3 & 4) Table 7) and MaxCount (Supplementary Figure 8 and Table 8).
In all of the following tables the sample identifiers are abbreviated for clarity, such that SRR1743153 is identified as ..153.
The following colour scales are used for all of the tables: Supplementary Figure 2 (Figure 2b in the main paper). Correlation for the same SRR1743160/SRR1743167 combination as Supplementary  Figure 1 with global bias correction applied to Smart-seq2 data.  Figure  3. Correlation of the same SRR1743160/SRR1743167 as Supplementary Figure 1 with expression quantified using Cufflinks.      Table 5 but with --seqBias --posBias bias correction options applied when processing the data using Salmon.

Cufflinks with Bias Correction
Supplementary Figure 6. Gene expression correlation as Supplementary Figure 5 but with the --seqBias --posBias bias correction options applied when processing the data using Salmon.  Figure 8. Gene expression correlation for the same SRR1743160/SRR1743167 combination as Supplementary Figure 1 but with expression quantified using MaxCount.