The case for using Mapped Exonic Non-Duplicate (MEND) read counts in RNA-Seq experiments: examples from pediatric cancer datasets

Background The accuracy of gene expression as measured by RNA sequencing (RNA-Seq) is dependent on the amount of sequencing performed. However, some types of reads are not informative for determining this accuracy. Unmapped and non-exonic reads do not contribute to gene expression quantification. Duplicate reads can be the product of high gene expression or technical errors. Findings We surveyed bulk RNA-Seq datasets from 2179 tumors in 48 cohorts to determine the fractions of uninformative reads. Total sequence depth was 0.2-668 million reads (median (med.) 61 million; interquartile range (IQR) 53 million). Unmapped reads constitute 1-77% of all reads (med. 3%; IQR 3%); duplicate reads constitute 3-100% of mapped reads (med. 27%; IQR 30%); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (med. 25%; IQR 21%). Informative reads--Mapped, Exonic, Non-duplicate (MEND) reads--constitute 0-79% of total reads (med. 50%; IQR 31%). Further, we find that MEND read counts have a 0.22 Pearson correlation to the number of genes expressed above 1 Transcript Per Million, while total reads have a correlation of −0.05. Conclusions Since the fraction of uninformative reads vary, we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment. We provide a Docker image containing 1) the existing required tools (RSeQC, sambamba and samblaster) and 2) a custom script. We recommend that all results, sensitivity studies and depth recommendations use MEND units.

Informative reads--Mapped, Exonic, Non-duplicate (MEND) reads--constitute 0-79% of total reads (med. 50%; IQR 31%). Further, we find that MEND read counts have a 0.22 Pearson correlation to the number of genes expressed above 1 Transcript Per Million, while total reads have a correlation of -0.05.

Conclusions
Since the fraction of uninformative reads vary, we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment. We provide a Docker image containing 1) the existing required tools (RSeQC, sambamba and samblaster) and 2) a custom script. We recommend that all results, sensitivity studies and depth recommendations use MEND units.
Keywords RNA-Seq "sequencing depth" duplicate unmapped exonic quality 3 Background Assessing the accuracy and reproducibility of gene expression results obtained from the analysis of RNA-Seq data has been a priority since the development of the assay. Seminal studies showed the relationship between the amount of sequence data generated during an experiment (depth of sequencing) and the reproducibility of the resulting gene expression measurements [1,2] . However, RNA-Seq data is not homogenous. Of the tens of millions of sequences (reads) in a typical RNA-Seq dataset (the data generated from one biological sample), some reads cannot be mapped back to the reference transcriptome. Others map to genome regions outside of exons or have been duplicated during the library construction process or sequencing. Nearly all methods for quantifying gene expression in bulk RNA-Seq data count reads that align to exons in a gene; thus, unmapped and non-exonic reads do not contribute to measurements and are uninformative regarding the accuracy of the experiment [3,4] . Therefore, considering the total number of reads as a proxy for RNA-Seq gene expression accuracy can result in inflated accuracy estimates.
Duplicate reads may be due to either highly abundant transcripts or technical error. The process of preparing RNA-Seq libraries involves PCR amplification. This step can generate duplicated identical or nearly identical reads. While the original read represents gene expression in the experimental system, the artifactual duplicate reads do not. However, duplicate reads are also generated by very highly expressed genes since each gene has a finite number of unique read sequences that can be generated from it [5] .
Here we analyze 2179 bulk, paired end, polyA-selected RNA-Seq datasets to characterize the read types present in the datasets and evaluate what fraction of commonly reported data is unequivocally relevant to the accuracy of gene expression measurements. We compare the correlation of total reads and MEND reads to the number of measured genes.
Briefly, input is a genome-aligned bam file containing RNA-Seq read data. Duplicates are marked with Samblaster v0.1.22 [7] , and the RSeQC v2.7.10 [8] script read_distribution.py quantifies exonic read and tag counts, excluding QC fail and duplicate reads as well as secondary alignments. The script parseReadDist.R, which we wrote, estimates the number of MEND reads based on RSeQC output by summing the tag counts in CDS exons, 5' UTR exons and 3' UTR exons and multiplying by reads per tag. Since a pair of reads provides information about two nearby sequences in a single transcript, read counts are reported in pairs. For example, 20 reads means that there are 20 pairs of reads. The process for estimating MEND read counts is available as a stand-alone docker image [9] and can be executed on CodeOcean [10] . The source code is freely available on GitHub [11] .

Data description
Here we discuss 2179 publicly available, bulk RNA-Seq datasets we gathered for the RNA-Seq compendium [12] used for comparative single-patient analysis [6] . Accession numbers, clinical data and read counts for each dataset are in Table S1. Repositories and cohort information is aggregated in Tables   The datasets came from five repositories (Table S2). Each was assigned to a cohort based on 1) project accession (for EGA and SRA datasets), 2) disease sub-study for NCI Therapeutically Applicable Research to Generate Effective Treatments (TARGET), or 3) disease for datasets in the St Jude Cloud. Cohorts were assigned IDs in descending order of size. Cohort assignments were intended to approximate a typical sequencing project performed by one research group at one sequencing center. The cohorts range in size from 3 to 337 datasets ( Figure 1A); the median number of datasets in a cohort is 24.5.
All libraries were prepared with polyA selection. All data were generated via paired-end Illumina sequencing technology. The median sequence length is 101 bases ( Figure 1B). Data analysis RNA-Seq read data was aligned to the genome with the TOIL RNA-Seq pipeline previously described [13] . Briefly, adapters were removed with CutAdapt v1.9. Reads were then aligned with STAR v2.4.2a with indices based on GRCh38 and gencode v23. RSEM v1.2.25 was used to quantify gene expression. The source code of the pipeline is available [14] . MEND read counts were calculated with MEND qc release v1.1.1.

Read types in RNA-Seq data
We interrogated the read types present in our RNA-Seq datasets in our gene expression quantification pipeline (Fig 1A). We obtained the number of total and mapped reads from the aligner log. We marked duplicates in the aligned BAM file, and counted them, along with exonic reads, using RSeQC. Duplicate reads are reported as a fraction of mapped reads, and exonic reads are reported as a fraction of non-duplicate reads. Most RNA-Seq datasets contain a small percentage of unmapped reads (Fig 2B). In the data from 2179 datasets, 75% of datasets have fewer than 6% unmapped reads (Fig 3A). The distribution is left-skewed with a long right tail. The value of excluding these from read counts is apparent, as these reads do not correspond to any known expressed gene; in 77 datasets, more than 25% of reads are unmapped. Including those reads in any measure of the sensitivity of gene expression measurement would misguide the researcher.
The percentage of mapped reads that are duplicate reads ("percent duplicates") is more varied.
426 datasets have more than 50% duplicates (Fig 3A). Some cohorts are characterized by high duplicate fractions (Fig 3B). For example, 72 of the 127 datasets in Cohort 4 have more than 98% duplicates. All 72 have a total sequencing depth above 170 million reads.
However, even cohorts with generally low duplicate fractions can contain anomalous datasets; of the 41 cohorts with a median of less than 50% duplicates, 26 contain at least one dataset with more than 50% duplicates.
If duplicate reads were only a function of genes being especially deeply sequenced, we would expect sequencing depth to explain most of the variability in the fraction of duplicate reads. The total sequencing depth has a 0.58 Pearson correlation with the fraction of duplicate reads, explaining 34% of the variability (Supplemental Figure 1). The majority of the variability in the fraction of duplicate reads is independent of read depth. Consequently, the fraction of duplicate reads cannot be inferred from the total read depth.
Like percent of duplicates, the percent of non-exonic reads among all mapped, non-duplicate reads ("percent non-exonic") has a broad distribution compared to other read type fractions, with an IQR of 21%. 330 datasets have a fraction of non-exonic reads above 50%. Correlations to the number of genes expressed above 1, 2 and 3 TPM are shown.

Correlation of read counts with the number of genes measured
If total or all mapped read counts were informative about the sensitivity of gene expression measurements, we would expect them to correlate with the number of expressed genes. When calculated using all 2179 datasets, total and mapped reads were inversely correlated (Pearson r=-0.4 for both) with the number of genes with expression above 1 Transcripts Per Million (TPM) (Supplemental Figure 2). We recalculated the correlations after excluding 1) the 78 datasets with fewer than 100 measured genes (all of which had more than 170 million total reads) and 2) the 105 datasets sequenced to more than 250 million total reads (such deep sequencing is usually intended for detecting rare events rather than measuring gene-level expression). In the resulting data from 1996 datasets, total and mapped reads are not correlated with the number of genes with expression above 1 TPM (Pearson r = -0.05 and -.0.04, respectively; Figure 2C).
MND and MEND read counts are correlated at Pearson r = 0.22 and 0.23, respectively. When genes with higher expression are counted, the correlation of MEND counts to measured genes increases, while the correlation for MND decreases.

Conclusion
Researchers want to know that their data is sufficient for the measurements they're making.
Here we show that, for the purpose of determining whether an RNA-Seq dataset is sufficient for accurately measuring expression of known genes, the fraction of relevant content of an RNA-Seq dataset (percent of MEND reads) varies substantially within and between cohorts. We confirm the relevance of MEND read counts to gene expression measurements by demonstrating that MEND read counts are correlated to the number of measured genes and total read counts are not.
This work was performed in pediatric samples as part of the development of our comparative RNA-Seq assay for pediatric cancer patients [6,12] . Since the factors that reduce the quality of RNA-Seq datasets (e.g. degradation, low sample volume, contamination, low base quality) are not specific to pediatric cancer samples, we predict that other kinds of RNA-Seq datasets would also show compositional variability. The MEND read counting tool is independent of species and genome version; it can be used on any bulk RNA-Seq dataset.
There are several reasons why a survey of this breadth has not been previously performed.
Obtaining and processing clinical datasets from multiple sources is an intensive effort. Tumor datasets are usually controlled access, and obtaining the 48 cohorts we report on here required multiple legal agreements [25] . Analyzing read types requires genome-aligned reads; the files containing genome-aligned reads are large and are not generated when using the much faster pipelines that quantify gene expression via pseudoalignment. Large RNA-Seq cohorts such as GTEX and TCGA use consistent methods and exclude datasets that fail their stringent and consistent quality control [26,27] . They lack the kind of heterogeneity observed in our cohorts. In short, generating this data for more than 2000 datasets is time-consuming, expensive, and requires staff with a variety of expertises.
Measuring the number of MEND reads in a dataset is specific to the alignment parameters and gene model. We use Gencode v23, which is inclusive, defining more than 60,000 genes. By default, the aligner we use, STAR, defines reads that map to as many as 20 positions as mappable. If we changed our pipeline, asking STAR to exclude reads mapping to more than 2 positions and using a more conservative gene model with 30,000 genes, the same dataset would have fewer MEND reads due to the loss of reads that map too much or map only to regions newly defined as non-exonic.
People planning RNA-Seq experiments look for guidance on how much sequencing their experiment requires. For comparing gene expression measurements between datasets, ENCODE recommends a minimum of 30 million mapped reads [28] ; the GEUVADIS consortium study had a minimum goal of 20 million reads [29] . However, of the 2078 datasets in this study with more than 30 million mapped reads, 16% contain fewer than 25% informative (MEND) reads. We speculate that these guidelines were not intended to include those datasets, some of which measure fewer than 100 genes.  Supplemental Table 1 https://docs.google.com/spreadsheets/d/1awKt5e3wYMWiliMwida1-8HLYrhUaOWP1GK26uV5a gs/edit?usp=sharing The accession numbers in Table S1 are the definitive sources of the RNA sequencing data. The DOI links to citations are provided for convenience. They may reflect the citation provided by the data provider, a citation we identified referring to the RNA sequencing data, or a citation we identified referring to the patient whose tumor was sequenced. Supplemental