Analysis of stranded information using an automated procedure for strand specific RNA sequencing

Sigurgeirsson, Benjamín; Emanuelsson, Olof; Lundeberg, Joakim

doi:10.1186/1471-2164-15-631

Methodology article
Open access
Published: 28 July 2014

BMC Genomics volume 15, Article number: 631 (2014)

Abstract

Background

Strand specific RNA sequencing is rapidly replacing conventional cDNA sequencing as an approach for assessing information about the transcriptome. Alongside improved laboratory protocols, the development of bioinformatical tools is steadily progressing. In the current procedure the Illumina TruSeq library preparation kit is used, along with additional reagents, to make stranded libraries in an automated fashion, which are then sequenced on the Illumina HiSeq 2000. Using freely available bioinformatical tools we show, through quality metrics, that the protocol is robust and reproducible. We further highlight the practicality of strand specific libraries by comparing expression of strand specific libraries to non-stranded libraries, by looking at known antisense transcription of pseudogenes and by identifying novel transcription. Furthermore, we compare two ribosomal depletion kits, RiboMinus and RiboZero, and two sequence aligners, Tophat2 and STAR.

Results

The non-stranded Illumina TruSeq kit can be adapted to generate strand specific libraries and can be used to access detailed information on the transcriptome. The RiboZero kit is very effective in removing ribosomal RNA from total RNA and the STAR aligner produces a high mapping yield in a short time. Strand specific data give more detailed and correct results than non-stranded data, as we show when estimating expression values and when assembling transcripts. Even well annotated genomes need improvements and corrections, which can be achieved using strand specific data.

Conclusions

Researchers in the field should strive to use strand specific data; it allows for more confidence in the data analysis and is less likely to lead to false conclusions. If faced with analysing non-stranded data, researchers should be well aware of the caveats of that approach.

Background

The transcriptome has long been studied by reverse transcribing single stranded RNA into double stranded cDNA, which is then assessed with assays such as PCR [1, 2], microarrays [3, 4] or massively parallel sequencing [5, 6]. By assessing gene expression through cDNA the strand information of the RNA is lost. With the advent of many strand specific RNA library preparation protocols, an increasing number of RNA sequencing experiments are generating stranded RNA sequencing data [7–9]. Without strand information it is difficult to determine correct gene expression from overlapping genes, i.e. genes that have the same location in the genome, at least partly, but are transcribed from opposite strands. Knowing the strand information of the cDNA is essential to determine from which of the overlapping genes the RNA originates. Such overlapping genes in mammalian genomes, while not frequent, are more common than previously thought [10, 11] and they are widespread in genomes of other species, especially those with small and compact genomes [12].

Increasing exploration of the transcriptome has led to the discovery of a multitude of RNA species [13]. Of particular interest with regard to strand information is antisense RNA (asRNA), which is a transcribed RNA that is complementary, i.e. on the opposite strand, to another gene, usually a protein coding gene. Thus by definition all antisense genes are overlapping genes. The most straightforward regulatory function of asRNA is its ability to hybridize to its corresponding sense mRNA and hinder translation of that particular mRNA molecule. This, however, is just one function of many and asRNA encompasses many different types of RNA [14]. A relatively newly discovered feature of asRNA is the antisense transcription of pseudogenes [15, 16]. Pseudogenes, evolutionary remnants of gene duplication, were long thought to be silent and non-functional. Still, while prokaryotes rapidly lose pseudogenes from their genomes, complex multicellular animals like mammals often retain their pseudogenes, suggesting evolutionary conservation and thus function. Evidence is now mounting for various regulatory functions of pseudogenes [17].

A handful of protocols have been published which retain the strand information of the RNA, with varying degrees of success and labor intensity. In 2009 Parkhomchuk et al. [7] published a strand specific library protocol which has since become popular among such protocols, being both relatively simple and effective. The protocol is called the dUTP second strand marking method, or dUTP method for short, and consists of using dUTPs instead of dTTPs during the synthesis of the second strand in the cDNA synthesis step during sample preparation. Then, prior to PCR amplification, the uracil in the second strand is degraded using Uracil-N-Glycosylase (UNG). With the second strand partly degraded, only the first strand is amplified in the subsequent PCR. This particular strand specific protocol was evaluated as superior in terms of simplicity and data quality in a benchmark study of strand specific protocols [18].

In the current study we modify specific steps in a scalable transcriptome preparation method [19] to combine the strand specific dUTP method [7] and the Illumina TruSeq RNA sample preparation kit (# RS-122-2001) into an automated strand specific RNA sequencing protocol. By preparing libraries from different cancer cell lines we show that the stranded protocol is reproducible, compares well to its non-stranded counterpart [19] and requires little extra hands-on time in sample preparation. From our sequencing data we compare the performance of two sequence aligners: Star [20] and Tophat2 [21]. In contrast to the published method [19] we use ribosomal depletion instead of polyadenylation selection to enrich RNA, and here we evaluate two ribosomal depletion kits: RiboMinus (Ambion®) and RiboZero Gold (Epicentre). We then highlight some advantages of stranded libraries by performing a differential expression analysis between strand specific and non-stranded libraries and note how this procedure can be used to probe the annotation of the genome. Finally, we turn our attention to high coverage strand specific data to further explore stranded features of the transcriptome; we validate the antisense transcription of the pseudogene PTENP1, which has been shown to be involved in the regulatory network of the expression of the gene PTEN [16], and we report novel transcription in the U2OS cell line.

Results

Sample preparation

Apart from the ribosomal depletion step, all sample preparation steps (carboxylic acid (CA) purification, cDNA synthesis and library preparation) were carried out as described in [19] (and in Methods) on a Magnatrix™ 1200 Biomagnetic Workstation (MBS) (Nordiag ASA, Oslo, Norway), which is equipped with a 12 tip head suitable for preparing 12 samples in parallel. The stranded protocol differs from the non-stranded protocol in two ways. First, during cDNA synthesis a CA purification step, carried out on the MBS, is introduced after the first strand synthesis, after which the second strand synthesis continues as normal except that the nucleotide mix includes dUTPs instead of dTTPs. This CA purification step is necessary to remove all the dTTPs prior to second strand synthesis. Second, after library preparation [19, 22], a second strand digestion step is added. This step ensures that only the first strand survives the subsequent PCR amplification step and hence that the strand information of the libraries is preserved. Each of these additional steps adds 45-60 minutes to the total preparation time, with about 15-20 minutes of that being hands-on. Additional file 1 shows the main automated steps of sample preparation and highlights the difference between the non-stranded method and the strand specific method.

In total 15 libraries were prepared, 12 strand specific and 3 non-stranded. All libraries returned a high yield: 78.1 ng/μl and 110.4 ng/μl on average for the strand specific and non-stranded libraries, respectively. All the libraries had comparable mean fragment lengths: 259 bp and 245 bp on average for the strand specific and non-stranded libraries, respectively. Additional file 2 shows the concentration and the mean fragment length of each of the 15 libraries.

Sequencing

All libraries were sequenced on the Illumina HiSeq 2000 generating 100 bp paired end reads. The 15 libraries were divided into 5 groups depending on how they were prepared as shown in Table 1. In Table 1 the Group ID shows the RNA source (A431, U251 or U2OS), the enrichment method (RiboMinus or RiboZero) and the library type (strand specific or non-stranded). Also shown in Table 1 is the average number of raw read pairs generated in the sequencing. The raw sequencing reads are available at the NCBI Sequence Read Archive under accession SRP043027 (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra/).

Table 1 Overview of library groups

Trimming and mapping

Read alignment was performed with Tophat v2.0.4 [21] and Star v2.3.1o [20] on raw reads and on quality reads, i.e. reads that had been through adapter removal and quality trimming (see Methods). Tophat mapped 59.8% of the raw reads on average compared to 88.2% for Star. For the raw reads the average mapping speed, measured in mapped read pairs per second, was 542 for Tophat compared to 50000 for Star. The percentage of raw reads discarded from analysis by the quality trimming step was 0.71, 0.40, 0.97, 0.50 and 3.66 for Groups 1, 2, 3, 4 and 5, respectively. Tophat mapped 74.6% of the quality reads on average compared to 94.3% for Star. For the quality reads the average mapping speed, measured in mapped read pairs per second, was 701 for Tophat compared to 64900 for Star. Thus, Star is nearly one hundred times faster than Tophat (about 92-fold for both read sets).
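The speed ratio can be checked directly from the figures quoted above. The following is a minimal Python sketch; the dictionary layout and the function name `speedup` are illustrative choices, and only the numbers are taken from the text.

```python
# Average mapping speeds in mapped read pairs per second, as quoted in the text.
speeds = {
    "raw": {"Tophat": 542, "Star": 50000},
    "quality": {"Tophat": 701, "Star": 64900},
}


def speedup(read_set):
    """Return how many times faster Star maps than Tophat for one read set."""
    pair = speeds[read_set]
    return pair["Star"] / pair["Tophat"]


for read_set in speeds:
    print(f"{read_set} reads: Star is about {speedup(read_set):.1f} times faster")
```

Both ratios fall between 92 and 93, which is consistent with the statement that Star is nearly one hundred times faster.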

A graphical representation of these mapping attributes, mapping percentage and mapping speed, for both aligners, together with a comparison between the handling of raw data and quality data, is shown in Figure 1. For these attributes Star outperforms Tophat in all instances. Quality trimming also improves the mapping speed and the alignment yield, the latter not only in relative terms but in absolute terms as well (see Additional file 3). Based on these results all further downstream analysis was based on the quality trimmed data aligned with Star.

Quality control metrics

The robustness of the protocol and the quality aspects of the data were evaluated using the 15 libraries generated from the human cell lines (libraries 1-15) through different metrics; ribosomal RNA in data, strand specificity, duplication rate, gene body coverage and expression correlation. Details from some of these analyses can be found in Additional file 3.

Ribosomal contamination

To evaluate the efficiency of the ribosomal depletion, the rRNA reads in each library were quantified. On average the libraries treated with RiboMinus contained 65.7% rRNA compared to only 2.24% for the libraries treated with RiboZero. Figure 2a shows the average percentage of rRNA reads in each library group.

Strandedness of libraries

Figure 2b shows the average strand specificity of each library group. Here, strand specificity means the percentage of reads that match the strand of the annotation in the way expected from how the library was made. For dUTP libraries the first read in a read pair must be reversed so that it matches the annotation, while the second read matches the annotation directly. For the strand specific libraries treated with RiboMinus (library groups 1-3) the average strand specificity is 97.4%, while for the strand specific libraries treated with RiboZero (library group 5) the average strand specificity is 95.1%. This difference was found to be statistically significant (Student’s t-test, p < 0.05). The average strand specificity for the non-stranded libraries (library group 4) is 50.0%, as expected.

In general, these results show that the protocol works and in particular that the CA purification step is successful in removing dTTPs after the first strand synthesis.
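The definition of strand specificity used in this section can be written out as a short calculation. The following is a minimal Python sketch and not the tool used in the study: the function names and the tuple layout are hypothetical, and each read is assumed to have already been matched to a single annotated gene.

```python
def matches_annotation(read_strand, is_read1, gene_strand):
    """Return True if a read agrees with the gene strand under the dUTP convention."""
    # In dUTP libraries the first read of a pair aligns to the strand opposite
    # the gene, so it is flipped before comparison; the second read is compared
    # directly.
    if is_read1:
        read_strand = "-" if read_strand == "+" else "+"
    return read_strand == gene_strand


def strand_specificity(reads):
    """Percentage of reads whose strand matches the annotation.

    reads: iterable of (read_strand, is_read1, gene_strand) tuples.
    """
    reads = list(reads)
    hits = sum(matches_annotation(*read) for read in reads)
    return 100.0 * hits / len(reads)


# Toy input (made up for illustration): four reads agree, one does not.
toy_reads = [
    ("-", True, "+"),
    ("+", False, "+"),
    ("+", True, "-"),
    ("-", False, "-"),
    ("+", True, "+"),
]
print(strand_specificity(toy_reads))  # 80.0
```

Under this definition a non-stranded library is expected to score about 50%, since each read matches the annotated strand by chance half of the time.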

Duplication quantification

Figure 2c shows the average percentage of duplicates identified for each library group. There is some variation in duplication frequency in the libraries especially between library groups. The libraries treated with RiboZero show higher duplication rate, 52.5% on average, than the libraries treated with RiboMinus, 22.4% on average. This difference was thought to be related to the difference in sequencing depth. To verify that, the RiboZero data was downsampled to 15 million reads and the duplication rate quantified again. The duplication rate decreased from 52.5% to 37.0% when using the downsampled data and thus the high duplication rate can only be explained partly by the high sequencing depth. The duplication rate for this downsampled data is shown as hollow symbols in Figure 2c.
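The effect of sequencing depth on the duplication rate can be explored by downsampling before counting duplicates. The sketch below is a minimal Python illustration under stated assumptions: a duplicate is taken to be a fragment whose (chromosome, start, end, strand) key has already been seen, which is a simplification and not necessarily the criterion used in the study, and the function names are hypothetical.

```python
import random
from collections import Counter


def duplication_rate(fragments):
    """Percentage of fragments that repeat an already observed key."""
    counts = Counter(fragments)
    # Every copy beyond the first one of each key counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return 100.0 * duplicates / len(fragments)


def downsample(fragments, n, seed=0):
    """Draw n fragments at random without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(fragments, n)


# Toy input (made up for illustration): 6 fragments, 3 of which are duplicates.
toy_fragments = (
    [("chr1", 100, 300, "+")] * 3
    + [("chr1", 500, 700, "-")]
    + [("chr2", 10, 200, "+")] * 2
)
print(duplication_rate(toy_fragments))  # 50.0
print(duplication_rate(downsample(toy_fragments, 4)))
```

In real data the same comparison would be made between the full library and a subset of, for example, 15 million reads.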

Due to the risk of inaccurately identifying reads originating from highly expressed genes as duplicates, the duplicate reads were not removed prior to differential expression analysis.

Gene coverage and read distribution

Figure 2d shows the read coverage, normalized for the different read depths, and averaged over each library group. No discernible difference can be seen between the libraries and they all show even coverage across the gene body. Additional file 4 shows, for each group, how the reads are distributed to exons, introns and intergenic regions. Analysis of variance (ANOVA) revealed no significant difference in the read distribution between the groups.

Expression correlation

To further assess the robustness of the libraries the correlation of expression values between replicates was quantified. The mean Pearson correlation of the 16 possible correlations within replicates was R² = 0.96. Correlation plots and the Pearson correlation value for each comparison are shown in Additional file 5.
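The replicate comparison amounts to computing a Pearson correlation for every pair of libraries within a group and averaging the squared values. The following Python sketch shows one way to do this; the function names, the grouping dictionary and the toy expression values are assumptions made for illustration and do not reproduce the study's data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_r_squared(groups):
    """Mean R^2 over all replicate pairs within each group, and the pair count.

    groups: dict mapping a group ID to a list of per-library expression vectors.
    """
    r2 = [
        pearson(a, b) ** 2
        for replicates in groups.values()
        for a, b in combinations(replicates, 2)
    ]
    return sum(r2) / len(r2), len(r2)


# Toy expression values (made up): one group of three and one group of two.
toy_groups = {
    "g1": [[1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 5]],
    "g2": [[5, 1, 3], [10, 2, 6]],
}
mean_r2, n_pairs = mean_r_squared(toy_groups)
print(n_pairs, round(mean_r2, 2))
```

Only pairs within the same group are compared, so the number of correlations depends on how many replicates each group contains.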

Differential expression - strand specific vs. non-stranded data

The only difference between libraries in group 3 and group 4 is that the libraries in group 3 are strand specific, generated using the current approach, while the libraries in group 4 are non-stranded, generated using the approach in [19]. In order to explore the differences between these library types a differential expression (DE) analysis was carried out between these groups: first by downsampling each library so that they contain an equal number of reads, then by acquiring read counts per gene using htseq-count [23] and finally by using these read counts as input for the DE analysis using DESeq v1.10.1 [24]. Of the 62893 annotated genes (protein coding and non-coding), 41065 are assigned low or no expression in all of the six libraries. Of the remaining 21828 genes, 245 are found to be significantly differentially expressed genes, hereafter referred to as DEGs. Of these 245 DEGs, 69 have higher expression in the non-stranded libraries while 176 have higher expression in the stranded libraries. Intriguingly, the division of DEGs into protein coding and non-coding genes differs depending on whether the DEGs have a higher expression in the non-stranded data or in the stranded data. Of the 69 DEGs which show higher expression in the non-stranded data, 24 are protein coding and 45 are non-coding, while of the 176 DEGs which show higher expression in the stranded data, 136 are protein coding and 40 are non-coding. This expression profile is shown in Figure 3.
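The gene tallies in this paragraph can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the counts quoted in the text; the variable names and the helper `coding_share` are illustrative, and the percentages are derived here rather than reported in the study.

```python
total_genes = 62893      # annotated genes, protein coding and non-coding
low_expression = 41065   # low or no expression in all six libraries
tested = total_genes - low_expression

degs = {
    "higher in non-stranded": {"protein_coding": 24, "non_coding": 45},
    "higher in stranded": {"protein_coding": 136, "non_coding": 40},
}


def coding_share(direction):
    """Percentage of DEGs in one direction that are protein coding."""
    group = degs[direction]
    return 100.0 * group["protein_coding"] / sum(group.values())


total_degs = sum(sum(group.values()) for group in degs.values())
print(tested, total_degs)  # 21828 245
for direction in degs:
    print(f"{direction}: {coding_share(direction):.1f}% protein coding")
```

The protein coding share is roughly 35% among DEGs that are higher in the non-stranded data and roughly 77% among DEGs that are higher in the stranded data, which quantifies the asymmetry described above.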

To find out why these DEGs arise, coverage plots for a selection of the DEGs with the lowest p-values were analysed and compared to the annotation used for counting by htseq-count. All DEGs investigated that have a higher expression in the stranded data compared to the non-stranded data have overlapping annotation, which results in many reads mapping to those genes being labeled as ambiguous for the non-stranded data and hence in low expression. The explanation for DEGs with higher expression in the non-stranded data compared to the stranded data is not as straightforward, but scrutiny revealed three dominant reasons for these DEGs: i) the DEGs have overlapping features that are unannotated, ii) the DEGs are annotated in the wrong direction or iii) the DEGs have antisense intronic transcripts that get wrongly assigned to them. Additional file 6 shows coverage plots of selected DEGs along with their annotation and explanations for why these DEGs arise in this comparison, and Additional file 7 lists all the genes found to be significantly differentially expressed in this differential expression analysis.
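The way overlapping annotation turns reads into ambiguous reads in non-stranded data can be illustrated with a toy counting rule. The Python sketch below is a simplified stand-in for union-style read counting and is not htseq-count's implementation; the gene names, coordinates and the function `assign` are hypothetical, and the read strand is assumed to be the transcript strand after the library convention has been applied.

```python
def assign(read, genes, stranded):
    """Assign a read to a gene name, or to 'ambiguous' or 'no_feature'.

    read:  (chrom, start, end, strand), half-open coordinates
    genes: dict mapping gene name to (chrom, start, end, strand)
    """
    chrom, start, end, strand = read
    hits = [
        name
        for name, (g_chrom, g_start, g_end, g_strand) in genes.items()
        if g_chrom == chrom
        and start < g_end
        and end > g_start
        # The strand is only checked when the library is strand specific.
        and (not stranded or g_strand == strand)
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]


# Two hypothetical genes that overlap on opposite strands.
toy_genes = {
    "geneA": ("chr1", 100, 500, "+"),
    "geneB": ("chr1", 300, 800, "-"),
}
toy_read = ("chr1", 350, 450, "+")

print(assign(toy_read, toy_genes, stranded=True))   # geneA
print(assign(toy_read, toy_genes, stranded=False))  # ambiguous
```

The same read contributes to geneA in the stranded case and to neither gene in the non-stranded case, which is the mechanism behind the lower expression values described above.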

This analysis also demonstrates how essential it is to have strand specific libraries for compact genomes with high abundances of overlapping genes since without strand specificity a large proportion of the genes would be labeled as ambiguous.

Transcriptome assembly - strand specific vs. non-stranded data

For each library in groups 3, 4 and 5 two transcript assemblies were made using Cufflinks [25]. The first, termed raw assembly, used all mapped reads while the other, termed novel assembly, used only those reads that did not map to the Ensembl reference annotation (version GRCh37.72). Then the assemblies within each group were merged together using Cuffmerge [25]. In addition, library group 5 was assembled again without supplying Cufflinks with the information that it was strand specific, thus generating a pseudo non-stranded assembly.

From these assemblies it was found that strand specific data generates fewer transcripts compared to non-stranded data and the average transcript length is usually shorter for the strand specific data as compared to the non-stranded data. The same holds true when comparing the assemblies between library group 5 treated as strand specific to library group 5 treated as non-stranded. An overview of the assembly results is shown in Additional file 8.

Antisense transcription of the pseudogene PTENP1

The antisense transcription of the pseudogene PTENP1 has previously been reported and suggested to play a role in the regulation of the gene PTEN [16]. From the high coverage data of the U2OS cell line (Library Group 5) this antisense transcription was verified and further evaluated. The coverage plot in Figure 4 shows clear antisense expression of the PTENP1 pseudogene. The RefSeq database has this antisense transcript annotated as a gene with one isoform containing four exons, as shown in the uppermost annotation track in Figure 4 (PTENP1-AS). The Ensembl database, however, does not have this antisense transcript annotated but it does have two genes annotated further downstream, as shown in the Ensembl annotation track in Figure 4. The new annotation presented here, based on the raw assembly results from library group 5, is shown in the bottom annotation track in Figure 4. It suggests two new isoforms of the PTENP1 antisense gene, one of which includes a new exon (PTENP1-AS2). This new exon overlaps the two annotated genes from the Ensembl annotation, indicating that they may not be separate genes but part of the PTENP1 asRNA.

Using an annotation which included the RefSeq isoform, PTENP1-AS, and our two new isoforms, PTENP1-AS2 and PTENP1-AS3, the isoform expression levels were evaluated with Cufflinks. PTENP1-AS2 is expressed with an average FPKM value of 0.64 but the other isoforms, PTENP1-AS and PTENP1-AS3, showed no expression according to Cufflinks.
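For orientation, FPKM stands for fragments per kilobase of transcript per million mapped fragments. The Python sketch below implements only this basic definition; Cufflinks estimates isoform abundances with a statistical model, so the function is not a reimplementation of it. The input numbers are hypothetical and were chosen so that the result has the same magnitude as the value quoted above.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 32 fragments on a 2000 bp transcript
# in a library with 25 million mapped fragments.
print(fpkm(32, 2000, 25_000_000))
```

An FPKM below 1 therefore corresponds to only a handful of fragments per kilobase even in a deeply sequenced library, which is why high coverage data are needed to evaluate such transcripts.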

Novel expression in U2OS

Identification of novel genes was attempted by using the novel assembly from library group 5. By counting the mapped reads towards this novel assembly the most highly expressed genes were investigated. Many of the assembled transcripts were evidently intron transcripts and others matched the current annotation, at least partly. Other transcripts were potentially novel genes.

One interesting example is a pair of overlapping novel genes on chromosome 17: 25380000-25500000. Figure 5 shows this region along with the annotations from Ensembl and the prediction by Cufflinks. Currently the only known annotation in this region is the pseudogene TUFMP1 (ENST00000581294) but the data shown here clearly indicate more transcriptional activity, originating from both strands. The open reading frames of these novel transcripts indicate that they are non-coding. Comparing the transcription of the locus between the three cell lines shows that this transcription is exclusive to the U2OS cell line (see Additional file 9).

Two other interesting findings can be found in the Additional files: a U2OS cell specific transcription on chromosome 6, possibly a pseudogene, is shown in Additional file 10, and ubiquitous transcription on chromosome 14, in a region which is currently annotated as two genes but which our data suggest is two exons of one gene, is shown in Additional file 11.