Digital expression explorer 2: a repository of uniformly processed RNA sequencing data

Abstract Background RNA sequencing (RNA-seq) is an indispensable tool in the study of gene regulation. While the technology has brought with it better transcript coverage and quantification, there remain considerable barriers to entry for the computational biologist to analyse large data sets. There is a real need for a repository of uniformly processed RNA-seq data that is easy to use. Findings To address these obstacles, we developed Digital Expression Explorer 2 (DEE2), a web-based repository of RNA-seq data in the form of gene-level and transcript-level expression counts. DEE2 contains >5.3 trillion assigned reads from 580,000 RNA-seq data sets including species Escherichia coli, yeast, Arabidopsis, worm, fruit fly, zebrafish, rat, mouse, and human. Base-space sequence data downloaded from the National Center for Biotechnology Information Sequence Read Archive underwent quality control prior to transcriptome and genome mapping using open-source tools. Uniform data processing methods ensure consistency across experiments, facilitating fast and reproducible meta-analyses. Conclusions The web interface allows users to quickly identify data sets of interest using accession number and keyword searches. The data can also be accessed programmatically using a specifically designed R package. We demonstrate that DEE2 data are compatible with statistical packages such as edgeR or DESeq. Bulk data are also available for download. DEE2 can be found at http://dee2.io.


Background
Since its first description 10 years ago, RNA sequencing (RNAseq) has become a powerful method in transcriptomics, allowing highly accurate gene expression quantification [1]. As the cost of sequencing decreases, RNA-seq data are becoming more ubiquitous in the scientific literature. It is standard practice in the field and a compulsory requirement for journals to deposit these data to Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) [2,3] in the form of raw and processed files, with the aim of fostering greater reuse and transparency. In practice, however, there are several hurdles that impede widespread reuse by biologists. First, processing raw sequence data from SRA requires significant computational resources and command-line expertise. Second, the processed RNA-seq data hosted by GEO are prepared in assorted formats that utilize various software tools and genome annotation sets, which complicates meta-analyses. Despite the value of these data to the scientific community and tremendous cost to generate them, RNA-seq data aggregation efforts have been largely limited to human and mouse [4,5] or are closed source/subscription services [6]. BgeeDB provides array and sequencing-based expression data on many animal species with a particular focus on high-quality measurements of baseline samples at different life stages (excluding disease, treatments, or genetic perturbations) [7]. Expression Atlas is one of the most comprehensive repositories of processed expression microarray data with an informative graphical interface, but only a comparatively small number of RNA-seq data sets are currently included [8]. In an effort to boost reuse of public transcriptome data, we developed Digital Expression Explorer 2 (DEE2), an open-access web-based repository of uniformly processed RNAseq digital gene-level and transcript-level expression data for several major organisms that is compatible with many types of downstream analyses.

Data Processing
DEE2 consists of 3 parts: (i) a pipeline that downloads and process raw data sets from SRA; (ii) a data repository where processed files are collected, filtered, and organized/stored and job queues are generated; and (iii) a web server where users can search metadata and obtain data sets of interest. A schematic diagram of the organization of DEE2 is provided in Fig. 1. Data processing nodes request SRA run accession numbers from the web server and obtain raw data from SRA. Processed data are sent to the web server, validated, and relayed to the DEE2 repository server. The repository server performs further validation checks, incorporates new data sets into the repository, collects corresponding metadata from SRAdbV2 [9], and queues outstanding jobs. The repository server then sends updated metadata and job queue information to the web server. End users obtain data from the web browser, command line, or bulk dumps.

Pipeline Features
The DEE2 pipeline uses containerization to enable rapid application deployment and guarantees analytical reproducibility across different computer systems. End users can run the Docker image [10] on their own hardware to process SRA data sets of interest as specified with a species name and SRA run ac-cession. After completion of the processing, users will have immediate access to the outputs, and after validation by the DEE2 repository server, the data sets will be available publicly. In this way, power users obtain benefit by using an established analysis pipeline and simultaneously contribute to expanding the public resource. One concern with Docker images is that they cannot be run without administrator "root" permissions, e.g., by users of a shared high-performance computing system. To address this limitation, the image can be converted for use by Singularity [11] or UDocker [12] without root permissions.
The steps involved in data processing are summarized in Fig. 2. The pipeline fetches the appropriate reference genome, annotation, and complementary DNA sequence data from Ensembl (August 2017 version) [13]. Transcriptome sequencing data sets are downloaded from SRA using Aspera. The pipeline handles both single-end (SE) and paired-end (PE) sequencing data with the exclusion of colorspace sequence data. A sample of 4,000 reads is used to perform basic checks including read and quality string format using FastQC [14] prior to extraction of fastq files with a parallel implementation of fastq-dump [15]. Skewer [16] is used to trim bases with phred quality <10 on the 3 ends and discards reads shorter than 18 nucleotides. Adapter sequences at the 3 end are detected using Minion, part of the Kraken package [17]. Adapter sequences are clipped using Skewer if the predicted adapter sequence is not present in the genome and exceeds a frequency of 2.5%. To handle nonreference 5 bases including unique molecular identifiers, a sample of 10,000 reads undergo progressive clipping of 5 ends (4, 8, 12, 20 nucleotides) followed by genomic mapping with STAR to determine the optimal number of bases to clip from the 5 end, as determined by the proportion of uniquely mapped reads. STAR [18] is then used to map all reads that pass quality control (QC) to the genome and generate gene-wise expression counts with the "-quantMode GeneCounts" (no alignment files are generated). STAR output is also used to diagnose whether the data set is strand specific. To classify a data set as strand specific, there needs to be a 5:1 strand bias in assigned reads according to STAR. This option is passed to Kallisto, which maps reads to the transcriptome to generate estimated transcript counts [19]. Gene and transcript counts, along with analysis logs and QC metrics, are zipped and transferred to the web server by sftp. The pipeline has the added ability to process users' own fastq files using the same pipeline, although the results remain private. The pipeline code is open source and available online [20]. Software versions and parameters used in the pipeline are provided in Supplementary Table S1.

Data Provided
Currently DEE2 hosts data from 9 organisms selected because they are important model organisms and have a large number of corresponding transcriptome data sets in SRA. Currently, there are >580,000 RNA-seq data sets available, with each data set corresponding to a specific SRA run number. Together the 9 species included constitute 73.5% of all transcriptome data sets available from SRA. 1 DEE2 consists of >5.3 trillion assigned sequence reads ( Table 1). The data provided include gene-wise expression counts, transcript-wise estimated counts, gene information, transcript information, summary metadata, full metadata, and QC metrics, provided as 7 separate tables in tsv format. The gene information table contains the gene accession number, corresponding gene symbol, and gene length as calculated by GTFtools v0.6.5 [21]. The transcript information file contains transcript-parent gene relationships, gene symbol, and transcript length. Gene and transcript length information will allow straightforward normalization of expression by contig length. The full metadata table contains all corresponding metadata from SRAdbV2, while the summary metadata contains only corresponding SRA accession numbers and experiment title. Moreover, analysis logs for each data set are provided. Classification of data sets by QC metrics is discussed below.

Quality Control Metrics
QC is paramount for a resource such as this. A range of quality metrics are accessible and can be viewed on the search re-sults page, which includes mean base quality scores, number of reads, alignment rates, and read assignment statistics. Detailed analysis logs are distributed alongside expression data. Summary statistics for human data sets are provided in Fig. 3. There are roughly equal numbers of runs with SE and PE sequencing (Fig. 3A). Almost all data sets are encoded in Illumina 1.9 format (also known as Sanger), and a small number of data sets with Illumina 1.5 quality encoding (Fig. 3B). The median read length is 75 base pairs ( bp) and mode is 50 bp (Fig. 3C). Most (71.3%) data sets had ≥95% of reads pass QC filtering (Fig. 3D). The median number of reads passing QC filtering was 4.6 million (Fig. 3E). The median proportion of STAR unique mapping was 82% (Fig. 3F). Median assignment proportion recorded a median of 59.4% (Fig. 3G). The majority of data sets were classified as unstranded (67.3%), with smaller numbers of runs biased towards each strand (Fig. 3H). The median proportion of reads mapped with Kallisto was 63.2% (Fig. 3I).
Although there are no definitive thresholds for what constitutes a valid RNA-seq data set, there are 2 main principles: (i) the digital nature of RNA-seq means that data sets with more reads will provide more accurate quantification, and (ii) data sets with a large proportion of reads excluded from downstream analysis will be less representative of the original sample, and suggest issues with sample quality, library preparation, or sequencing instrumentation. Using these principles, we have classified the data sets as "pass," "warn," and "fail" according to heuristics outlined in Table 2. Each rule has a numeric code, and this is provided in the search results. Because sequencing depth recommendations are larger for more complex organisms, the metrics describing integer counts are proportional to transcriptome complexity (the number of protein-coding genes as defined by Ensembl).
Furthermore, if a data set profile is substantially different to the bulk of "pass" data sets, this may be useful information for end users. To quantify this, an average gene expression profile (STAR) of "pass" data sets is calculated and each data set is compared by Pearson correlation (Methods). If the correlation coefficient is <0.5, then the data set is flagged as "warn." As new data sets are periodically added, these correlation values may vary slightly over time.
To understand why some data sets have low correlation to the bulk of "pass" data sets, we undertook an unsupervised clustering analysis of correlation in 5,808 S. cerevisiae data sets. While most Spearman correlation coefficients were >0.7, there was a small fraction <0.5 (Fig. 4A). Most data sets (5,571) were classified into 2 large clusters ( Fig. 4B; blue and light blue). The remaining 236 data sets belonged to several smaller clusters. These smaller clusters mostly contained data sets derived from nonstandard RNA-seq library construction protocols such as 3 end RNAseq (ERP004367, SRP048715, SRP048715, SRP021938), Ribo-Seq (SRP075766, SRP082147), and RNA-IP-Seq (SRP032276). One of the smaller clusters contained data sets of cells undergoing sporulation and meiosis (e.g., SRP092588, SRP061166, SRP032309). From this analysis, we can conclude that highly correlated data sets are standard RNA-seq/mRNA-seq and data sets with low correlation are mostly due to the use of nonstandard library construction protocols, but also some data sets derived from less characterized biological states (e.g., meiosis/sporulation in the case of S. cerevisiae).

Pipeline Validation
To demonstrate the accuracy of the pipeline, we performed a simulation study. Synthetic Illumina HiSeq RNA-seq data were generated from Ensembl transcripts and processed with the pipeline (see Methods). The reads per million (RPM) values were compared between the ground truth and DEE2-processed data, and Spearman correlation coefficients (ρ) were calculated ( Fig. 5; Supplementary Table S2). We observed that analyses of simpler organisms were, in general, more accurate than for more complex transcriptomes of human and mouse. Overall there was only a small improvement in accuracy in PE over SE reads. Transcript quantification results from Kallisto were less accurate than gene level quantification with STAR. On the other hand, Kallisto transcript counts collapsed into their parent gene were substantially more accurate than STAR gene counts ( Fig. 5; Supplementary Table S2), consistent with previous a previous report [22]. In a separate validation exercise, we compared authorsupplied expression count data present in GEO with corresponding DEE2-STAR counts, and quantified the similarity at the level of individual runs as well as across contrasts (see Methods for details). At the level of individual runs, there was a tight corre-lation between DEE2-derived and author-supplied RPM values, with Spearman coefficients in the range of 0.95-0.99 (Fig. 6A). After differential expression analysis with edgeR [23], genes were ranked by significance. Author-derived differential expression results were then compared with DEE2-derived differential expression results, enabling us to generate a single Spearman correlation coefficient for each contrast. Using this approach, the correlation in differential expression results between DEE2-STAR and author-supplied counts ranged between 0.55 and 0.95, with a median of 0.81 (Fig. 6B). Differential expression correlation was higher in comparisons with more replicates (ρ = 0.757, n = 9, P = 0.018). Both exercises support the validity of DEE2 data.

A Brief Meta-analysis of Yeast Gene Expression
To demonstrate the utility of DEE2 data we undertook an exploratory analysis of gene expression in S. cerevisiae. We correlated the expression of all genes in 5,808 data sets in DEE2 and performed unsupervised hierarchical clustering. This resulted 7,126 genes being classified into 10 clusters (Fig. 7A). The largest cluster consisted of 3,634 genes (light blue), and the remain- ing clusters contained between 634 and 175 genes. Gene ontology analysis was performed to detect overrepresented biological pathways in each cluster (Fig. 7B). Interestingly, each cluster was involved in different biochemical specializations. For example, the dark green cluster was overrepresented in genes involved in translation, while the nearest neighbour, light purple, was overrepresented in amino acid metabolism. Similarly the light orange cluster was enriched for genes involved in mitochondrial function and the nearest neighbour, pink, was in-volved in adenosine triphosphate metabolism. These findings illustrate one way in which DEE2 facilitates meta-analysis of gene expression.

Reuse Potential
The financial cost of generating these raw data sets is substantial. A rough estimate of the cost to generate raw data included  in DEE2 is ∼$162 million US. 2 In contrast, the estimated cost to process these data sets on Amazon EC2 infrastructure is estimated at just $97,000 but could be reduced to ∼$24,000 using offpeak resources. 3 Therefore, data aggregation efforts like DEE2 can, with a modest budget, add substantial value to these exist- ing data by enabling straightforward reuse. Another benefit of aggregation is that genome annotations are updated over time as compared to author-submitted data that remain static. To enhance the reuse potential, we have designed a simple and easy-to-use website to access the data. Users select 1 of the 9 species featured and provide either keywords or accession numbers to identify data sets of interest (Fig. 8A). The web interface provides data sets in batches of ≤500 runs. When 501-5000 matches are obtained, users can download the corresponding metadata and are given options to access expression data (see below). The search results page contains corresponding SRA accession numbers, experiment title, and keyword context if a keyword was used (Fig. 8B). The results page provides links to QC information so that users can be assured of data set for the total on demand cost is $83,697 for compute, $131 for data transfer, and $12 ,750 for persistent data volume. The off-peak cost breakdown is $11, 544 for compute, $131 for data transfer, and $12 ,750 for persistent data volume. The persistent data volume cost is valid if using 17 instances continuously for a full year. Compute price is based on eastern US ocation (Ohio) and is more expensive at other locations.  quality. If ≤500 matches are found, the QC information can be seen simply by hovering the mouse over the QC summary field (Fig. 8B). Users then tick the box of every data set they would like to download, and by hitting the "Get Counts" button, the data sets are downloaded. The searching and retrieval steps for the example depicted in Fig. 8 took 13 seconds. Fig. 8C demonstrates how data are delivered to end users: as a zip archive containing tab-separated expression count, contig information, metadata, and QC information. The web server is limited to fetching 500 data sets at a time. To enable easy access to large data sets we provide zip "bundles" for each project with ≥200 runs that have been fully processed by DEE2; there are 425 such bundles as of 11 January 2019 [27].
Because R is the main language for downstream statistical analysis of RNA-seq data, we also provide an R package getDEE2 to obtain DEE2 data. In the example shown in Box 1, data sets belonging to an experiment with GEO series GSE33569 are obtained. Transcript-wise counts can be aggregated to gene-level counts with a single command (Tx2Gene). In this case, a search for "fatty acid" for S. cerevisiae yields 20 hits. Mouse pointer hovering over the QC summary field shows the key quality criteria of the processed data sets. (C) Users can select the data sets to download by ticking the boxes and hitting the "Get counts" button at the bottom of the page. The downloaded zip file (C left side) contains detailed processing logs of each run, along with a matrix of gene-wise counts, transcript-wise counts, metadata information, QC information, and information on transcripts and genes. The gene-wise count matrix and QC information are shown in C, middle and right side, respectively.
For power users, bulk data dumps are available and will be of use to researchers wishing to do wholesale meta-analyses, as 3 studies already have [28][29][30]. Irrespective of the method of acquisition, DEE2 data are compatible with many different downstream applications including R/Bioconductor [31,32], Degust [33], and Galaxy [34].

Conclusion
DEE2 provides a unique framework and user-friendly resource of processed RNA-seq data that alleviates many of the bottlenecks researchers currently face with analysis of public RNA-seq data. Our testing shows that DEE2 and Degust enable analysis of public RNA-seq data on mobile devices such as smartphones. Bulk data provided by DEE2 are a useful starting point for researchers performing meta-analyses of RNA-seq data.

Analysis of correlation
Correlation of a data set to the bulk of other high-quality data sets could be a useful metric to filter by when performing metaanalyses. To make this a tractable calculation, first a mean gene expression profile is generated using STAR gene expression data from ≤10,000 randomly selected data sets that pass all other QC checks. Next, each data set is compared to the average pass profile and a Pearson correlation coefficient is calculated. As new data sets will be added to DEE2 regularly, the average pass profile will vary slightly over time and as such, the correlation coefficient is calculated to 2 significant figures only.
In the analysis of data set correlation (Fig. 4), the Spearman correlation of all 5,808 S. cerevisiae data sets classified as "pass" and "warn" was calculated in a pairwise fashion (data as of 3 January 2019). Clustering was performed using the hclust function in R with the complete linkage method, followed by tree cutting at 0.375 of the maximum branch length. The heatmap.2 package was used to generate the heat map. In the analysis of gene expression correlation (Fig. 7A), the S. cerevisiae data were transposed prior to correlation analysis, then tree cutting was Box 1: An example of obtaining gene and transcript expression data sets using the R functions (GEO series: GSE33569). The Tx2Gene function is used to aggregate (sum) transcript counts to gene-level counts. performed at 0.17 of the maximum branch length. The clusters obtained were subjected to gene ontology analysis at the level of biological pathways using the enrichGO tool in the cluster-Profiler package version 3.8.1 (clusterProfiler, RRID:SCR 016884) [35]. Only the top 3 pathways for each cluster were plotted.

Pipeline validation using simulated data
To validate the accuracy of the DEE2 pipeline, we generated Illumina HiSeq2500-like sequence reads from Ensembl complementary DNA sequences using ART (v2016-06-05) [24] with a defined seed (1,540,165,885) and a uniform fold coverage of 2. Read lengths were 50 and 100 bp in SE and PE format, respectively. The read sets were processed with the DEE2 pipeline and the observed expression data were compared with the ground truth, using Spearman correlation of library size normalized profiles in RPM as an indicator of accuracy. For Kallisto-based analysis, estimated transcript counts "est counts" were used. Transcriptestimated counts were totalled for each parent gene to generate gene-wise expression counts. These analyses were performed for all 9 organisms currently included in DEE2.

Pipeline validation using public data
Another way to validate the accuracy of DEE2 data is to compare with author-submitted results available on GEO. We searched for studies that reported expression data as raw counts with official gene names or Ensembl accession numbers, ≥2 replicates, and acceptable read depth and genome mapping rate. Authorsupplied counts were obtained from GEO for the data sets listed in Table 3 [36][37][38][39][40][41][42][43][44]. Spearman correlation of RPM values was used to quantify the similarity of DEE2 and author-supplied data at the level of individual runs. To determine the similarity in differential expression results, the same edgeR v3.22.3 [23] analysis was performed in parallel on author-supplied counts and DEE2 counts. The runs defined as control and case for each experiment are listed in Table 3. To rank genes by significance in differential expression, the sign of the fold change was multiplied by the negative log2 P-value. Spearman correlation analysis was used to quantify the similarity in differential expression results using these 2 data sources. ceives funding from the National Health and Medical Research Council-European Union Collaborative Research Grants Scheme (APP1075563).

Author contributions
M.Z. and A.E.-O. conceived and designed the study. M.Z. and A.K. wrote the computer code. M.Z. coordinated data processing and drafted the manuscript. All authors read, revised, and approved the final manuscript.