GWIPS-viz: development of a ribo-seq genome browser

We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a clear demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on the University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.

To date, there have been two main strategies of ribosome profiling: ribosome profiling of initiating ribosomes and ribosome profiling of elongating ribosomes. For a review of the uses and advantages of each approach, see (22).
The majority of published studies using ribosome profiling provide the raw sequencing data in NCBI's Sequence Read Archive (SRA) (23). In addition, most published ribosome profiling experiments have corresponding naked mRNA controls, where total mRNA is randomly degraded to yield fragments of a size similar to ribosome protected fragments. For simplicity, here we refer to it as mRNA-seq. mRNA-seq is carried out under the same experimental conditions. It helps to take into account the differential abundance of mRNA between experimental conditions and to monitor technical biases associated with complementary DNA library generation and sequencing.
Owing to the increasing popularity of the ribo-seq technique, the number of ribosome profiling experiments is expected to increase dramatically in the near future. However, the visualization of ribosome profiling data in a browser first requires preprocessing and aligning the raw sequencing reads. As with any type of next-generation sequencing data, demands are placed on biomedical researchers in terms of time, data storage, computational knowledge and prototyping of computational pipelines (24). Web-based integrative framework tools such as Galaxy (25) provide centralized platforms for researchers to carry out next-generation sequencing alignment pipelines. However, because of decreasing costs, the coverage depth of ribo-seq and corresponding mRNA-seq data is continually increasing, resulting in ever larger data sets. Consequently, the computational resources required to process such data and the storage capacity required to hold them may not be available to many biologists. The time required to download, preprocess and align the raw data may be the most limiting factor of all for time-poor researchers.
To address these issues, we introduce GWIPS-viz (http://gwips.ucc.ie), a free online browser that is prepopulated with published ribo-seq data. The aim of GWIPS-viz is to provide an intuitive graphical interface of translation in the genomes for which ribo-seq data are available. Users can readily view alignments from many of the published ribo-seq studies without the need to carry out any computational processing. GWIPS-viz is based on a customized version of the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) (26). Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for human, mouse, zebrafish, nematode, yeast, two bacterial species (E. coli K12 and B. subtilis) and two viral genomes (human cytomegalovirus and bacteriophage lambda).

USAGE
In GWIPS-viz, users can search for their gene(s) of interest in the genome(s) for which ribo-seq data are available and view a snapshot of the gene's translation under the conditions of the experiment. Ribosome coverage plots (red) and mRNA-seq coverage plots (green) display the number of reads that cover a given genomic coordinate. Figure 1 provides coverage plots for the S. cerevisiae genome locus containing ABP140, MET7, SSP2 and PUS7 from (2) and illustrates how differential translation can be viewed in GWIPS-viz.
Users can visually identify which isoform(s) of a gene is transcribed and translated and also compare translation of the gene between different ribo-seq studies. For example, Figure 2 provides a comparison of two ribo-seq data sets obtained in different tissue-cultured human cells, HeLa (3) and PC3 human prostate cancer cells (6). It can be seen that translation of a non-RefSeq Ensembl transcript, reported based on the analysis of HeLa cell data (27), is observed in both data sets.
For the eukaryotic data sets, ribosome profiles display the number of footprint reads at a particular genomic coordinate that align to the A-site (elongating ribosomes) or P-site (initiating ribosomes) of the ribosome, depending on the study. For the prokaryotic data sets, a weighted centred approach (18) is used to indicate the positions of ribosomes. Figure 3 shows ribosome profile densities in a region of the E. coli genome that includes the gene dnaX (b0470). The ribosome density is scaled relative to the maximum density present within the displayed genomic segment. As a result, in the zoomed-out segment that includes the neighbouring genes (top), dnaX appears to be lowly expressed. However, at a range covering only the dnaX locus, it can be seen that nearly all codons in the dnaX mRNA are covered with footprints. Moreover, the coverage is sufficient to allow visual detection of decreased ribosome density downstream of the site of programmed ribosomal frameshifting, which is known to cause 50% of translating ribosomes to terminate prematurely (28,29). Figure 4 provides an example of how ribo-seq tracks for elongating and initiating ribosomes can be compared. The example illustrates the data obtained in human HEK293 cells (7) mapped to the TOMM6 and SFPQ genes. The latter gene apparently uses two sites of translation initiation for its expression.
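The effect of window-relative scaling described for dnaX can be illustrated with a minimal Python sketch. The function name and the footprint counts below are hypothetical and are not taken from GWIPS-viz or from the dnaX data; the sketch only assumes that each displayed value is divided by the maximum value in the displayed window.

```python
from typing import List


def scale_to_window_max(density: List[float], start: int, end: int) -> List[float]:
    """Return density[start:end] scaled so that the window maximum equals 1.0."""
    window = density[start:end]
    peak = max(window, default=0.0)
    if peak == 0:
        return [0.0 for _ in window]
    return [value / peak for value in window]


# Hypothetical per-position footprint counts: a modestly covered gene
# (positions 0-3) next to a highly expressed neighbour (positions 4-5).
density = [4.0, 6.0, 3.0, 3.0, 200.0, 180.0]

wide = scale_to_window_max(density, 0, 6)    # neighbour included in the view
narrow = scale_to_window_max(density, 0, 4)  # gene of interest only

print(wide[:4])
print(narrow)
```

In the wide view the first gene is compressed towards zero because the neighbour sets the scale, whereas in the narrow view the same counts span the full height of the plot.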

DATABASE DESIGN AND IMPLEMENTATION
GWIPS-viz is a customized version of the UCSC Genome Browser (26). Static Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS) files of the UCSC Genome Browser were downloaded from http://hgdownload.cse.ucsc.edu/ and rehosted on our local server, whereas C source code for the Common Gateway Interface (CGI) executables was downloaded and compiled using gcc 4.6.3. Selected parts of the MySQL databases were downloaded from the UCSC browser for the majority of organisms included in GWIPS-viz.
Because the goal of GWIPS-viz is to be a browser for ribo-seq data, rather than a mirror of the UCSC browser, some of the functionality of the UCSC browser was removed to streamline the interface of GWIPS-viz. For example, the 'clade' menu in the genome selection menu was removed. In the browser window, the link 'UCSC' was added in the top bar to allow the user to view the current genome position in the UCSC browser.
Ribo-seq and mRNA-seq tracks were added by incorporating the outputs of our RNA-seq unified mapper (RUM) (37) alignment pipeline into the MySQL database. These tracks are divided into groups by publication and data type (ribo-seq and mRNA-seq). Tracks generated from uniquely mapping reads are colour coded according to their experiment type (elongating ribosome footprints are red, initiating ribosome footprints are blue and mRNA-seq reads are green).

Raw sequencing data retrieval
Published ribo-seq and mRNA-seq data sets are downloaded from the NCBI SRA (23) and converted to FASTQ format using the fastq-dump utility (SRA Handbook). Data from replicate experiments are consolidated into one data set so as to have one browser track for each experimental condition.

Alignment pipeline
As there are no specific tools as yet for aligning ribo-seq data, RNA-seq tools are used in our preprocessing and alignment pipeline.
Depending on the study, adaptor linker sequence or poly-(A) tails are trimmed from the 3′ ends of reads using Cutadapt version 1.1 (38). Trimmed reads shorter than 25 nt are discarded.
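The logic of this step can be sketched in Python as follows. This is an illustration only and does not reproduce Cutadapt's algorithm: it uses exact matching (Cutadapt tolerates errors), and the adaptor sequence, the `min_overlap` value and the function names are placeholders introduced here. Only the 25 nt length threshold comes from the text.

```python
from typing import Iterable, List

MIN_LENGTH = 25  # trimmed reads shorter than this are discarded (from the text)


def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adaptor by exact matching (a simplification).

    A full adaptor is removed together with everything after it. If only a
    prefix of the adaptor is present at the read end, it is removed when it
    is at least min_overlap bases long (an assumed threshold).
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[: len(read) - k]
    return read


def preprocess(reads: Iterable[str], adapter: str) -> List[str]:
    """Trim each read and keep only those of at least MIN_LENGTH nt."""
    kept = []
    for read in reads:
        trimmed = trim_3prime_adapter(read, adapter)
        if len(trimmed) >= MIN_LENGTH:
            kept.append(trimmed)
    return kept


ADAPTER = "CTGTAGGCACCATCAAT"    # placeholder adaptor sequence
footprint = "ACGT" * 7 + "AC"    # hypothetical 30 nt footprint
short_fragment = "ACGT" * 5      # hypothetical 20 nt fragment

reads = [footprint + ADAPTER[:8], short_fragment + ADAPTER]
print(preprocess(reads, ADAPTER))
```

The first read carries a partial adaptor and is restored to its 30 nt footprint; the second is trimmed to 20 nt and therefore dropped.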
Contamination from ribosomal RNA (rRNA) may account for a significant proportion of the raw reads even after depletion by subtractive hybridization during the experiment. Hence, it is desirable to remove rRNA reads from the data set before performing alignments, both to increase the proportion of informative sequences and to improve alignment efficiency. To detect reads that result from rRNA contamination, trimmed reads are aligned to rRNA sequences using Bowtie (39). Bowtie version 0.12.8 is run using the -v option, allowing three or fewer mismatches between the read sequence and the reference (rRNA) sequence. All reads that align to rRNA are discarded.
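The filtering criterion (an ungapped match with at most three mismatches) can be expressed as a naive Python sketch. It is not a substitute for Bowtie: it scans every reference position, considers the forward strand only, and the function names and sequences are hypothetical.

```python
from typing import Iterable, List, Sequence


def hamming_within(a: str, b: str, limit: int) -> bool:
    """True if equal-length strings a and b differ at no more than limit positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def matches_reference(read: str, reference: str, max_mismatches: int = 3) -> bool:
    """True if the read aligns end-to-end, without gaps, anywhere on the reference."""
    n = len(read)
    return any(
        hamming_within(read, reference[i : i + n], max_mismatches)
        for i in range(len(reference) - n + 1)
    )


def remove_rrna(
    reads: Iterable[str], rrna_sequences: Sequence[str], max_mismatches: int = 3
) -> List[str]:
    """Discard every read that matches any rRNA sequence within the mismatch limit."""
    return [
        read
        for read in reads
        if not any(matches_reference(read, ref, max_mismatches) for ref in rrna_sequences)
    ]


# Hypothetical sequences, for illustration only.
rrna = "GATTACAGGCCTTAACGGTTCCAAGGTTTACGCATCGATCG"
fragment = rrna[5:30]
contaminant = "T" + fragment[1:-1] + "T"  # rRNA-derived read with up to two mismatches
informative = "A" * 25                    # read unrelated to the rRNA sequence

print(remove_rrna([contaminant, informative], [rrna]))
```

Only the read unrelated to the rRNA sequence survives the filter.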
In most eukaryotes, a proportion of ribosome footprints will span splice junctions, i.e. the read will span the 3′ end of one exon and the 5′ end of another. There is the added complexity that ribo-seq reads are typically 30 nt in length. Hence, the short-read alignment programme needs to be capable of aligning reads of 30 nt across splice junctions. We use RUM (current version 2.0.5_05) (37). RUM handles splice junctions by using the short-read aligner Bowtie (39) to align sequence reads to both the genome and the transcriptome and merging the results, before attempting to map the remaining unaligned reads using another existing aligner, the BLAST-like alignment tool (BLAT) (40).
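To illustrate why aligning to the transcriptome helps with junction-spanning reads, the following Python sketch converts a contiguous transcript alignment into genomic blocks. It is not RUM's code: the function name and exon coordinates are hypothetical, coordinates are assumed to be 0-based and half-open, and only plus-strand transcripts are handled.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open genomic interval


def transcript_to_genome_blocks(
    exons: List[Interval], t_start: int, length: int
) -> List[Interval]:
    """Map a read aligned at transcript offset t_start to genomic blocks.

    exons must be listed in transcript order (plus strand assumed).
    """
    blocks = []
    remaining = length
    offset = t_start
    for exon_start, exon_end in exons:
        exon_len = exon_end - exon_start
        if offset >= exon_len:
            offset -= exon_len  # the read starts in a later exon
            continue
        block_start = exon_start + offset
        block_end = min(exon_end, block_start + remaining)
        blocks.append((block_start, block_end))
        remaining -= block_end - block_start
        offset = 0
        if remaining == 0:
            break
    if remaining:
        raise ValueError("read extends past the end of the transcript")
    return blocks


# Hypothetical two-exon transcript with an intron between 120 and 500.
exons = [(100, 120), (500, 560)]

# A 30 nt read starting 10 nt into the transcript spans the junction.
print(transcript_to_genome_blocks(exons, 10, 30))
```

The 30 nt read is contiguous on the transcript but maps to two genomic blocks of 10 nt and 20 nt, an alignment that a purely genomic ungapped search for 30 nt reads would miss.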
Owing to the relatively short lengths of ribosome footprint reads, a read may align to two or more distinct genomic locations due to sequence similarity. RUM outputs information separately for uniquely mapping reads and non-uniquely mapping reads (reads that align to several positions in the genome). Currently we provide tracks of uniquely mapping reads only in GWIPS-viz.
RUM's output files include a Sequence Alignment/Map (SAM) file showing the alignment(s) for each read, files giving the span of each alignment in genomic coordinates (RUM_Unique and RUM_NU) and coverage files (RUM_Unique.cov and RUM_NU.cov) listing the depth of coverage of reads across the genome. The coverage files are in four-column bedGraph format. The bedGraph data are converted into bigWig format, an indexed binary format that results in higher performance (41).
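The four-column bedGraph layout (chromosome, 0-based start, exclusive end, value) can be illustrated with a short Python sketch that derives depth of coverage from alignment spans. This is not how RUM computes its coverage files; the function name and input tuples are hypothetical, and the spans are assumed to be 0-based and half-open.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coverage_to_bedgraph(alignments: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Convert (chrom, start, end) spans into tab-separated bedGraph lines.

    Each line covers a run of consecutive positions with the same non-zero depth.
    """
    # Record +1 where a read starts and -1 where it ends, per chromosome.
    events: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in alignments:
        events[chrom][start] += 1
        events[chrom][end] -= 1

    lines = []
    for chrom in sorted(events):
        depth = 0
        prev = None
        for pos in sorted(events[chrom]):
            delta = events[chrom][pos]
            if delta == 0:
                continue  # depth does not change here, so keep extending the run
            if prev is not None and depth > 0:
                lines.append(f"{chrom}\t{prev}\t{pos}\t{depth}")
            depth += delta
            prev = pos
    return lines


# Two hypothetical overlapping 30 nt reads on one chromosome.
for line in coverage_to_bedgraph([("chr1", 10, 40), ("chr1", 20, 50)]):
    print(line)
```

A text file of this kind can then be converted to bigWig, for instance with the UCSC bedGraphToBigWig utility; the text does not state which conversion tool was used.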
Ribosome profiles are generated from the RUM_Unique and RUM_NU files by counting the footprint reads whose 5′ ends align at a given genomic coordinate, after applying an offset of 12 nt to designate the ribosome P-site for initiating ribosomes or 15 nt to designate the ribosome A-site for elongating ribosomes.
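A minimal Python sketch of this offset calculation is given below. Only the offsets of 12 and 15 nt come from the text. The function name, the input tuples and the treatment of minus-strand reads (whose 5′ end is taken to be the last aligned base) are assumptions made for illustration, and spliced reads are not handled.

```python
from collections import Counter
from typing import Iterable, Tuple

P_SITE_OFFSET = 12  # initiating ribosomes (from the text)
A_SITE_OFFSET = 15  # elongating ribosomes (from the text)

Alignment = Tuple[str, int, int, str]  # (chrom, start, end, strand), 0-based half-open


def ribosome_profile(alignments: Iterable[Alignment], offset: int) -> Counter:
    """Count reads per (chrom, strand, position) after shifting each 5' end by offset.

    Assumes unspliced alignments. For minus-strand reads the 5' end is the
    last aligned base, so the offset is applied in the opposite direction.
    """
    profile: Counter = Counter()
    for chrom, start, end, strand in alignments:
        if strand == "+":
            site = start + offset
        else:
            site = end - 1 - offset
        if start <= site < end:  # ignore reads too short to contain the site
            profile[(chrom, strand, site)] += 1
    return profile


# Hypothetical 30 nt footprints from an elongating-ribosome data set.
reads = [
    ("chr1", 100, 130, "+"),
    ("chr1", 100, 130, "+"),
    ("chr1", 200, 230, "-"),
]
profile = ribosome_profile(reads, A_SITE_OFFSET)
print(profile)
```

Each read contributes a single count at its inferred A-site or P-site, which gives a sharper positional signal than counting every base the read covers.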

FUTURE PLANS
We plan to expand the existing repertoire of ribo-seq tracks by integrating publicly available ribosome profiling experiments as they become available.
GWIPS-viz currently displays the positions of the ribosomes mapped to the reference genomes. In the case of eukaryotic organisms that extensively use RNA splicing, visualization of ribosome positions in GWIPS-viz could be problematic due to a large number of long introns. Therefore, visualization of ribosome positions mapped to individual RNA transcripts is among our top priorities.
We currently provide ribo-seq and mRNA-seq tracks of uniquely mapping reads only. In the future, we wish to provide a differential display that will incorporate non-uniquely mapping reads (mapping to two or more locations in the genome) with uniquely mapping reads.
We also aim to provide access to the Galaxy platform from within GWIPS-viz so that researchers who generate their own ribo-seq experimental data can preprocess and align their data with the tools provided within Galaxy and then view the alignments in GWIPS-viz.
In addition, we aim to design a track specifically for the UCSC Genome Browser, which will display whether a region is translated or not (one global track per genome for which ribosome profiling data exist). If a user is interested in further details of the data (cell type or tissue, particular condition, specific density profile), they can be found in GWIPS-viz where individual tracks for each experiment are provided.
Our overall objective is to continuously improve the service we provide in GWIPS-viz. As GWIPS-viz is under intensive development, some of the features described in this article could become outdated soon. Hence we encourage users to post their questions, comments and feedback on the GWIPS-viz forum. Furthermore, as ribosome profiling is a relatively recent technique that is still evolving and undergoing optimization, we provide forums for discussing the experimental protocol itself, its applications and analysis of the data. In this way, GWIPS-viz will not only be a centralized repository to visualize ribosome profiling data, but its forums will encourage researchers to actively engage in the establishment of quality standards for ribosome profiling that will be of benefit to the community in general.