RNA-bioinformatics: Tools, services and databases for the analysis of RNA-based regulation

The importance of RNA-based regulation is becoming more and more evident. Genome-wide sequencing efforts have shown that the majority of the DNA in eukaryotic genomes is transcribed. Advanced high-throughput techniques like CLIP for the genome-wide detection of RNA-protein interactions have shown that post-transcriptional regulation by RNA-binding proteins matches the complexity of transcriptional regulation. The need for a specialized and integrated analysis of RNA-based data has led to the foundation of the RNA Bioinformatics Center (RBC) within the German Network for Bioinformatics Infrastructure (de.NBI). This paper describes the tools, services and databases provided by the RBC, and shows example applications. Furthermore, we have set up an RNA workbench within the Galaxy framework. For easy dissemination, we offer a virtualized version of Galaxy (via Galaxy Docker), enabling other groups to use our RNA workbench in a very simple way.


Motivation
Genome-wide sequencing efforts have revealed that a majority of DNA in eukaryotic genomes is pervasively transcribed. Non-coding RNAs and RNA-protein interactions are important parts of cellular regulation that were ignored at first but have received an increasing level of attention over the past decade. While the exact numbers, and even the magnitude, of functional transcripts, regulators and interactions are a matter of ongoing discussion, they reflect the current challenge for the analysis of whole transcriptome data.
The identification of new classes of regulatory RNAs such as microRNAs (miRNAs), or the genome-wide identification of RNA-protein interactions, which has been enabled by the development of new technologies such as cross-linking and immunoprecipitation (CLIP) methods, suggests that the complexity of post-transcriptional gene regulation is comparable to transcriptional gene regulation. The human genome encodes hundreds to thousands of miRNAs and more than 1000 RNA-binding proteins (Medenbach et al., 2011;Baltz et al., 2012;Gerstberger et al., 2014;Brannan et al., 2016;He et al., 2016;Castello et al., 2016). Along with such profiling efforts, a picture has emerged that many human diseases are caused by or linked to post-transcriptional gene regulation. Examples include not only rare genetic disorders but cover the entire spectrum of cardio-vascular diseases, cancer, and neurodegenerative disorders (for recent reviews see Kapeli and Yeo, 2012;Darnell and Richter, 2012;Kong and Lasko, 2012;Ibrahim et al., 2012;Rosina and Hurst, 2017;Schmiedel et al., 2016). With increasing evidence that non-coding RNAs are also involved in epigenetic regulatory control, it is clear that RNA biology is of vital, newly emerging importance for research not only in basic molecular biology but also for medical and disease research. Consequently, many of the existing or newly founded centres for common diseases have great need to develop or get access to computational tools and databases that capture and predict regulation by RNA or RNA-protein interactions. See Fig. 1 for an overview of functions and associated bioinformatics services.
To name only a few functions, non-coding RNAs can regulate imprinting by modulating chromatin structures, act as guide RNAs for protein complexes, form scaffolds for protein-RNA complexes, regulate other RNAs by RNA-RNA interaction (Busch et al., 2008;Mückstein et al., 2006), function as decoys for proteins and other non-coding RNAs (Memczak et al., 2013) or act as cis-regulatory elements such as riboswitches (Wachsmuth et al., 2013). MiRNAs (Rajewsky, 2006) are an abundant class of small RNAs, each of which can regulate up to hundreds of transcripts. In total, it is estimated that 60% of all human proteins are regulated by miRNAs. With the advances in high-throughput approaches to detect binding sites of RNA-binding proteins (RBPs) such as CLIP-Seq, a plethora of new RNA regulatory mechanisms has been detected when analysing the RBPome (i.e., the network of protein-RNA interactions) (Rinn and Ule, 2014). Finally, ribozymes are an important class of ncRNAs that are often involved in the maturation of other RNA or DNA molecules.
With all these potential roles, it has become clear that the analysis of epigenetic and expression data is incomplete if RNA-based regulation is not taken into account. As a consequence, the analysis of RNA has to be integrative, combining sequencing datasets with sequence and structure analysis of RNA elements, and allowing for integration with other regulatory mechanisms such as transcription. High-throughput techniques to analyse RNA-based regulation are rapidly evolving, which gives rise to a large amount of information but also to the need to constantly adapt databases, annotations and tools.
To overcome these problems and limitations, the RNA Bioinformatics Centre (RBC) was founded within the German Network for Bioinformatics Infrastructure (de.NBI) with the following priorities: (1) to establish an integrated, easily accessible RNA analysis workbench which can be used on our own cluster or downloaded and installed on every HPC environment; (2) to work with other Bioinformatics Centers and relevant scientific communities to allow for maximal usefulness, interconnectivity, and added value of the developed infrastructure; (3) to use this infrastructure as foundation for a learning and teaching environment that fosters an awareness for the importance of RNA analysis.
In consequence, our goal in the RBC is to serve as a contact point for all RNA bioinformatics questions in Germany, ranging from initial study design, through providing protocols and infrastructure, up to developing specialised solutions for individual problems. In addition, the RBC provides specialized curated RNA-related information resources such as databases for protein-RNA interaction or tRNAs, which will be fully integrated into our workbench. Across the three locations Berlin, Freiburg and Leipzig, the joint expertise covers many if not all aspects of RNA biology of current interest, ranging from structure prediction and genome-wide annotation of conserved secondary structures, via the detection of members of specific classes of regulatory RNAs, to the interaction of RNA-binding proteins and regulatory RNAs with their targets.

Individual tools provided and maintained by the RBC
In this section, we give an overview of different services and tools that are required to analyse RNA-related data. We place our emphasis on tools that are provided by the RBC and only briefly mention other related tools. Tools and databases maintained by the RBC will be written in italics. The complete list of tools can be found under https://github.com/bgruening/galaxy-rna-workbench.

Prediction of RNA structure and detection of conserved RNA structure
Many functional RNAs require a specific structure to be formed. Very often, the so-called secondary structure (i.e., the set of Watson-Crick and G-U bonds) is well-conserved and characteristic for the function of the RNA. Prediction of the secondary structure is a well-established area in RNA-bioinformatics. The ViennaRNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures. It is also the de facto standard library for the development of RNA-based methods (Lorenz and Bernhart, 2011).
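To make the notion of a secondary structure as a set of Watson-Crick and G-U pairs concrete, the following minimal sketch predicts a nested structure by maximizing the number of base pairs (the classic Nussinov-style recursion). It is an illustration only and not the thermodynamic energy model used by the ViennaRNA Package; the function name `nussinov` and the minimum loop size of 3 are assumptions made for this example.

```python
# Allowed pairs: Watson-Crick (A-U, G-C) plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases enclosed by a pair


def nussinov(seq):
    """Return (number_of_pairs, dot_bracket) maximizing the number of nested pairs."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = maximum number of pairs that can be formed within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: position i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    left = best[i + 1][k - 1]
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + left + right)
            best[i][j] = score

    # Traceback: recover one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            if (seq[i], seq[k]) not in PAIRS:
                continue
            left = best[i + 1][k - 1]
            right = best[k + 1][j] if k + 1 <= j else 0
            if best[i][j] == 1 + left + right:
                structure[i], structure[k] = "(", ")"
                stack.append((i + 1, k - 1))
                if k + 1 <= j:
                    stack.append((k + 1, j))
                break
    return best[0][n - 1], "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAAUCCC"))
```

Pair maximization ignores stacking and loop energies, so its output should be read as a didactic approximation; energy-based predictors will often prefer a different structure.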
However, the prediction of the secondary structure is usually only a first step in a whole pipeline for the analysis of RNA-related data. Often it is required to determine the conserved secondary structure, or whether a structure is conserved at all. MARNA (Siebert and Backofen, 2007) is an early approach that solved the problem of generating multiple alignments using well-defined pairwise RNA-alignment approaches by using Tcoffee (Notredame et al., 2000) to combine the pairwise alignments. It computes multiple sequence-structure alignments considering a single fixed structure for each sequence only. ExpaRNA is a fast, motif-based comparison and alignment tool for RNA molecules. Instead of computing a full sequence-structure alignment, it computes the best arrangement of sequence-structure motifs common to two RNAs (Smith et al., 2010). The gold standard here is the Sankoff (1985) approach (and its variants) of performing a sequence-structure alignment of RNAs. One approach that is provided by the RBC is LocARNA. It is an efficient variant of the Sankoff approach that computes multiple alignments of RNAs based on their sequence and structure similarity and considers the whole ensemble of secondary structures for each RNA. Thus, LocARNA aligns RNAs with unknown structure and predicts a consensus secondary structure for a set of unaligned RNAs. LocARNA is best suited to compare several (up to about 20) structural RNAs, in particular, of low sequence similarity. Carna (Palu et al., 2010) is a tool for multiple alignment of RNA molecules based on their full ensembles of structures. Carna computes the alignment that fits best to all likely structures simultaneously. Hence, Carna is particularly useful to align RNAs with more than one stable structure, as for example riboswitches, and is able to align arbitrary pseudoknots.
The above tools detect whether there is a conserved structure; however, they do not decide whether the structure is conserved significantly enough to indicate a structural RNA. This is solved by RNAz (Gruber et al., 2010), which is a program for predicting structurally conserved and thermodynamically stable RNA secondary structures in multiple sequence alignments. It can be used in genome-wide screens to detect functional RNA structures, as found in noncoding RNAs and cis-acting regulatory elements of mRNAs.
Genomic screens produce a large set of putative RNAs; however, annotating these candidates is a critical task. One successful approach is to cluster these RNAs in order to detect RNA classes. These are RNAs that are structurally similar but do not stem from a common ancestor; a prominent example is the class of miRNAs. GraphClust (Heyne et al., 2012) is a method based on graph kernels for an alignment-free clustering of ncRNAs. It can be used to detect new ncRNA classes as well as for detecting members of known classes. It is currently the only approach capable of clustering hundreds of thousands of RNAs according to sequence and structure. Another approach that uses clustering as a means of annotation, in this case for a specialized type of RNA, namely CRISPR repeats, is CRISPRmap (Lange et al., 2013), which provides a quick and detailed insight into repeat conservation and diversity of both bacterial and archaeal systems. It comprises the largest dataset of CRISPRs to date and enables comprehensive independent clustering analyses to determine conserved sequence families, potential structure motifs for endoribonucleases, and evolutionary relationships.
Finally, one does not only want to detect new natural ncRNAs. For many applications one wants to design, either computationally or even biotechnologically, new synthetic RNAs that are putative members of an RNA class or family. RNAdesign, which is part of the Vienna RNA package, is one of the earliest yet still most widely used programs for the design of RNA sequences that fold into a given pseudo-knot free RNA secondary structure. Another successful example is INFO-RNA (Busch and Backofen, 2007), which is a web-server providing RNA designs. ANTARNA (Kleinkauf et al., 2015) is an improved design approach based on ant colony optimization that can control the GC-content.

Identification of specific regulatory non-coding RNA classes
The approaches listed above are general tools that, in theory, can be used for all RNA classes. However, optimized tools exist for specific classes such as miRNAs and snoRNAs that can make use of additional biological knowledge as well as of additional RNA-seq data. The first example is miRDeep (Friedlander et al., 2008), which is a probabilistic model that detects the presence of expressed animal microRNAs in deep sequencing data. It does this via a set of features that reflect the processing of a miRNA from primary transcript to the short mature sequence of 22 nt, such as the relative frequency of reads aligning to the mature RNA compared to other parts of the precursor. PiPMir (Breakfield et al., 2012) follows a similar idea to detect new plant miRNAs. Plant precursors can be much longer than animal precursors and can contain multiple mature miRNAs; PiPMir addresses these differences in the pathways of miRNA maturation in plants, for instance via extensive predictions of local secondary structure for precursors up to several hundred nucleotides. DARIO (Fasold et al., 2011) is a web service providing functionalities complementary to miRDeep, allowing not only the recognition of novel microRNAs but also of small RNAs derived from other types of parental RNAs such as snoRNAs and tRNAs. The tool is also being made available for plant genomes.
As stated above, one of the important ideas behind the above described tools is to combine computational RNA analysis with sequencing data. NASTI-seq  extends the formalism behind popular differential expression RNAseq tools to strand-specific protocols. It uses an explicit likelihood ratio test to identify the significant presence of overlapping antisense transcription, and consequently, candidate loci for the generation of cis-natural antisense siRNAs.

Identification of targets of regulatory RNAs based on sequence
The assignment of a ncRNA to a specific class is the first annotation task. Many small ncRNAs such as miRNAs serve as guide RNAs or act directly on their targets via RNA-RNA interactions. For that reason, target prediction relies on features extracted from RNA-RNA interactions between a small ncRNA and its target, possibly combined with additional features. PicTar (Krek et al., 2005;Lall et al., 2006) is one of the most established and successful miRNA target predictors based on sequence features of functional miRNA-target interactions. PicTar is specifically designed for miRNAs. However, other small ncRNAs also act via RNA-RNA interaction. In this case, general approaches for predicting RNA-RNA interactions can be used. RNAcofold is part of the Vienna RNA package and can predict the joint structure of two RNAs, provided that the structure is nested. However, this excludes common interactions such as kissing hairpin loops. These types of interaction can be determined by accessibility-based approaches, which combine the calculation of a duplex energy with a penalty that measures the energy required to make the interaction sites accessible in the two interacting RNAs. RNAup was one of the first accessibility-based interaction prediction tools and allows reliable predictions of RNA-RNA binding energies using an approach that is based on the ensemble of RNA structures of a sequence. It combines the energy for making an interaction site accessible with the energy of duplex formation. However, it has a complexity of $O(n^2 w^2)$, where n is the sequence length and w is the maximal width for the interaction sites. IntaRNA (Busch et al., 2008;Wright et al., 2014) is a fast accessibility-based interaction approach that reduces the complexity by applying a heuristic approach while maintaining a high prediction quality due to the use of a seed interaction.
IntaRNA has been designed to predict mRNA target sites for given non-coding RNAs (ncRNAs) like eukaryotic microRNAs (miRNAs) or bacterial small RNAs (sRNAs), but it can also be used to predict other types of RNA-RNA interactions. It combines the accessibility of interaction sites with duplex energy and is efficient enough to be used on a genome-wide scale. RNApredator (Eggenhofer et al., 2011) combines pre-computed accessibilities for the target genomes with a simplified energy model for the RNA-RNA interaction to speed up genome-wide predictions.
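The accessibility-based scoring described above can be illustrated with a small sketch: the total interaction energy is the duplex (hybridization) energy plus, for each RNA, the energy needed to open the interaction site, which is commonly derived from the probability that the site is unpaired as -RT ln(P_unpaired). The function names, the example numbers and the fixed temperature of 37 °C are assumptions for this illustration; the sketch does not reproduce the implementation of RNAup or IntaRNA, which compute the unpaired probabilities and the duplex energy themselves.

```python
import math

GAS_CONSTANT = 1.98717e-3  # kcal/(mol*K)
TEMPERATURE = 310.15       # K, i.e. 37 degrees Celsius (assumed)
RT = GAS_CONSTANT * TEMPERATURE


def opening_energy(p_unpaired):
    """Energy (kcal/mol) required to make a site accessible.

    p_unpaired is the probability that the whole site is unpaired in the
    structure ensemble of the RNA; it must lie in (0, 1].
    """
    if not 0.0 < p_unpaired <= 1.0:
        raise ValueError("p_unpaired must be in (0, 1]")
    return -RT * math.log(p_unpaired)


def interaction_energy(duplex_energy, p_unpaired_query, p_unpaired_target):
    """Duplex energy plus the accessibility penalties of both interaction sites."""
    return (duplex_energy
            + opening_energy(p_unpaired_query)
            + opening_energy(p_unpaired_target))


if __name__ == "__main__":
    # Hypothetical values: a -10 kcal/mol duplex between a mostly open query
    # site (P = 0.5) and a largely structured target site (P = 0.1).
    print(round(interaction_energy(-10.0, 0.5, 0.1), 2))
```

A fully accessible site (probability 1) contributes no penalty, whereas a site that is rarely unpaired can offset a large part of a favourable duplex energy.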
Albeit both approaches are quite successful for predicting targets of small ncRNAs, they still have a quite high false positive rate when applied genome-wide. For that reason, CopraRNA (Wright et al., 2013) computes whole genome predictions by combination of whole genome IntaRNA predictions using homologous sRNA sequences from distinct organisms, thus greatly reducing the false positive rate.

Prediction of in vivo RNA-binding protein (RBP) interactions
Hundreds of RBPs have been shown to play a role in virtually all aspects of (post-transcriptional) gene expression regulation, ranging from transcript processing, export and localization, and stability to translation (see e.g. Baltz et al., 2012;Castello et al., 2012;Ray et al., 2013). A manually curated collection of over 1,500 RBPs in human as presented in Gerstberger et al. (2014) highlights their vast number and interaction and regulation potential. For many RBPs, their direct interaction with target RNA requires more or less specific sequence motifs (Cook et al., 2011) and accessible binding sites. So far, most investigated RBPs have been shown to prefer single-stranded binding regions, although some interact with structured RNA regions (Auweter et al., 2006) or prefer a structural context such as the location within a loop.
RNA-protein interactions play a key role in the complex interactome of higher organisms rendering their interplay and underlying mechanisms an investigative challenge. Distinguishing true binding sites from sites sharing sequence and/or structure features by chance is a non-trivial task, that becomes even harder as interaction is not necessarily functional. Proteins can, besides specific binding, interact with their targets in a probing manner known as diffusional search (Mechetin and Zharkov, 2014), which further complicates interaction analysis.
Experimental investigation of RNA-protein interactions requires some knowledge of at least one of the interacting partners, be it to generate specific probes, antibodies, cell-types or substrates. Using RNA-centric methods, an RNA of interest is purified and interacting proteins or protein complexes can be identified via methods like mass spectrometry. Although this allows identification of novel RBPs, or RBPs for which antibodies are hard to come by, RNA-centric methods require the purification of enough protein mass, which means a high amount of starting material (Bantscheff et al., 2007). Purified protein, in contrast to nucleic acids, cannot be amplified, which makes RNA-centric methods challenging for low-abundance RNAs and proteins.
In vivo protein-centric methods are based on specific purification methods for the protein of interest. Antibodies which allow immunoprecipitation (IP) of the latter are most common; however, the quality and specificity of the antibody impact the quality of the results. To identify interaction partners, co-immunoprecipitated RNA is then reverse transcribed into cDNA, PCR amplified and sequenced. In contrast to RNA-centric methods, PCR amplification makes it possible to start from low amounts of starting material. In general, native and denaturing purification methods are available. RNA immunoprecipitation (RIP) preserves physiological conditions and native RNA-protein and protein-protein complexes during native purification. However, during purification the protein of interest can interact with RNAs not natively present in the same cell compartment or interact unspecifically with highly abundant RNAs, e.g. rRNAs, which can interfere with and mask specific interactions with low-abundance targets. This can be prevented by applying denaturing methods, i.e. crosslinking the protein of interest to its target RNA. Such a snapshot of interactions at the time of crosslinking prevents non-native interactions in later steps of purification. Short wavelength UV light crosslinking creates covalent bonds between aromatic amino acids of the protein and RNA nucleotides in close proximity without crosslinking proteins with other proteins. CLIP (crosslinking and immunoprecipitation) is an in vivo method utilizing UV crosslinking, followed by antibody-purification.
Several types of CLIP procedures have been proposed, e.g. HITS-CLIP (High-Throughput Sequencing of RNA isolated by CrossLinking ImmunoPrecipitation) (Yeo et al., 2009), iCLIP (Individual-nucleotide resolution CLIP) (König et al., 2010) and PAR-CLIP (PhotoActivatable-Ribonucleoside-enhanced CrossLinking and ImmunoPrecipitation) (Hafner et al., 2010). Together with recent methods like eCLIP (enhanced CLIP) (Van Nostrand et al., 2016), irCLIP (infrared CLIP) (Zarnegar et al., 2016) or hiCLIP (RNA hybrid and individual-nucleotide resolution ultraviolet crosslinking and immunoprecipitation) (Sugimoto et al., 2015), a broad range of experimental designs rely on the same principle, crosslinking protein residues and adjacent nucleotides with UV light, with varying details that affect the specific outcome. PAR-CLIP, for example, makes use of nucleotide analogs like thio-uridine or thio-guanine, which are introduced into the cell as crosslinking agents. These nucleotide analogs can be crosslinked with long-wave UV light (365 nm), which helps to circumvent the otherwise low efficiency of UV-crosslinking at 254 nm, but works only with cultured cells which readily utilize the nucleotide analogs. The biochemical details behind UV-crosslinking are not yet fully investigated, so that it remains hard to predict how many interactions might be missed completely. However, it is known that reverse transcriptase (RT) misreads crosslinked nucleotides or drops off completely, which is exploited by PAR-CLIP. In the case of thio-uridine, RT pairs the incorporated analog with guanine instead of adenine, consequently introducing T-to-C transitions in the resulting sequencing reads. These transitions can then be used to pinpoint interaction sites. iCLIP, as another example, relies on the fact that an amino acid tag left at the crosslink site after proteinase digestion causes termination of reverse transcription to pinpoint the interaction site with nucleotide resolution.
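How the diagnostic T-to-C transitions of PAR-CLIP can be located is sketched below for the simplest possible case: reads that are aligned to the reference without gaps. The input format (a list of start position and read sequence), the function names and the toy sequences are assumptions made for this illustration; real pipelines work on alignment files and also have to handle indels, strand and base qualities.

```python
from collections import Counter


def t_to_c_positions(reference_window, read, offset=0):
    """Reference positions where the reference has T and the read shows C."""
    return [
        offset + i
        for i, (ref_base, read_base) in enumerate(
            zip(reference_window.upper(), read.upper())
        )
        if ref_base == "T" and read_base == "C"
    ]


def conversion_counts(reference, aligned_reads):
    """Count T-to-C conversions per reference position.

    aligned_reads is an iterable of (start, sequence) pairs describing
    gapless alignments of reads to the reference (0-based start).
    """
    counts = Counter()
    for start, sequence in aligned_reads:
        window = reference[start:start + len(sequence)]
        for position in t_to_c_positions(window, sequence, offset=start):
            counts[position] += 1
    return counts


if __name__ == "__main__":
    reference = "ACGTTGCATG"
    reads = [(2, "GCTGC"), (3, "CTGCA"), (6, "CATG")]
    # Positions supported by several converted reads are candidate crosslink sites.
    print(conversion_counts(reference, reads))
```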
Depending on the CLIP technique used (iCLIP, HITS-CLIP, PAR-CLIP, etc.), downstream analysis requires specific algorithms to filter signal from noise. In general, the goal is to filter spurious and unspecific binding to identify true binding sites. A major challenge of many CLIP data sets is the lack of a negative control. Without the latter, a measure to distinguish true binding from background binding has to be defined. In this regard, the RBC provides software for the analysis of RBP binding sites, specifically PARalyzer (Corcoran et al., 2011) and microMUMMIE (Majoros et al., 2013). PARalyzer is a principled quantitative approach to detect RBP target sites based on a local excess of the diagnostic T-to-C transitions observed at PAR-CLIP derived interaction sites. It computes local kernel density estimates for background and binding sites to distinguish signal from noise, while simultaneously accounting for different RNA expression levels and sequencing depths. This leads to a tighter definition of locations compared to heuristics. microMUMMIE was the first approach to directly utilize PAR-CLIP data to identify in vivo targets of expressed miRNAs. It integrates PAR-CLIP data profiling RISC protein binding locations with sequence features in a multivariate hidden Markov model to predict which microRNA targeted which of the observed in vivo target sites. Both tools provide the user with CLIP-derived binding sites that can readily be used for downstream analysis.
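The idea of comparing a kernel-smoothed signal of T-to-C conversions against a kernel-smoothed background can be sketched as follows. This is a toy illustration under simplifying assumptions (a Gaussian kernel, a fixed bandwidth, unnormalized event counts, no correction for expression level or sequencing depth) and is not the PARalyzer implementation; all function names and parameters are hypothetical.

```python
import math


def smoothed_profile(event_positions, length, bandwidth=3.0):
    """Gaussian-kernel-smoothed count profile over positions 0..length-1."""
    return [
        sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in event_positions)
        for x in range(length)
    ]


def signal_positions(conversion_events, non_conversion_events, length, bandwidth=3.0):
    """Positions where the conversion profile exceeds the background profile.

    conversion_events: positions of observed T-to-C conversions (one entry per read).
    non_conversion_events: positions where a T was read without conversion.
    """
    signal = smoothed_profile(conversion_events, length, bandwidth)
    background = smoothed_profile(non_conversion_events, length, bandwidth)
    return [x for x in range(length) if signal[x] > background[x]]


if __name__ == "__main__":
    conversions = [10, 10, 11, 12]
    non_conversions = [30, 31, 31, 32, 33]
    sites = signal_positions(conversions, non_conversions, length=40)
    # Consecutive positions in `sites` would be merged into one candidate binding site.
    print(sites[0], sites[-1])
```

Smoothing lets neighbouring conversion events support each other, so that a site is called from a local excess of conversions rather than from a single position.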
Binding motif prediction. After binding sites are defined, the next step is usually the search for binding preferences of the protein of interest. Determination of preferred binding motifs is a routine task with CLIP data; identification of such a motif is, however, non-trivial.
The problem of discovering motifs without any prior knowledge of how the motifs look is described in standard bioinformatics textbooks (see e.g. Jones and Pevzner, 2004). The task is to find subsequences that occur more often than expected, i.e. that are over-represented in a given set of sequences. The motif of interest can in principle be found by aligning the input sequences and searching for conserved regions, given that it should occur in many sequences. However, motifs can consist of sub-motifs themselves and do not have to be fully conserved, as they can show some variability in their nucleotide content. Position Weight Matrices (PWMs), which assign each position in a sequence a probability for containing a certain nucleotide, can be generated from alignments. From there, the frequency of each motif in the input can be calculated and compared to the background frequency (e.g. number of corresponding motifs in genes), to derive a measure for over-representation. Many algorithms based on this or similar strategies exist, among which MEME (Bailey and Elkan, 1994) is the most widely used. It applies an expectation maximization (EM) algorithm to find the most over-represented motifs in a set of sequences and was successfully used to predict binding motifs for a set of RBPs from HTS data.
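The construction of a PWM from aligned sites and its comparison against a background frequency can be sketched in a few lines. The example assumes gap-free aligned RNA sites of equal length, a uniform background and a pseudocount of 1; the function names and the toy sites are hypothetical, and the sketch illustrates only the scoring step, not the EM-based motif search performed by MEME.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # RNA alphabet; other characters are not handled in this sketch


def position_weight_matrix(aligned_sites, pseudocount=1.0):
    """Per-position nucleotide probabilities estimated from aligned sites."""
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def log_odds_score(pwm, site, background=None):
    """Sum of log2(motif probability / background probability) over the site."""
    background = background or {b: 1.0 / len(ALPHABET) for b in ALPHABET}
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, site))


def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((log_odds_score(pwm, sequence[i:i + width]), i) for i in windows)


if __name__ == "__main__":
    sites = ["UGUA", "UGUA", "UGUC", "UGUA"]
    pwm = position_weight_matrix(sites)
    print(best_match(pwm, "AAAUGUAAA"))
```

A positive log-odds score means that the window resembles the motif more than the background; the pseudocount prevents zero probabilities for nucleotides that were never observed at a position.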
cERMIT (Georgiev et al., 2010) is a fast sequence motif identification algorithm that utilizes suffix arrays to efficiently find optimal motifs in large sequence sets (such as tens of thousands of sequences identified by chromatin or RNA immunoprecipitation experiments). It uses rank-order statistics and accounts for quantitative information for each sequence, and has also been applied to identify the most prominent miRNA seed matches in differentially expressed mRNAs. miReduce is another computational algorithm that discovers motifs in mRNAs that explain changes in gene expression, for example upon perturbation of miRNA expression.
In general, RBP binding motifs can be predicted by DNA motif finders that only consider the sequence, or by tools that also consider the RNA secondary structure. For DNA-based motif finders, accessibility in terms of structure is not a factor. The double-stranded B-form helical structure of DNA allows (sequence-specific) DNA-binding proteins to interact with its major groove. Double-stranded RNA, on the other hand, adopts an A-form helical geometry, which results in a very deep and narrow major groove and a shallow and wide minor groove, rendering it less accessible for proteins. In consequence, most RBPs are thought to prefer single-stranded RNA (ssRNA) regions for interaction. It is therefore of interest to include the accessibility of binding sites in order to correctly predict binding motifs for RBPs. MEMERIS (Hiller et al., 2006) predicts the probability of being unpaired for sequences and incorporates this single-strandedness information into MEME motif prediction, rendering it more accurate for RBP binding motif prediction.
However, accessibility of the preferred motif is not the only interesting factor to consider, as the structural context of motif embedding regions can of course influence the binding behaviour of RBPs.
GraphProt (Maticzka et al., 2014) is an advanced graph kernel-based machine learning algorithm, extracting motifs that were highly predictive for binding from a set of bound and unbound sequences. These motifs can be used to predict binding affinities and de novo binding sites that are not present in the experimental output. GraphProt is able to use both structural profiles as well as detailed 2D-structures, without the need to decide a priori about the weight of the different structural components. A main advantage is that the full secondary structure information is conserved and not just a structure profile per motif, which decreases the error-rate and can be used to identify structural preferences of RBPs with higher resolution.

Databases
With the ever-growing number of experiments detecting new RNAs or targeting RNA-RNA and RNA-RBP interactions, the need for dedicated databases collecting and curating this kind of data has emerged. Such databases make it possible to store the results of research projects in a standardized way, fulfilling two very important purposes. They guarantee centralized, long-term and easy access to the results of projects. Keeping data accessible beyond the end of a project is a crucial step for reproducibility and the advancement of a field, which is however not easy to implement for individual groups with rapidly changing personnel. Specialized databases can maintain a high quality due to manual curation. They can act as a "gold standard" to compare new results to or serve as the initial data set for advanced analysis. Without such databases many datasets could not be used to their full extent. Several of the databases are part of the European RNAcentral effort to more tightly integrate all sequence and annotation resources (The RNA Central Consortium, 2017). RBC-hosted databases make it possible to compare RNA and RBP targets for shared/unique sequence and structure features, nuclear and mitochondrial tRNA genes as well as special genomic motifs. They form the basis for many downstream analysis tasks.
doRiNA is a database for post-transcriptional regulatory elements, such as RNA:protein interactions obtained via CLIP technologies or computational predictions. Integrating data from different RNA-binding proteins, non-coding RNAs, publications and labs is key for understanding combinatorial post-transcriptional gene regulation (Anders et al., 2012;Blin et al., 2014). doRiNA (http://dorina.mdc-berlin.de) curates hundreds of thousands of post-transcriptional regulatory events and is visited by hundreds of researchers worldwide.
We have recently proposed circular RNAs as a potentially large class of post-transcriptional regulators (Memczak et al., 2013). Due to the abundance of circular RNAs across all animals and plants that have been studied so far, we are currently developing circbase (http://www.circbase.org), where we are curating, storing, and making accessible our own and other circRNA data (Glažar et al., 2014).
Transfer RNAs (tRNAs) are one of the first known classes of noncoding RNAs and are crucial for proper translation of RNAs to proteins. tRNAdb (Jühling et al., 2008) continues Sprinzl's tRNA collection (Sprinzl and Vassilenko, 2005), contains more than 12,000 tRNA genes from 577 species and 623 tRNA sequences from 104 species, and is developed in close collaboration with Rfam. Several important features of tRNAs can be extracted from the database, e.g. anticodon, amino acid, position of loop regions as well as the predicted secondary structure.
mitotRNAdb (Jühling et al., 2008) contains more than 30,000 metazoan mitochondrial tRNA genes from more than 1500 species. Mitochondria are eukaryotic organelles whose main function is the production of adenosine triphosphate (ATP). They are separate compartments within a cell and have their own genome and translation machinery. Often mtDNA is the first genome that is sequenced in new organisms, and it can already be used for phylogenetic analysis. Mitochondrial tRNAs differ significantly from cytoplasmic ones and are often studied independently; therefore, a specialized database was generated (see http://mttrna.bioinf.uni-leipzig.de/mtDataOutput).
AREsite2 (Fallmann et al., 2016) is a database for the detailed investigation of AU, GU and U-rich elements (ARE, GRE, URE) in the transcriptome of Homo sapiens, Mus musculus, Danio rerio, Caenorhabditis elegans and Drosophila melanogaster. It contains information on genomic location, genic context, RNA secondary structure context and conservation of annotated motifs. Furthermore, it includes data from CLIP-Seq experiments in order to highlight motifs with validated protein interaction. A REST interface that allows experienced users to interact with the database in a semi-automated manner is available and is also part of the RBC RNA-workbench as described in Section 5. The database is publicly available at http://rna.tbi.univie.ac.at/AREsite.

Integrated service provided by RBC
Experimental labs now generate data of a complexity that makes computational analyses an absolute necessity, but often do not have the means to employ lab members with advanced practical computational skills (such as combining tools of different provenance, compiling and installing on different platforms, etc.). We aim to remove this bottleneck by providing (1) stand-alone, platform-independent access to the applications; (2) workflows for standard analyses as well as means to adapt them to custom needs; (3) integration of new data with relevant published datasets, including expansion of existing database resources; (4) training at distinct levels, from effective tool use to a deep understanding of the algorithms.
For that reason, we aimed at the integration of tools and databases in one easily accessible and transparent RNA analysis workbench. The services include both genome annotation tools (e.g., target prediction, RNA structure analysis and comparison) as well as pipelines for the analysis of RNA-related HTS data. Our integrated systems offer a broad range of ready-to-use pipelines. As RNA-based tools are only one part of a whole analysis pipeline, we offer different workflows for standard RNA-related HTS analysis, such as the analysis of RNA-seq data and the associated determination of differential expression, or the analysis of epigenetics-related HTS data such as ChIP-seq. To illustrate how the services provided by the center can be used, we sketch a couple of examples here. A researcher working on RNA-seq data should be able to easily include expressed ncRNA transcripts using our integrated workbench. For that purpose, the researcher needs to be able to define the corresponding ncRNA transcripts from the RNA-seq data, which is a nontrivial task because reads typically do not cover the full ncRNA. For the functional annotation of the transcripts found, the researcher also needs to understand the RNA structure. Only an integrated analysis of RBP binding sites, microRNA binding, HTS structure probing and RNA structure prediction will allow a comprehensive understanding of the function associated with the RNA. A further functional analysis would also include the assignment of the transcript to ncRNA classes, the determination of homologs and the prediction of putative targets. To give another example, a scientist working on disease-related synonymous SNPs will be enabled to answer the following questions: (1) Are there binding sites of microRNAs or RNA-binding proteins that are affected by the SNP? (2) Does this enhance or decrease the affinity of binding? (3) If there are no direct binding sites, does the SNP change the secondary structure and thus influence some other binding sites? (4) Are other regulatory RNA elements affected? Currently, these kinds of questions need a lot of manual work and very specific expertise in RNA bioinformatics, and thus cannot be answered by a typical lab member with a side interest in bioinformatics analysis of high-throughput sequencing data.
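As a rough illustration of question (1), the following Python sketch checks whether a single-nucleotide change creates or destroys occurrences of a short sequence motif. The AUUUA pentamer, the core motif of AU-rich elements, serves as the example. The function names, the toy sequences and the SNP positions are hypothetical and are not part of the RBC workbench; a real analysis would rely on dedicated tools and curated binding-site data, and would also have to address binding affinity and secondary structure (questions 2 and 3), which this sketch ignores.

```python
def find_motif(seq, motif):
    """Return the 0-based start positions of all (overlapping) motif hits."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]


def snp_motif_effect(seq, pos, alt, motif="AUUUA"):
    """Compare motif hits before and after substituting seq[pos] with alt.

    Returns a dict listing the start positions of hits that are lost
    and of hits that are newly gained through the substitution.
    """
    if not 0 <= pos < len(seq):
        raise ValueError("SNP position lies outside the sequence")
    mutated = seq[:pos] + alt + seq[pos + 1:]
    before = set(find_motif(seq, motif))
    after = set(find_motif(mutated, motif))
    return {"lost": sorted(before - after), "gained": sorted(after - before)}


if __name__ == "__main__":
    # Toy RNA sequence (hypothetical) with two AUUUA pentamers.
    rna = "GGCAUUUAUUUAGC"
    # A U-to-C change at position 5 disrupts the first pentamer only.
    print(snp_motif_effect(rna, 5, "C"))
```

Reporting lost and gained sites separately keeps the two biologically distinct outcomes apart: a SNP may remove an existing regulatory element or introduce a new one.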
One of our main goals is to strengthen the awareness of the importance of ncRNAs in the analysis of biological data such as differential gene expression or transcriptomics data, and of the impact of non-coding sequence variation. Combining different data sources is already becoming a standard approach in other areas. To give an example, whole-genome methylation data has gained traction recently and is often combined with ChIP-Seq data (Gilsbach et al., 2014). However, ncRNA data is not taken into account, leading to a systematic knowledge gap. Fig. 2 puts the core tasks of RNA bioinformatics analysis into a broader context. Some of these interconnections have already been described above, such as the use of services related to RNA structure and RNA-protein interactions for genome-wide association studies, or the use of RNA-target prediction and RNA-gene detection for the analysis of RNA-seq data. The investigation of RNA-protein interactions clearly needs proteomics for the exact quantification of proteins. On the other hand, proteomics also needs RNA-target prediction and RNA-protein interaction data to answer questions related to translation, such as mRNA stability and translational efficiency. ChIP-seq data is used, for example, to investigate transcriptional regulation, and provides information that might be used to improve the prediction of RNA genes. Conversely, epigenetic modifications that are investigated by ChIP-seq or by analysing genome-wide methylation often show effects on long non-coding RNAs, which can then be investigated using our services. Finally, the use of RNA-target prediction and RNA-protein interactions has already been established in synthetic biology.

The Freiburg Galaxy Server
For the dissemination of our RNA workbench we have chosen the Galaxy platform (Afgan et al., 2016), because it allows us to set up appropriate advisory and team-based structures for an effective integration of tools across topics, and to avoid duplication of effort. Galaxy is a highly modular, flexible and extensible system that focuses on easily accessible and reproducible research. As part of this endeavor, we have invested massively in the definition and sharing of analysis workflows. Since the workbench is intended to be usable in a standard laboratory setting, we did not restrict ourselves to RNA-based tools but also integrated workflows for related tasks such as the analysis of RNA-seq data or pipelines for epigenetic research.
Since February 2013, RBC has been running a Galaxy server for high-throughput sequencing (HTS) data analysis. Since the start of this Freiburg Galaxy Server we have gained more than 500 users from a diversity of scientific disciplines, who use this service on a regular basis and ran about 1 million jobs in 2016. Reproducible research is taken seriously: every version of an application, the raw data, the executed workflows and the analysis histories are stored on a dedicated file server with a capacity of more than 100 TB and an extensive backup strategy. RBC and the Freiburg Galaxy Server are thus well positioned to deal with the challenges of big data. Moreover, group members of the RBC are experts in Galaxy development, active community members and part of a commission that ensures that Galaxy applications are functionally correct and fulfill the strict rules required for reproducibility.
The Freiburg Galaxy Server offers data analysis tools in an easily accessible, user-friendly way that requires no programming knowledge. Besides text manipulation tools and format converters, the Freiburg Galaxy instance offers tools for the analysis of HTS data from e.g. ChIP-seq, CLIP-seq, Exome-seq, genome annotation, and MethylC-seq experiments in addition to RNA-based pipelines. New tools are continuously developed and integrated with existing tools and databases as well as with standard pipelines for the analysis of high-throughput sequencing data.
While the standard pipelines cover many aspects of HTS analysis and are sufficient for a large group of users, there are many additional analysis steps and visualization techniques that can be performed on an individual level. These include comparison with publicly available data or gene and pathway enrichment analysis. Moreover, many individual experiments require a deviation from the standard protocol. This can be due to the type of experiment (e.g. CLIP-seq or MethylC-seq), the quality of the data (e.g., low coverage or different biases) or simply an unusual use case (e.g. sequencing of compartments with low RNA expression). In the definition of standard workflows we try to be as comprehensive as possible. To give an example, various tools for the analysis of RNA-seq data are available on the Freiburg Galaxy Server. First, the quality of the raw data files is checked with the tool FastQC. Preprocessing of the fastqsanger files, e.g. by trimming with Trim Galore!, is followed by mapping of the reads to a reference genome. Our Galaxy server includes several mappers, such as HISAT, TopHat2, and STAR. After read counting (htseq-count, featureCounts), differentially expressed genes are determined with the tool DESeq2 or edgeR. The output then needs to be filtered, e.g. by p-value, and sorted by fold change. In Galaxy, the results can be visualized with various bar charts, diagrams and heatmaps. All tools in Galaxy can be combined into shareable workflows, in which all parameter settings and tool versions are saved. The standard RNA-seq analysis workflow (Fig. 3) is published on Galaxy (http://galaxyproject.github.io/training-material).
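The final filtering and sorting step of such a workflow can be sketched outside Galaxy in a few lines of Python. This is only a minimal illustration, not the workflow's actual implementation. It assumes a tab-separated table with a header line and the columns `gene`, `log2FoldChange` and `padj`; the latter two names are modelled on the DESeq2 result table, whereas the `gene` column, the cutoff of 0.05 and the toy values are assumptions made for this example. Sorting by the absolute log2 fold change is one possible reading of "sorted by fold change".

```python
import csv
import io
import math


def filter_and_sort(handle, padj_cutoff=0.05):
    """Keep rows with padj below the cutoff, sorted by |log2FoldChange|.

    Rows whose padj or log2FoldChange is missing or not numeric
    (e.g. "NA") are skipped. Returns (gene, log2FoldChange, padj) tuples.
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            lfc = float(row["log2FoldChange"])
            padj = float(row["padj"])
        except (TypeError, ValueError):
            continue  # missing field or non-numeric entry such as "NA"
        if math.isnan(lfc) or math.isnan(padj):
            continue
        if padj < padj_cutoff:
            kept.append((row["gene"], lfc, padj))
    # Largest effect first, regardless of the direction of the change.
    kept.sort(key=lambda r: abs(r[1]), reverse=True)
    return kept


# Toy table (hypothetical values) standing in for a differential
# expression result exported as tab-separated text.
EXAMPLE = (
    "gene\tlog2FoldChange\tpadj\n"
    "geneA\t2.5\t0.001\n"
    "geneB\t-3.1\t0.02\n"
    "geneC\t0.4\t0.6\n"
    "geneD\t1.2\tNA\n"
    "geneE\t-0.8\t0.04\n"
)

if __name__ == "__main__":
    for gene, lfc, padj in filter_and_sort(io.StringIO(EXAMPLE)):
        print(gene, lfc, padj)
```

Within Galaxy, the same step is carried out with the generic filter and sort tools, so that the thresholds become part of the saved workflow.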
Moreover, a virtualized version of Galaxy (via Galaxy Docker) enables other groups to use our RNA workbench and to analyse data behind firewalls, e.g. for sensitive data. Docker makes it possible to wrap a complete Galaxy instance into a container that includes everything needed to run the instance. The only requirements are the basic installation and maintenance of the machine plus the installation of the Docker software. We have already developed several Docker containers for a basic Galaxy instance as well as for several extensions, which make it possible to build a Galaxy instance of a specific "flavour", i.e., a Galaxy instance containing all necessary tools to handle certain types of experiments such as RNA-seq or ChIP-seq. The Freiburg Galaxy group also offers a comprehensive set of training material online for self-study and invites the community to contribute to it (https://github.com/galaxyproject/training-material).

Concluding remarks
With the recent advent of high-throughput RNA-based methods such as CLIP-seq, RIP-seq, ChIRP-Seq or Shape-Seq, the investigation of RNA-based regulation has become a central topic in molecular biology research. The RNA workbench curated by the RNA Bioinformatics Center provides a comprehensive set of tools and workflows for the analysis of this type of data. The integration of these tools in the Galaxy framework allows easy access to our RNA workbench. However, the technology in this field is developing rapidly and novel protocols are being established at an increasing rate. Our current work therefore includes RNA-centric annotation efforts that take RNA processing steps into account (Mukherjee et al., 2016), sequence/structure integrated motif finding, the extension of RBP peak callers to new protocols, and the detection of RNA modifications. As a specific example, profiling the mRNA portions that are covered by translating ribosomes, so-called RiboSeq, is rapidly gaining in popularity and therefore motivated us to develop a new dedicated computational approach (Calviello et al., 2015). We also address the need for new approaches that integrate RNA secondary structure information into RBP binding site prediction (Fallmann et al., 2017). Ultimately, an important task for the future will be the design and development of novel analysis tools and their integration into our workbench to keep pace with technological progress.