Recent advances in RNA folding

Secondary structure is the natural level of coarse in the realm of nucleic acid structures. It forms a conceptually important intermediate level of description and explains the dominating part of the free energy of structure formation. Secondary structures are well conserved over evolutionary time-scales and for many classes of RNAs evolve slower than the underlying primary sequence. Given the close link between structure and function, secondary structure is routinely used as a basis to explain experimental ﬁndings. Recent technological advances, ﬁnally, have made it possible to assay secondary structure directly using high throughput methods. From a computational biology point of view, secondary structures have a special role because they can be computed eﬃciently using exact dynamic programming algorithms.


Introduction
Structure, in particular evolutionarily conserved structure is an excellent predictor of biological function. This is true for all classes of biopolymers, including  proteins and nucleic acids. The physics of structure formations, however, differs substantially between proteins and nucleic acids. The dominating process 5 in protein folding is global, driven by hydrophobic forces. RNAs, on the other hand, exhibit a hierarchical folding process, where base pairs and thus helices, are rapidly formed, while the spatial arrangement of complex tertiary structures usually is a slow process.
RNA secondary structure elements (see Fig. 1 for an overview) are formed 10 via intramolecular interactions of nucleotides. Such interactions form base-pairs via hydrogen bonds between corresponding nucleotides, enforcing restrictive local geometries. The standard set of RNA base-pairs (AU,GC) is known as Watson-Crick-base-pairs , named after the famous discoverers of DNAs doublehelical structure [1]. GC-base-pairs can form three hydrogen bonds between 15 their Watson-Crick edges, while AU-base-pairs can only form two. This is important considering their energy contributions, which is higher for GC-than for AU-base-pairs . The main part of the interaction energy, however, is con-tributed by the stacking interaction of the π-electron systems of the aromatic rings of the nucleobases. These energy contributions are large compared to the 20 effects of hydrogen bonding. As a consequence, almost all RNAs form highly stable, well-defined secondary structures, while protein structures often remain flexible or are only marginally stable at room temperature [2].
At a more detailed level, other interactions between nucleotides beyond canonical base-pairs contribute to structure formation. Most prominently, GU 25 wobble-base pairs regularly appear in native RNA structures. RNA bases not only interact via the "standard" Watson-Crick-edge. Instead, they can also form bonds between their Hoogsteen-or CH-edge and their Sugar-edge. These edges even allow the formation of base-pairs between three bases at once, known as base triplets, influencing the stability of helices and tertiary as well as quaternary 30 structures. Long range interactions like pseudo-knots or kissing hairpins also contribute to RNA secondary structure formation. This form of intramolecular base-pairing happens when a stem or loop region interacts with another non-adjacent stem or loop region.
In this contribution we provide a short overview of the RNA folding al- 35 gorithms and recent additions and variations. We briefly introduce current extensions beyond the basic secondary structure model and address methods to align, compare, and cluster RNA structures. The contribution ends with a tabular summary of the most important software suites in the fields, many of which are already integrated in the Galaxy-RNA-workbench [3]. 40

Basic Secondary Structure Prediction Algorithms
The dominance of base stacking and loop entropies as energetic contribution and the restriction to a single interaction partner enables a purely combinatorial description of RNA (and DNA) secondary structures, and thus to completely ignore both, the atom-scale details and the actual spatial embedding of the 45 molecule. Formally, an RNA secondary structure is simply a (labeled) graph whose nodes represent entire nucleotides and whose edges denote base pairs, so that 1. edges are formed only between nucleotides that form Watson-Crick or GU base pairs; 50 2. no two edges emanate from the same vertex, i.e., from the mathematical point of view, a secondary structure is a matching; 3. edges span at least three unpaired bases; 4. if the vertices are placed in 5 to 3 order on the circumference of a circle and edges are drawn as straight lines, no two edges cross. 55 The last condition ensures that the graph is outerplanar and therefore excludes so-called pseudo-knots, to which we will briefly return below.
Over the last two decades an additive energy model known as the "Turner parameters" has become the well-tested standard model for the energy of an RNA secondary structure. It stipulates that relevant energetic contribution are 60 the stacking of base pairs, the entropic strain of loops, as well as partial stacking of unpaired bases at the ends of helical regions (usually referred to as dangling ends). These have been tabulated as function of the sequence compositions of stacked pairs and loops respectively, based on a wealth of detailed experimental evidence. 65 The dynamic programming approach to RNA secondary structure prediction relies on the fact that structures can be recursively decomposed into smaller components with independent energy contributions. In each of the decomposition steps only a single loop (or stacking of two consecutive base pairs) needs to be evaluated. Fig. 2 outlines this scheme in a graphical manner. This decom-70 position scheme has the form of a context free grammar. In the simplest model, Nussinov's maximum circular matching [5], the paired contribution C is interpreted as a single base pair around an arbitrary structure F . The more realistic Turner model requires a somewhat more complex grammar, distinguishing hairpin loops, interior loops (including stacking base pairs as a special case), and 75 multi-branch loops. Again we refer to the literature for the details.
The grammar, whose exact form depends on the structural building blocks   [4]). The hieroglyphic symbols denote different types of RNA secondary structures: F is an arbitrary secondary structures, C a structure enclosed by a base pair, and M and M 1 denote components of multibranch loops. We refer to [4] and the references therein for a detailed description of the algorithms.
that are associated with energy contributions, pertains an identical form to the computation of the minimum free energy structure [6,7], the partition function [8] or the density of states [9]. These algorithmic variants differ only in the way 80 how the individual steps of the recursion are evaluated, i.e., whether energies are minimized, partition functions are summed, or histograms are convoluted over alternative decompositions. Instead of experimentally measured parameters, one can also employ machine learning techniques to infer parameters from training sets of known structures [10]. The machine learning approaches, usu-85 ally phrased as stochastic context free grammars (SCFGs) [11], can afford more freedom in the choice of the details of the decomposition model [12].
Generic variations on the algorithms have been designed to retrieve a large collection of sub-optimal structures [13] instead of only a single representative minimum free energy structure. The exact computation of partition functions 90 not only provides access to equilibrium base pairing probabilities but also to melting temperatures and specific heat profiles [14].

Secondary Structures
Local Secondary Structures. RNAs much beyond the length of ribosomal RNAs 95 presumably do not fold into their global minimum but form locally stable structural domains. This effect can be modeled by restricting the maximal span L of base pairs. This approach not only yields more plausible structure predictions, it also drastically increases the computational efficiency. The "scanning versions" [15,16] of the standard folding recursions require only O(nL 2 ) time and 100 O(n + L 2 ) space, where n is the sequence length. This makes them fast enough for genome and transcriptome-wide approaches. In [17], optimized parameter for local folding of mRNA were introduced. On a large set of benchmark, this work could also show that local folding is preferable to global folding for mRNAs.

RNA Folding with Constraints
Although the Turner energy model provides a surprisingly accurate approx-120 imation of the RNA folding energies, it is not perfect. On the one hand, the energy parameters, which have been estimated by regression from large numbers of melting experiments, are afflicted by residual measurement errors. On the other hand, the secondary structure model is not perfect and neglects many weak interactions. As a consequence, secondary structure predictions are far 125 from perfect. It is of great interest therefore to guide the prediction procedure with external information. This can be done in two ways: either by constraining the set of allowed structures using hard constraints or by encouraging or discouraging certain structural features with the help of bonus energies. Recently a generic framework to handle both types of constraints has been incorporated 130 into the ViennaRNA Package [22].
Hard constraints either enforce or prevent pairing of a certain base or basepair , usually implemented as high energy penalties. A less harsh way to implement constraints is to reward or penalize structures that match or contradict available information via moderate pseudo-energy terms, so called soft con-135 straints. The latter can be set in proportion to some measure of confidence or signal strength.
In general, constraints become of interest in scenarios where RNAs interact, either with other RNAs, proteins, or ligands. Hard constraints can be used to model the exposition of binding sites, rendering them either accessible, or inac-140 cessible for interaction partners. Soft constraints can be used to fine-tune RNA secondary structure predictions by incorporating chemical or enzymatic "reactivities" either directly, as energy contributions/penalties, or by minimizing the deviations between predicted and measured signal. In particular, the inclusion of SHAPE reactivities has been studied in much detail by several groups. A re-145 cent addition to the ViennaRNA package implements the most commonly used options [22]. These methods have become applicable to genome-wide surveys of condition-dependent secondary structure changes. An example is a recent study of temperature dependence of structures in bacterial pathogens [23].
RNA molecules in vivo usually interact with multiple partners simultane-150 ously. These interactions can influence each other even if there is no direct competition of the same or overlapping binding sites since competitive and co-operative effects can be mediated by structural changes that can unblock or block previously paired or accessible regions. The magnitude of such effects can be computed when free energy of a RNA molecule bound by two interaction 155 partners is derived from the difference ∆∆G between the sum of the energy of both partners interacting separately and the end state and ground state [24]. A negative value of ∆∆G indicates antagonistic binding effects, a positive ∆∆G indicates cooperative effects. Such effects can efficiently be modeled using the constraint folding option in the ViennaRNA package 2.0 [4], where a pair of 160 binding sites constraints the structure ensemble by forcing these sites inaccessible.
Regional Accessibility. A parameter that is crucial for the analysis of interactions of RNAs with proteins or other nucleic acids is energy necessary to expose a local binding site region to the partner. It is of crucial importance for exam-165 ple in the context of microRNA/mRNA binding, siRNA efficiency, or bacterial sRNA function. This opening energy is conceptually the difference between the free energy of the equilibrium ensemble and the free energy of an ensemble constrained to leave a known binding site unpaired. Instead of using constrained folding framework to compute accessibilites for each individual region, it is pos-170 sible to compute accessibilities for all intervals simultaneously using a much more efficient dedicated variation of the folding algorithms [25,26].

RNA-RNA and RNA-Protein Interactions
The multi-faceted regulatory machinery of gene expression is based on the interplay between RNA and regulatory factors like other RNAs or proteins. It is 175 crucial for the balance between synthesis (transcription), translation, transport and decay of mRNAs, ncRNAs and proteins to modulate the spatial-temporal expression of RNA molecules. Hundreds of RNA binding proteins and even more miRNAs are encoded in the human genome [27,28,29], emphasizing their role in gene regulation and thus the vitality of organisms. The extreme versatility of 180 RNA molecules in terms of sequence and structure features and the complexity of RNA binding domains and binding preferences of proteins raise the need for advanced and efficient algorithms for interaction analysis.
There are basically two different approaches for determining the interaction between two RNAs that takes into account both the sequence and structure of 185 the participating RNAs. The first type of approaches defines the search for an RNA-target as the problem of predicting a common stable structure for the two interacting RNAs. This is in general an NP-complete problem [30]. Thus, existing approaches implement a partial structure model that can predict a certain  The energy required to make the interaction site accessible can be determined in a modified partition function approach for the individual sequences in cubic time [35]. RNAup [36] then combines this accessibility term for two interaction sites with the best energy for the duplex-formation for these sites, yielding an O(n 2 w 2 ) approach for target prediction, where w is the maximal length of the 215 interaction sites. The resulting score corresponds to the partition function of all interacting structures that have the same duplex. IntaRNA [37,38] reduces this runtime to O(n 2 ) for the final duplex calculation using a heuristics for the right end of the interaction site. By combining this with a seed-based approach, the prediction quality is nearly the same as for RNAup . RNAplex is an even faster 220 approach that uses a heuristic version for the calculation of accessibility. The energy required to make a region accessible is directly related to the probability that this region is free in the ensemble of all structures. This probability is now approximated in RNAplex using a Markov chain with limited memory.
One additional problem is that RNAup , IntaRNA and RNAplex predict only One possibility for taking conservation into account is to use an alignmentfolding approach as in RNAalifold (see above). Here, one predicts interactions between two different alignments [41,42]. However, as shown in [43], the interaction sites is not necessarily conserved, especially on mRNAs. CopraRNA [44] 235 does not attempt to predict conserved interaction sites, but conserved interactions by combining evidence for the interaction between two RNAs in different species. A recent benchmark on sRNA target prediction shows that CopraRNA clearly outperforms other target prediction tools. However, CopraRNA is limited to RNAs where conservation information is available.

240
While RNA-RNA interactions are directly related to RNA folding, this is not the case for the prediction of targets of RNA-binding proteins (RBPs). Instead, the approaches for finding binding sites of RBPs are more related to finding mo-tifs in a set of bound sequences, which is the data provided by SELEX and CLIP experiments. Due to the similarity between this problem and the task of find-245 ing binding sites of transcription factors, motif discovery tools like MEME [45] have been used frequently. However, as already shown, one cannot ignore the contribution of the RNA secondary structure. Many RBPs for example prefer single-stranded regions as binding sites. Memeris [46] is an extension of MEME that uses accessibility as prior for motif discovery. RNAcontext [47] uses a 250 physical energy model of motif binding that integrates structural information.
Graphprot [48] extends the idea of k-mers with gaps to graphs, which are used to represent the folding of the binding sites and its context, using an efficient graph-kernel. It is currently one of the most reliable tools for predicting binding sites from CLIP data, as shown by several experimentally verified binding pre-255 dictions [47,49]. RNA secondary structure influences on RNA binding behavior of proteins has also been successfully used to discriminate actively bound sites from a list of potential binding sites [50]. This concept was one of the key moti-

RNA Gene Finding
Homology-based RNA gene finding. RNAs with conserved secondary structure 265 are typically either short non-coding RNAs or relatively small structured domains that are part of larger transcripts. The short length, the small size of the nucleotide alphabet, and the usually relatively low level of sequence conservation conspire to make RNA homology search a difficult problem [52]. Still, the most commonly used tool is blastn and it works well in many circumstances.

270
The conserved secondary structure of many RNA families, however, provides additional information that is harnessed by infernal to improve both sensitivity and specificity of the search [53]. Instead of single sequence, it starts from a structure-annotated multiple alignment, as available for many RNA families from the Rfam database [54]. The alignment is converted into a covariance 275 model, a tree-like generalization of HMMs, which allows efficient search in genomic sequences. At present, infernal serves as the tool for RNA homology search.
De novo detection of conserved RNA structure. Our current knowledge of ncRNA genes is far from complete, however. Even in the age of efficient RNA-seq meth-280 ods, it is still of interest to find evidence for evolutionarily conserved, and thus likely functional, RNA structure (see [55,56] for recent reviews). Over the years, several types of tools have been devised for this purpose. QRNA [57] uses a fully probabilistic model and computes for a pairwise sequence alignment the posterior probabilities that it derived from a coding region, a conserved secondary 285 structure, or neither. RNAdecoder [58] is an extension of this idea that considers the superposition of RNA structure and coding region. Tools such as AlifoldZ [59], SissiZ [60], and RNAz [61, 62, 63] start from multiple sequence alignments and evaluate descriptors such as folding energies and sequence diversity to decide whether the alignment harbors a conserved structure or not. To make this 290 decision, RNAz , for example, uses a support vector machine trained from large sets of structured RNAs and shuffled decoys. cmfinder [64] considers a set of related, but unaligned sequences and their predicted secondary structures. Together with anchors of sequence similarity these are used to build CMs with the help of infernal , which in turn are used to search for further matches, which are 295 used in an interactive expectation maximization step to refine the CM, whose significance is then evaluated. A common issue with all de novo RNA gene finders is a relatively high false discovery rate that needs to be estimated by comparing the foreground data with a control, which is usually constructed by column-wise shuffling of the input alignments.  Pseudoknots. The topic of RNA pseudoknots has received much attention in the past, albeit to a large extent from a more theoretical and algorithmic point of view. There are several competing models describing the different classes of pseudoknotted structures, most of which fall into the realm of multi-contextfree grammars (MCFGs) and can be handled by dynamic programming [11,71], 325 albeit at computational complexities that are prohibitive for molecules larger than a few hundred nucleotides. Enumerative approaches for non-MCFG classes of structures are discussed in [72]. At present, the practical applicability of pseudoknots is largely limited by accuracy of energy models, which have to be estimated from small sets of examples. 330
Such expressions of nested parenthesis have a natural interpretation as rooted, ordered trees in computer science. In consequence, tree alignment and tree editing algorithms, which generalize familiar sequence alignment methods, can be adapted for comparing RNAs based on their sequence and structure [73,74].

335
Both approaches were extended to multiple RNAs following the progressive alignment scheme [73,75]. Furthermore, the tree-based approach can even be extended to pseudoknotted structures for a large variety of pseudoknot types [76,77].
Such methods however, especially if based on tree-alignment, are very sen-340 sitive to the compared secondary structures. This limits their practical use for analyzing RNAs of a priori unknown structure, since secondary structures have to be predicted from the sequence of each single RNA.
Simultaneous Folding and Alignment. The quality of secondary structure prediction increases substantially, when the structure is computed from an align-345 ment of related sequences. While sequences of high similarity can be aligned sufficiently well by traditional sequence alignment methods, such alignments tend to become inaccurate, when pairwise identities drop below about 60%; then compromising comparative structure prediction.
In such cases, the simultaneous computation of alignment and secondary 350 structure folding, originally proposed by Sankoff [78], remedies this RNA structure analysis dilemma. In practice, the original Sankoff algorithm suffers from considerable computational cost, due to its extreme time complexity of O(n 6 ) and overhead due to the computation of minimum energy in the Turner model. Measures of reliability. Prediction always includes some amount of uncertainty.
For the user it is important to get some information on how reliable a prediction is, for a detailed review refer to [84]. In case of RNA secondary structure The successful classification of known RNA-genes in families (i.e., RNAs related by evolution like tRNAs) and classes (i.e., RNAs related by same functional structure like miRNA and snoRNAs) has opened up a possibility for a structure-based annotation approach by clustering putative ncRNAs according 410 their sequence and structure to detect new RNA classes. One possibility is to directly use the score produced by sequence-structure alignment as for the hierarchical clustering of RNAs [85,79]. RNAclust [79] is a dedicated pipeline facil-itating complete clustering of RNAs. SoupViewer allows to semi-automatically analyze such a complete RNA cluster tree, easing the otherwise manual pro-415 cess of inspection for potential misclassifications. However, the problem of this clustering approach is twofold. First, determining the similarity between two different RNAs in the clustering procedure is complex (at least O(n 4 ) time).
Second, this score has to be calculated for all pairs of RNAs, which restricts its application to a small sets of RNAs, typically in range of few thousands. 420 This is circumvented by, so-called alignment-free approaches [86,87] that avoid the calculation of a quadratic number of alignments and thus are able to cluster hundreds of thousand of RNAs. GraphClust [86] even avoids any quadratic step by using an inverse index based on a structure-aware hashing approach to determine dense RNA neighborhoods. 425

Tools and suites for RNA analysis
This section presents a collection of the most relevant RNA-centric software available. Table 6 lists a selection of tools or suites of tools which are concerned with RNA secondary structure prediction, design, homology and more. This

ViennaRNA
The ViennaRNA Package consists of a C-library and a set of RNA secondary structure related standalone programs covering topics like (sub)optimal structure prediction, RNA-RNA interaction analy- mfold/UNAfold UNAFold is a comprehensive software package for nucleic acid folding and hybridization prediction. It provides methods for folding of single-stranded RNA or DNA, or hybridization between two single-strands. It can compute partition functions, minimum free energy foldings or hybridizations and also suboptimal foldings.
http://unafold.rna. albany.edu/?q=mfold [91] RNAstructure The RNAstructure suite provides many tools for tasks like the prediction of secondary structures, secondary structures common to two or more sequences as well as the prediction of bimolecular secondary structures.

[93]
Bielefeld RNA tools The RNA processing tools of Bielefeld work on RNA data and provide, among others, shape prediction/abstraction and hybridization solutions.

de/rna
[94] However, the here presented tools and suites allow to investigate virtually all aspects of RNA secondary structure and thereby affected features. Most of them are either available as suites, or have web server interfaces that allow the non-commandline affine user to benefit from their features. In a collaborative 460 effort, many of these tools have additionally been collected in the Galaxy-RNAworkbench , which makes them available in a virtualized box, featuring a Galaxy brand easy to use interface.

Acknowledgement
This work was supported by the BMBF-funded projects (FKZ 031A538A+B) 465 within The German Network for Bioinformatics Infrastructure (de.NBI).