Computational modeling of RNA 3D structures and interactions

RNA molecules have key functions in cellular processes beyond being carriers of protein-coding information. These functions are often dependent on the ability to form complex three-dimensional (3D) structures. However, experimental determination of RNA 3D structures is difﬁcult, which has prompted the development of computational methods for structure prediction from sequence. Recent progress in 3D structure modeling of RNA and emerging approaches for predicting RNA interactions with ions, ligands and proteins have been stimulated by successes in protein 3D structure modeling.


Introduction
RNA is a deceptively simple molecule. It appears to be built from four simple ribonucleotide residues (A, U, C, G) that form canonical A-U and C-G base pairs. Simple diagrams expressing this base pairing have long been used to illustrate the approximate architecture of the RNA, which is referred to as RNA secondary structure. Many RNA molecules fold spontaneously into well-defined three-dimensional (3D) structures, where information about the overall architecture is encoded in the sequence of residues (Figure 1). Yet for RNA sequences longer than about 50 nucleotides, predicting the 3D structure remains a major challenge.
RNA 3D structure formation in vivo occurs during transcription and in many cases requires interactions with other molecules, including metal ions, small organic molecules, proteins, and other RNAs. Moreover, ribonucleotide residues can form a large number of interactions other than canonical pairs. Thus, in developing software for building and analyzing 3D structures of RNA, all these complex interactions should be taken into account.

Fundamental principles versus knowledge based
Approaches to predicting RNA 3D structures have been largely inspired by developments in the field of protein 3D structure prediction, which fall between two general categories ( Figure 2): bottom up and top down. Bottom up approaches ('Greek science') employ a finite set of fundamental principles, usually from physics and chemistry, in the hopes of generating the bigger picture. Top down approaches start with existing knowledge ('Babylonian science': alluding to the Babylonian penchant for recordkeeping) and attempt to break down very complex problems into more manageable concepts that can be solved in a knowledge-based manner. Which strategy is best depends on the type of problem or question of interest.
At the very foundation of the bottom-up approach lays quantum mechanics (QM). All biological processes ultimately depend on the quantum chemistry; e.g., hydrogen bonding, enzymatic reactions, hydrolysis, etc. Hence, 'Greek science' would compel us to build everything from the 'bricks and mortar' of QM. However, because of the enormous mathematical complexity, such idealistic employment of QM is limited to a few hundred atoms even on the computers of this era. Nevertheless, the principles of QM have contributed substantially to the modeling methods discussed here and without them, there would be no 'Greek science' modeling methods to discuss. More tractable, the next level up (Figure 2) involves molecular mechanics (MM) methods, which approximate QM interactions in terms of Newtonian mechanics. For instance, QM provided the main justification for modeling of base stacking with equations for van der Waals interactions [1 ].
One popular application of the MM approach is molecular dynamics (MD), which calculates the time dependent motions of a molecule in a particular environment. A common software package for this purpose is AMBER [2]. In tractable simulations of biophysical time scales (on the order of nanoseconds to microseconds), one can observe many conformations of the molecule and variations in the respective energy values. One of the ways to depict the variety of molecular conformations and their energies in a MD simulation is to express it in terms of what is called a 'folding landscape' (Figure 2). A plot of the folding landscape shows the energy of the molecule as it progresses toward the native structure, resembling a reaction coordinate.
Calculations in MD are considerably less expensive computationally than in QM, but still very costly. This is because there can be local minima in the folding landscape where the structure is essentially trapped (in a suboptimal state) for a relatively long time. When the energy changes in the landscape are large, the landscape is often referred to as 'rugged' [3] and the molecule can become stuck for a long time in a local energy minimum. Nevertheless, because MD only makes fundamental assumptions about atoms, bonds and neighboring interactions, it is a very adaptable method for studying the plethora of interactions of biomolecules -short of chemical reactions that require making and breaking of covalent bonds. In some cases, where the QM part of the problem can be safely isolated from the MM parts, a hybrid QM/MM approach can be used (review: [4 ]).
Because of the very high costs of all atom MD simulation in a buffer of solvent and ions, a common approach is to reduce the resolution from atoms to a cluster of atoms, while retaining most of the essential physics. Groups of atoms may be treated as single interaction centers or 'pseudoatoms', so that a smaller number of elements and interactions need to be considered, and solvent molecules are treated implicitly as a continuous medium (review: [5]). This clustering of atoms follows some logical 'coarse-grained' structure, such as using one bead to represent the nucleic acid base and another bead for the sugar and phosphate. One of the earliest coarse-grained MD approaches was a one bead per residue model that has evolved into the program YUP [6]. Recent models consist of two or a larger number of beads [7][8][9][10][11], up to six or seven beads in HiRE-RNA [12]. Developing a force field to describe the interactions between the beads is, however, not as straightforward as in all-atom MD. Typically, one must use thermodynamic information or find the average behavior of a group of atoms [13]. The coarsegraining approach helps to reduce the ruggedness of the landscape (Figure 2), and hence greatly speeds up calculations in the search of the global energy minimum, while ignoring details on the way.  Knowledge-based ('Babylonian') methods take a different approach -they work largely from the answer. For example, they often turn to the Protein Data Bank (PDB) to take advantage of the rapidly growing trove of experimentally determined macromolecular structures: including proteins, nucleic acids, and various combinations thereof. Likewise, they often seize upon the burgeoning stockpile of sequence information, etc. Armed with an enormous database of structures, sequences and other accumulated information, one hopes to paw through volumes of data to find some general facts that summarize a phenomenon, regardless of its physio-chemical basis. For instance, evolutionarily related structured RNAs retain globally similar structures, despite an accumulation of mutations in the sequence, and hence knowledge of one such structure can be used to infer the structure of another without doing the whole folding simulation from scratch. In addition, the atomic-level geometrical and stereo-chemical parameters can be constrained to values commonly seen in experimentally determined RNA structures.
Leaning heavily on the side of the knowledge-based approach, there are comparative modeling methods such as ModeRNA [14] that use experimentally determined structures as 'templates' and 'mutate' the sequence according to a sequence alignment. Another comparative modeling approach implemented in MacroMolecular Builder (previously RNABuilder) uses internal coordinates with distance and torsion angles and interatomic distances from the aligned regions of the template structure as restraints for the modeling of the target sequence [15]. Models of RNA structure can be also assembled from a number of smaller fragments, such as in the fully automated method RNA Composer [16] or in semiautomatic interactive modeling programs such as Assemble [17] and RNA2D3D [18]. Comparative models generally require further optimization.
Another knowledge-based approach to derive interaction potentials is based on a statistical analysis of experimentally determined structures. Statistical potentials replace the calculation of the physical free energy by evaluating the relative frequencies of features such as the pairwise distances of atoms (bonded and non-bonded) and mutual orientations of chemical groups (e.g., torsion angles, base pairs, stacking interactions etc.). In this picture, the most frequently observed structural features are also the most probable ones.
Many methods combine the use of coarse-grained representation or fragment assembly with sampling of 24 Theory and simulation conformations and scoring the resulting structures by a statistical potential. Some methods combine elements of a statistical potential with a physical energy function. FARNA/FARFAR is an example of an adaptation of a protein 3D structure prediction method ROSETTA to RNA 3D structure prediction [19,20 ], which assembles 3D structures from very short fragments and scores them with a combination of statistical and physical energy terms. MC-Fold|MC-Sym samples RNA conformations with ribonucleotide cyclic motifs based on predicted canonical and non-canonical interactions [21]. SimRNA uses a coarse-grained representation with five beads per residue, and scores the structures with a purely statistical potential [22]. Other knowledge-based approaches consist of one or more beads per residue and employ thermodynamic parameters for interactions such as base pairing [23][24][25]. These and many other similar methods bear the elements of thermodynamics in that they use statistical counting of configurations and presume the role of temperature as a control parameter.
Some methods have been developed to sample the space of possible RNA 3D structures based on userdefined restraints, rather than to simulate the folding or produce one structural model. NAST [26] uses just one bead to describe each ribonucleotide residue, and it relies almost entirely on base pairing information. ERWIN [27] represents individual helices and loops as centers of interactions described by a statistical potential that ignores other interactions. RAGTOP [28 ] is a hierarchical graph sampling approach that uses constraints on secondary structure including pseudoknots and predicted junction topologies, and a statistical potential that assesses geometries of internal loops and radii of gyration in known RNAs. BARNACLE [29] uses a probabilistic model to sample RNA conformations with an all-atom representation based on restraints on base-pairing. Though some global regularity can be derived from the statistical arrangement of secondary structure units, the role of long-range stem-stem interactions has yet to be developed.

RNA Puzzles
Analogous to the project known as Critical Assessment of protein Structure Prediction (CASP), several groups now participate regularly in a project known as 'RNA Puzzles' to predict the 3D structures of RNA molecules [30 ,31 ]. As with CASP, the results of RNA Puzzles were a sobering experience, as it turned out that for the most part, none of the methods is highly successful at predicting 3D structures for RNA molecules much larger than 50 residues. Expectedly, in analogy to protein structure prediction, comparative models of RNA 3D structure were the most accurate; given the proper template structures existed and were correctly identified by the modelers. Secondary structure was usually predicted correctly, even in the absence of templates; however most methods performed poorly in predicting non-canonical interactions. The reasons for this failure are surely many; however, a significant contribution comes from inadequate sampling and inaccurate force fields. Non-canonical contacts depend on multibody interactions, are relatively weak, and coarse-grained methods tend to be insensitive to the delicate local environment. One strategy may be to resample local conformations in initial models generated by the coarse-grained approach using fine-grained allatom methods.
Most of the successful predictions in RNA Puzzles have depended on discovering and using various restraints within the modeling. Therefore, benchmarks related to these studies serve as a comparison of entire workflows rather than individual computational programs. The results suggest that 3D structure models are improved when sufficient restraints are included on the secondary structure and on local motifs, either based on experimental structure probing or on computational predictions (review: [32]).

RNA-metal ion interactions
Metal ions are indispensable for proper RNA folding, stability and function in various biological processes; in particular, the positive charge of metal cations is needed to compensate for the negative charge of RNA phosphate backbone, and in some ribozymes, metal ions have been found to directly mediate catalysis [33]. RNA-Mg 2+ interactions have been modeled using e.g., all-atom MD [34] and the Poisson-Boltzmann (PB) model [35], with varying success [36]. More recently, Chen's group has developed the tightly bound ion (TBI) model based on evaluating an ensemble of discrete ion distributions to specifically model the interactions of Mg 2+ in RNA in different buffer conditions [37 ].
Knowledge based methods have also been developed to predict the location of ions given RNA 3D structure; for example, FEATURE [38] and MetalionRNA [39]. The main limitation of the knowledge-based approach is that they can only predict the position of ions in environments that were contained in the training sets. Since crystallographers cannot easily detect these sites, this becomes a circular problem [40].
Modeling RNA interactions with proteins and small molecule ligands RNA rarely acts alone. For instance, many RNAs function only in complex with specific proteins. However, most knowledge-based methods typically specialize in protein or RNA structure prediction, but not both. Even though all atom MD simulation permits protein-RNA interactions in principle, in practice the force fields are not sufficiently tested. Currently available methods largely specialize in: (1) sampling the conformational space of possible orientations and conformations of the individual protein and RNA components and generating what are called poses or decoys and (2) scoring these docked models in a way that distinguishes near-native structures from non-native ones [41 ].
In the currently available methods, the structures of interacting RNA and protein partners are treated as relatively rigid; that is, folding and binding are treated as separate problems. However, this is rarely true and RNA is known to undergo significant conformational changes upon protein-RNA complex formation. Although the persistence length (stiffness) can become very large, particularly in an RNA helix [42], protein binding can induce small changes in the local conformations that propagate: yielding visible alterations that extends well beyond the binding interface. To explicitly model conformational changes is computationally demanding; hence, some strategies of protein-RNA complex modeling introduce a certain level of 'fuzziness' (review: [43]) or some hybrid of rigid body docking and flexible refinement of interacting residues.
Many protein-protein docking methods have been developed and assessed via the Critical Assessment of PRediction of Interactions (CAPRI) experiment, analogous to CASP [44]. Protein-RNA docking is a less mature field, and most existing methods were developed by adapting algorithms previously used for protein-protein docking [45,46]. The applications 3dRPC/RPDock [47] and NPDock [48] were recently developed specifically for protein-RNA binding and RosettaDock was also adapted to fulfill this function [49].
Many RNA molecules are regulated by small molecules. RNA molecules like riboswitches are known to change their conformation upon binding of ligands. RNAs are also known to be targets of small molecule drugs or drug candidates: for example the bacterial ribosome is the main target of antibiotics [50] and viral RNAs are considered as targets for inhibitors. The main objectives of computational analyses focused on RNA-ligand binding are to identify small molecules that bind the RNA receptor and to characterize the 3D structure of the receptor-ligand complex. Docking is the usual method of choice for determining the binding modes of the ligands to the receptor molecule. As in RNA folding or RNA-protein docking, scoring functions to rank small molecule binding poses can be knowledge-based or derived from first principles (review: [51]).

Conclusions
The field of RNA 3D structure prediction is rapidly progressing and we could only present a very brief overview of recent developments. Recent progress in RNA modeling has been largely inspired by successes in protein 3D structure modeling methods. Recently, efforts have turned to RNA interactions with proteins and ligands and toward including ions and solvent; however, this remains a difficult problem to solve. As the structural database of experimentally determined RNA molecules and their complexes increases, knowledge-based methods will likely play an increasingly important role in the first phase of predictions: that of building approximate models that present the most recurrent features found in similar molecules. A deeper understanding of the fundamental physics of RNA folding and interactions is required to model the details of interactions and features that are rarely seen. This knowledge remains wanting; nevertheless, though often only by the school of hard knocks, our gaps in understanding are gradually being filled.