Dial: a Web Server for the Pairwise Alignment of Two Rna Three-dimensional Structures Using Nucleotide, Dihedral Angle and Base-pairing Similarities

DIAL (dihedral alignment) is a web server that provides public access to a new dynamic programming algorithm for pairwise 3D structural alignment of RNA. DIAL achieves quadratic time by performing an alignment that accounts for (i) pseudo-dihedral and/or dihedral angle similarity, (ii) nucleotide sequence similarity and (iii) nucleotide base-pairing similarity. DIAL provides access to three alignment algorithms: global (Needleman–Wunsch), local (Smith–Waterman) and semiglobal (modified to yield motif search). Suboptimal alignments are optionally returned, and also Boltzmann pair probabilities Prða i ; b j Þ for aligned positions a i , b j from the optimal alignment. If a non-zero suboptimal alignment score ratio is entered, then the semiglobal alignment algorithm may be used to detect structurally similar occurrences of a user-specified 3D motif. The query motif may be contiguous in the linear chain or fragmented in a number of non-contiguous regions. The DIAL web server provides graphical output which allows the user to view, rotate and enlarge the 3D superposition for the optimal (and sub-optimal) alignment of query to target. Although graphical output is available for all three algorithms, the semiglobal motif search may be of most interest in attempts to identify RNA motifs.


INTRODUCTION
During much of the 20th century the structural biology community has focused attention on the study of proteins, leading to a 'protein-centric' view of molecular and cellular biology, as manifest in various protein databases and tools: 'protein sequence' databases such as SwissProt (1), PIR (2), 'protein structure' databases such as the PDB (3), SCOP (4), CATH (5), tools such as PHD secondary structure prediction (6) and DALI structural alignment (7), etc.
In this century, RNA has emerged as an important focus of the structural biology community, as evidenced by the surprising and previously unsuspected roles played by RNA in genomic regulatory processes, such as post-transcriptional regulation with micro RNAs and small interfering RNAs (8), transcriptional and translational gene regulation by allosteric conformational changes in riboswitches (9,10), ribosomal frameshift induced by pseudoknots and slippery sequences (11) and chemical modification of specific nucleotides in the ribosome. Even the peptidyltransferase reaction in peptide bond formation is catalyzed by RNA (12,13).
Within this context, the current article describes the web server, DIAL (dihedral alignment), for pairwise structural alignment of RNA from input PDB files.
Depending on the precise formulation of the problem, structural alignment of 3D protein/RNA backbone conformations is known to be NP-complete. 1 It follows that all current efficient algorithms either restrict the notion of 3D structural alignment or involve a heuristic. For protein structural alignment, DALI (7) and SSAP (Sequential Structure Alignment Program) (15) are perhaps the most widely used heuristic algorithms, where the latter has been used toward automatic classification in the CATH database (5).
With the current interest in RNA, ribonucleic acid sequence and structure databases have been established, and there is continual development of new algorithms and efficient software. For instance, Rfam (16) is an important sequence database of RNAs grouped by family (tRNA, SAM riboswitch, miRNA, etc.), while the NDB (Nucleic Acid Database) (17) is the primary repository for 3D RNA structures and the SCOR (Structural Classification of RNA) database is a derived data collection of RNA motifs (18).
There are far too many RNA alignment and motifsearching algorithms for us to properly survey the area in this article. To properly situate the contribution of DIAL, we give only the most necessary remarks. Most RNA structural alignment algorithms account for sequence and secondary structure similarity. In pioneering work, Sankoff (19) provided an important Oðn 6 Þ algorithm to compute the optimal sequence and secondary structure alignment for two given RNA nucleotide sequences. 2 Both Foldalign (20) and Dynalign (21) are important practical implementations of reasonable restrictions of Sankoff's algorithm, using the Turner nearest neighbor energy model (22,23). Quite surprisingly, in the technical report (24) Blin et al. prove that optimal pairwise alignment is NP-complete, when the input consists of two RNA sequences along with their given secondary structures. 3 It follows that the precise stipulation of the input can effect the computational complexity of RNA structural alignment, a fact that explains in part the multitude of different algorithms for structural alignment.
In (27) Dalli et al. (29) describe the program STRAL, which performs a progressive alignment of non-coding RNA using base-pairing probability vectors in quadratic time. In an unusual approach, Sato and Sakakibara (30) apply conditional random fields to determine optimal RNA alignment.
Turning to RNA 3D structural alignment, in (31) Olson describes two virtual (or pseudo-) dihedral angles, later reintroduced by Duarte and Pyle (32). The pseudodihedral angles respectively are determined by the four points C4 0 ði À 1Þ; PðiÞ; C4 0 ðiÞ; Pði þ 1Þ respectively PðiÞ; C4 0 ðiÞ; Pði þ 1Þ; C4 0 ði þ 1Þ, where P(i) respectively C4 0 ðiÞ denotes the phosphorus atom respectively 4 0 -carbon atom of the ith RNA nucleotide. The program AMIGOS of Duarte and Pyle (32) computes RNA dihedral and pseudo-dihedral angles, used in the program PRIMOS of Duarte, Wadley and Pyle (33) to compute RNA 'worms', i.e. a sequence of ; angles for the entire RNA molecule. The method COMPADRES of Wadley and Pyle (34) uses PRIMOS to detect new RNA structural motifs, such as the -turn, -turn, -loop, C2FA and hook turn (34), by the following procedure: (i) a nonredundant RNA structural data collection of 49 structures, 50 chains and 6697 nt is created; (ii) RNA worms are calculated for each of these structures, and the worms are concatenated into a single sequence; (iii) all maximal gapless matches of at least 5 nt of this sequence with itself are detected 4 and (iv) known 3D motifs are removed from the matches, and a frequency count is made of remaining matches, from which the high-frequency motifs are analyzed.
In ( In (36,37), Dror, Nussinov and Wolfson describe a cubic time RNA tertiary structure alignment algorithm, ARTS, which proceeds by a seed match and greedy global extension to approximately compute the 'largest common point set' (LCP) between phosphorus atoms of two RNA molecules. Given PDB files for two RNA molecules A and B, the program ARTS first determines 3D coordinates a 1 , . . . ; a n respectively b 1 , . . . ; b m of all phosphorus atoms from A respectively B, then applies the software 3DNA of Lu and Olson to determine base pairs of each structure. Given RMSD error bound of , ARTS determines all seed matches of 'base quadrats' 5 ði; i þ 1; j À 1; jÞ and ði 0 ; i 0 þ 1; j 0 À 1; j 0 Þ for which there is a rigid transformation (rotation and/or translation) T such that max jja i À Tðb i 0 Þjj; jja iþ1 À Tðb i 0 þ1 Þjj; È jja jÀ1 À Tðb j 0 À1 Þjj; jja j À Tðb j 0 Þjjg Since there are O(n) base pairs in an RNA molecule of length n, the computation of all seed matches is done in Oðn 2 Þ time. Subsequently, a greedy extension of seed matches approximately computes the LCP A 0 ¼ fa i 1 , . . . ; a i k g A and B 0 ¼ fb i 0 1 , . . . ; b i 0 k g B of phosphorus atoms between both RNA molecules, such that jja i x À Tðb i 0 x Þjj for all 1 x k. The extension is done 2 Sankoff provided a general O(n 3k ) algorithm to determine the optimal multiple sequence/secondary structure alignment for k RNA nucleotide sequences of length n. To the best of our knowledge, there is no publicly available implementation of Sankoff 's algorithm. 3 In other words, the nested-nested edit-distance problem of Lin et al. (25) is NP-complete. See (26) for an O(n 4 ) algorithm for a related RNA alignment problem. 4 The worm h( 1 , 1 ), . . . , ( m , m )i is defined to match the worm hð 1 ; 1 Þ; . . . ; ð 0 m ; 0 m Þi if the Euclidean distance between ð i ; i Þ and ð 0 i ; 0 i Þ is at most 258 for each 1 i m.
to maximize a score Fð'; kÞ ¼ w 1 Á k þ w 2 Á ', where there are ' base pairs and k nucleotides, for appropriate weights w 1 ; w 2 . Note that ARTS does not necessarily respect the order of nucleotides in the linear chain, and that no account is taken for nucleotide identity; i.e. there is no nucleotide bonus for GNRA tetraloops.
In (38), Mokdad and Leontis describe the program Ribostral, which analyzes an RNA 3D alignment, and graphically presents base-pair isostericities. Sarver et al. (39) describe the algorithm FR3D used for RNA motif search by computationally intensive coordinate RMSD computations to determine optimal alignment between motif and target, by using a reduced atom representation of RNA nucleotides.
In this article, we introduce a quadratic time, dynamic programming algorithm, DIAL, able to find the optimal alignment with gaps (i.e. bulges) of two RNAs taking into account sequence, structure and base-pairing information extracted from the PDB file. DIAL provides a number of features not available in other RNA pairwise structural alignment algorithms. While PRIMOS and COMPADRES compute gapless alignments of pseudodihedral angles of contiguous segments, DIAL can perform global, local and semiglobal alignment in Oðn 2 Þ time with affine gap penalty by taking into account nucleotide similarity, 6 dihedral and pseudo-dihedral angles as well as the base-pairing nature of nucleotides (0: unpaired, L: base paired with nucleotide to left, R: base paired with nucleotide to right). The program DIAL can perform alignments of 'fragmented' (i.e. non-contiguous or composite) motifs with targets, where the number of fragments is arbitrary. Since the computation of pseudodihedral angle for the ith nucleotide requires atomic coordinates of both the (i À 1)st and ði þ 1Þst nucleotide, DIAL additionally extracts atomic coordinates from 1 nt preceding the start of the region specified and 1 nt following the end of the region specified. Inaccuracies (not checked by DIAL) will occur for the first and last nucleotide in the chain of a PDB file. The web server DIAL is an important extension of the program PRIMOS; indeed, PRIMOS alignments are obtained if DIAL parameters are specified to obtain a gapless alignment of pseudo-dihedral angles. This is done by entering negative gap initiation and gap extension parameters whose absolute value is prohibitively large, and setting to 0 all parameters for dihedral angles, nucleotide similarity, basepairing nature (0,L,R). For user-specified 'suboptimal alignment score ratio' 0 p 1, DIAL returns suboptimal alignments for which SÀS 0 S p, where S denotes the optimal alignment score and S 0 denotes the suboptimal alignment score. Additionally, in quadratic time DIAL computes the partition function (41)(42)(43) for alignments, hence returns the Boltzmann pair probabilities Prða i ; b j Þ for aligned nucleotides a i , b j occurring in the optional alignment. Boltzmann pair probabilities can suggest the biological significance of portions of the optimal alignment, an idea validated for protein sequences by Vingron and Argos (44).
While ARTS is an excellent cubic time program for 'motif detection', yielding an approximation to the LCP set of phosphorus atoms, it should be noted that ARTS does not necessarily preserve linear order within the alignment; i.e. it can happen that i j < i ' and i 0 j > i 0 ' . Moreover, ARTS takes no account of nucleotide identity or similarity.
The graphical user interface of DIAL is particularly simple, in that PDB accession codes, chain IDs and starting and ending residue sequence numbers can be entered for both RNA molecules; optionally, PDB files can be uploaded. Allowing the user to finetune all parameters, DIAL is powerful, flexible and sufficiently accurate to allow the comparison of a large number of molecules for subsequent refinement by other methods.

MATERIALS AND METHODS
We computed RNA backbone dihedral angles by writing Python scripts based on the Biopython Structural Bioinformatics package (45). Six dihedral angles (, b, g, d, , ) can be defined in the RNA backbone, and one dihedral angle () describes the rotation between backbone and base. The values for all these angles are not independent, but there is a very high correlation between values of each pair of angles (35). Two additional virtual angles and , first introduced by Olson (31) and later reintroduced by Duarte, Duarte, Wadley and Pyle (33) offer a reduced but sufficient conformational description of the RNA backbone (46). To determine base-pairing status of each nucleotide, we run RNAVIEW (47) on all the SCOR RNA chains.

Algorithm
In addition to quadratic time implementations of Needleman-Wunsch global alignment (48) and Smith Waterman local alignment (49) algorithms, the DIAL web server includes an implementation of 'semiglobal' alignment (50), opportunely modified to perform motif searching for contiguous or fragmented queries. All algorithms have been extended to account for the similarity of matched nucleotides, dihedral angles 7 and base-pairing attributes. To illustrate our modification of semiglobal alignment for fragmented motifs, suppose that the query consists of two non-contiguous fragments, a 1 , . . . ; a m and a 0 1 , . . . ; a 0 m 0 , and that the target consists of the contiguous sequence b 1 , . . . ; b n . In our semiglobal alignment, there is no penalty for gaps occurring to the left of a 1 , between a m and a 0 1 and to the right of a 0 m 0 , while gaps 6 Default RNA nucleotide similarities are taken from BLAST (40); however, the user can modify nucleotide similarity, gap initiation and gap extension costs as well as other parameters. 7 A dihedral or torsion angle is determined by four points a, b, c, d in Euclidean 3D space. By taking cross products, compute the normal vectors u ! ; v ! to the plane determined by a, b, c and b, c, d. The dihedral angle is defined to be the inverse cosine of the inner product u occurring in a 1 , . . . ; a m or in a 0 1 , . . . ; a 0 m 0 are penalized; i.e. alignment of the query is semiglobal. In contrast, all gaps in b 1 , . . . ; b n are penalized, including those occurring to the left of b 1 and to the right of b n . All algorithms run in quadratic time using affine gap penalties, following Gotoh's method (51).
Our scoring function evaluates the similarity between each nucleotide of the query and target, by accounting for (i) dihedral/pseudo-dihedral similarity, (ii) nucleotide sequence similarity and (iii) base-pairing similarity. Each one of these contributions is weighted by default parameters; these parameters can be modified by the user. In particular, if the user enters parameters x; y; z respectively for the dihedral, nucleotide and base-pairing parameters, then the weight of dihedral angle contribution is x=ðx þ y þ zÞ, while that for nucleotide similarity is y=ðx þ y þ zÞ, and that for base pairing is z=ðx þ y þ zÞ. Similarly, one can modify the parameters for the seven dihedral angles and two pseudo-dihedral angles; i.e. if x 1 , . . . ; x 7 respectively y , z denote the form values for the dihedral and pseudo-dihedral angles, then the first dihedral angle weight is x 1 =ðx 1 þ Á Á Á þ x 7 þ y þ zÞ, the weight for the first pseudo-dihedral angle is Given query a 1 , . . . ; a n and target b 1 , . . . ; b m , the 'similarity' simða i ; b j Þ of aligning a i from the query RNA with b j from the target RNA is given by the weighted sum where, following BLAST default (40), the nucleotide sequence contribution Sequenceða i ; b j Þ is 1 if nucleotides a i , b j are identical (match) and À3 otherwise (mismatch), and where Here A is the set of six backbone dihedral angles (, b, g, d, , ), one dihedral angle () describing the orientation of the base, and two pseudo-dihedral angles respectively determined by the 4 points C4 0 ði À 1Þ; PðiÞ; C4 0 ðiÞ; Pði þ 1Þ respectively PðiÞ; C4 0 ðiÞ; Pði þ 1Þ; C4 0 ði þ 1Þ. 8 BasePairða i ; b j Þ is a penalty if the base-pairing attribute of a i and b j differ. Although we have focused discussion on the motif search application of DIAL, global and local 3D structural alignment is supported. Unless the parameters are set to be permissive, local alignment tends to report very small alignments of only a few nucleotides. Full details of the algorithm and extensions will be given in a forthcoming article. Following Clote, Ferre`and Straubhaar (42,43), we additionally compute the Boltzmann pair probabilities within an optimal alignment (41) by computing a 'forward' Boltzmann partition function where A ranges over all possible alignments of a 1 , . . . ; a i with b 1 , . . . ; b j , R is the universal gas constant and T is absolute temperature. 9 In the inductive case, the forward partition function FZði; jÞ can be computed by where for notational simplicity we have assumed a linear gap penalty g. 10 In a similar fashion, the backward Boltzmann partition function BZ can be computed, where where A ranges over all possible alignments of a i , . . . ; a n with b j , . . . ; b m . The Boltzmann probability Pr½ða i ; b j Þ that a i will be aligned with b j is then It should be stressed that due to the complexity of RNA 3D structural alignment, one cannot hope that a quadratic time algorithm such as DIAL be highly accurate. However, by using DIAL to compute potential target regions predicted to align well with the query, one can subsequently apply a very accurate, but computationally intensive RNA structural alignment algorithm, such as FR3D (39). We believe that this will be the primary application of DIAL.
The input form for DIAL is shown in Figure 1. The user must either upload or give the four character alphanumeric PDB accession code for both query and target RNA structures, and indicate the chain identifier for each (underscore if the PDB file contains no chain identifier). Optionally, the starting and ending residue sequence number for the query and/or target structure can be given. Default parameters for dihedral and pseudodihedral angle contributions to the alignment may be used or modified. The user can choose between the semiglobal motif finding algorithm (default), or global or local alignment. Three temperatures may be chosen for the Boltzmann pair probability computation to determine highly significant portions of the alignment. Figure 2 displays the output of DIAL, when executing semiglobal alignment of query 1J5A (chain A, nucleotides 2530-2536) with target 1HR2 (chain A). Hot links are provided for the alignment, dihedral and pseudo-dihedral angles (and sugar pucker), Boltzmann probabilities and superposition; alignment and a zoomed close-up of the superposition are depicted in Figure 3.

RESULTS
To illustrate the difference in alignment accuracy of DIAL and ARTS, we applied the motif search algorithm to two transfer RNA structures. The query structure was 1ASZ:R from residue sequence number 620 to 660 and the target structure was 4TRN. While DIAL correctly aligned this 41 nt portion of aspartyl-tRNA 1ASZ with the corresponding portion of 4TRN, the alignment produced by ARTS is incorrect; see Figure 4. For certain examples, this comportment of ARTS is not surprising, since it was designed to compute the largest collection of phosphorus atoms which are -close to each other.
To assess the accuracy of the DIAL web server, we computed receiver operating characteristic (ROC) curves (52), which depict the trade-off between sensitivity (true positive rate) and specificity (1 minus false positive rate). For this assessment, we used the SCOR database (18).
SCOR XML dumps were parsed in order to locally reconstruct the SCOR database. Our starting structure data set included all RNA motifs in the SCOR database; i.e. 440 families and altogether 9850 motifs. Of 440 SCOR families, 82 had both fragmented and non-fragmented members while 62 had only fragmented members. Note that even if the number of SCOR families having fragmented members is relatively small, they are often the most populated families. There were 5110 members in families having 2 fragments, 3 members in families having 3 fragments, 1 member in a family of 4 fragments Figure 1. DIAL input form. The user must either upload query and target PDB files, or give the four character PDB code, and additionally indicate the chain identifier (underscore indicates a blank chain identifier in the PDB file). By modifying the parameter , the user can appropriately weigh the sequence versus dihedral angle contribution to the alignment score. By default, the alignment takes into account only the two pseudo-dihedral angles ; and the base-pairing similarity. and 1 member in a family of 6 fragments. We filtered the SCOR collection to eliminate the following motifs: (i) shorter than 3 nt; (ii) composite motifs where fragments belong to different chains and (iii) no range is specified. (i.e. starting and ending position in the chain.) After this selection, we accepted only SCOR families having more than one remaining motif. This step produced 136 families and altogether 5619 motifs. Of these 136 families, 89 contained only local (contiguous, non-fragmented) motifs, 41 contained only composite (fragmented) motifs, and 6 contained both local and composite examples. Of a total of 5619 motifs, 2836 are composite, all formed by two fragments.
Since SCOR includes RNA structures which may be identical or very similar but have different PDB accession codes, for each SCOR family we produced a sequence non-redundant subcollection using Algorithm 2 described in (53). 11 In this process, we additionally discarded structures shorter than 5 nt and having poorer resolution than 3.5 Å . Our final, filtered, non-redundant data set extracted from SCOR database thus consisted of 78 families and altogether 359 motifs. The reason there were so few remaining motifs is due to the fact that the SCOR database has many identical or very similar motifs occurring in different RNA molecules. Figure 5 and 6 present ROC curves respectively for contiguous and fragmented queries. These are computed as follows. For each pair (S 1 , S 2 ) of structures in the non-redundant data collection obtained from SCOR as indicated above, we computed the DIAL similarity where seqSim represents nucleotide sequence similarity, strSim represents pseudo-dihedral ; angle similarity and Figure 2. DIAL screen output, when applying the motif detection (semiglobal alignment) algorithm of query 1J5A (chain A, nucleotides 2530-2536) with target 1HR2 (chain A). The target (respectively query) conformation is depicted in the upper (respectively lower) left corner, along with hot links to the computed dihedral angles. The superposition of optimal query to target alignment is depicted on the right. The images are produced by using a JMOL applet, hence allow the user to rotate, zoom in, zoom out and choose a variety of molecule representations. To the right of this output (not shown) is a pull-down tab for suboptimal alignments, provided the user entered a non-zero parameter for suboptimal alignment score ratio. 11 Algorithm 2 constructs a sequence non-redundant data set as follows. Given list L of sequences, determine BLAST similarity of first sequence to all others, removing from L all homologous sequences (with E-value above a given threshold). Take the second sequence from the filtered list L, determine BLAST similarity with all successive sequences from L, removing those which are homologous, etc. In this fashion a set of sequences is obtained, guaranteed not to be pairwise homologous. In our implementation, we used default BLAST values for nucleotide match, mismatch and gap, set the threshold to be the E-value 0.001. bpSim represents base-pairing similarity. 12 Computations were performed for weights w from 0; 0:2; 0:4, . . . ; 0:8; 1:0. Positives (respectively negatives) were considered pairs (S 1 , S 2 ) from the same (respectively different) SCOR class. This allowed the computation of ROC curves displayed in Figure 5. For the most part, pseudo-dihedral angle similarity is much more important for proper SCOR classification than nucleotide sequence similarity. These data gave rise to the ROC curves shown in Figure 6, which displays overlaid curves with different weights w for the sequence versus structural alignment, w ¼ 0; 0:2; 0:4, . . . ; 0:8; 1:0. Table 1 presents the area under ROC curves, denoted by AUC, for both nonfragmented and fragmented motifs, using the data from the previously discussed ROC curves.

DISCUSSION
In this article, we have described the DIAL web server, which provides access to global, local and semiglobal alignment of RNA structures, presented as PDB files. We believe the semiglobal alignment to be of particular interest as a preprocessing step for RNA motif detection.
The DIAL web server performs a quadratic time, dynamic programming alignment, taking into account similarity of nucleotide identity, (pseudo-) dihedral angles and base pairing in the secondary structure. The algorithm is fully customizable by allowing the user to stipulate different weights for angles and base pairs.  Table 1 or text for fuller description of parameters used.) This figure depicts ROC curves for contiguous queries, consisting of an uninterrupted linear sequence of nucleotides. Figure 6. Average ROC curves when using the semiglobal DIAL algorithm to align query motifs from the SCOR database with targets from the SCOR database. The x-axis represents false positive rate (1 minus specificity), while the y-axis represents true positive rate (sensitivity). Overlaid curves represent different weighting of dihedral angle versus sequence contributions with weights w ¼ 0; 0:2, . . . ; 0:8; 1:0. (See Table 1 or text for fuller description of parameters used.) This figure depicts ROC curves for fragmented queries, representing 3D motifs consisting of two or more interrupted linear sequences of nucleotides. In the SCOR database, most fragmented queries consist of two contiguous linear sequences. (See text for fragment breakdown for the SCOR databse.) Table 1. Area under ROC curve (AUC) for ROC curves displayed in Figures 5 and 6. ROC curves were created for a non-redundant data set extracted from the SCOR database-see Section Data set in reference (20). AUC is computed for different values of weight parameter w for both non-fragmented and fragmented queries, for w ¼ 0; 0:2, . . . ; 0:8; 1:0. This corresponds to setting parameters on DIAL web form as follows: 'dihedral' ¼ w, 'sequence' ¼ (1Àw), 'base-pairing' ¼ 1. With these settings, DIAL alignments give same weight to sequence/structural similarity and base-pairing similarity. By varying weight w, we obtain a trade-off between sequence and dihedral angle similarity. With these settings, DIAL appears to perform slightly better on fragmented motifs Unlike the PRIMOS algorithm of Duarte et al. (33), which considers a gapless alignment of pseudo-dihedral angles for contiguous sequences, DIAL can handle fragmented queries and alignments with bulging nucleotides by means of gap insertion. DIAL alignment accounts for base-pairing similarity, known to be of primary importance in the manual curation of the SCOR database. Additionally, DIAL computes Boltzmann pair probabilities in the alignment, and can return suboptimal querytarget alignments.