Predicting consensus structures for RNA alignments via pseudo-energy minimization.

Thermodynamic processes with free energy parameters are often used in algorithms that solve the free energy minimization problem to predict secondary structures of single RNA sequences. While results from these algorithms are promising, an observation is that single sequence-based methods have moderate accuracy and more information is needed to improve on RNA secondary structure prediction, such as covariance scores obtained from multiple sequence alignments. We present in this paper a new approach to predicting the consensus secondary structure of a set of aligned RNA sequences via pseudo-energy minimization. Our tool, called RSpredict, takes into account sequence covariation and employs effective heuristics for accuracy improvement. RSpredict accepts, as input data, a multiple sequence alignment in FASTA or ClustalW format and outputs the consensus secondary structure of the input sequences in both the Vienna style Dot Bracket format and the Connectivity Table format. Our method was compared with some widely used tools including KNetFold, Pfold and RNAalifold. A comprehensive test on different datasets including Rfam sequence alignments and a multiple sequence alignment obtained from our study on the Drosophila X chromosome reveals that RSpredict is competitive with the existing tools on the tested datasets. RSpredict is freely available online as a web server and also as a jar file for download at http://datalab.njit.edu/biology/RSpredict.


Introduction
RNA secondary structure prediction has been studied for quite awhile. Many minimum free energy (MFE) methods have been developed for predicting the secondary structures of single RNA sequences, such as mfold, 1 RNAfold, 2 MPGAfold, 3 as well as recent tools presented in the literature. 4 However, the accuracy of predicted structures is far from perfect. Recently, a new concept of normalized free energy for predicting the secondary structures of single RNA sequences was introduced. 5 The normalized free energy of an RNA substructure is the free energy of that substructure divided by the length of its underlying sequence. A dynamic programming algorithm, called Densityfold, was developed, which delocalizes the thermodynamic cost of computing RNA substructures and improves on secondary structure prediction via normalized energy minimization. 5 Here, we extend the concept used in Densityfold and present a new tool, called RSpredict, for RNA secondary structure prediction. RSpredict computes the RNA structure with minimum normalized energy based on the loop decomposition scheme used in the nearest neighbor energy model. [6][7][8] The new tool focuses on the loops in an RNA secondary structure, whereas Densityfold considers RNA substructures where a substructure may contain several loops.
To understand the difference between the two tools, see Figure 1. In the fi gure, the substructure S contains six loops, denoted as L 1 , L 2 , L 3 , L 4 , L 5 , L 6 . RSpredict calculates the normalized free energy of the substructure S, denoted NE(S), by taking the sum of the normalized energies of the six loops. That is, On the other hand, Densityfold computes the normalized free energy of the substructure S as the sum of normalized energies of all its substructures including S. Referring to Figure 1, we see that S has six substructures, denoted as S 1 = S, S 2 , S 3 , S 4 , S 5 , S 6  While the normalized energy model creates a foundation for RNA secondary structure prediction, there are many limitations in Densityfold, just like in all other single sequence-based MFE methods. Optimal structures predicted by these methods do not necessarily represent real structures. This happens due to several reasons. The thermodynamic model may not be accurate. The bases of structural RNAs may be chemically modifi ed and these processes are not included in the prediction model. Finally, some functional RNAs may not have stable secondary structures. Thus, a more reliable approach is to use comparative analysis to compute consensus secondary structures from multiple related RNA sequences.
In general, there are three strategies with the comparative approach. The fi rst strategy is to predict the secondary structures of individual RNA sequences separately and then align the structures. Tools such as STRUCTURELAB 9 and RADAR 6 are based on this strategy. The second strategy predicts common secondary structures of two or more RNA sequences through simultaneous alignment and consensus structure inference. Tools based on this strategy include RNAscf, 10 Dynalign, 11 stemloc, 12 and CARNAC. 13 These tools utilize either folding free energy change parameters or stochastic context-free grammars (SCFGs) and are considered derivations of Sankoff's method. 14 The third strategy is to fold multiple sequence alignments. RNAalifold 15,16 uses a dynamic programming algorithm to compute the consensus secondary structure with minimum free energy by taking into account thermodynamic stability, sequence covariation together with RIBOSUMlike scoring matrices. Pfold 17 is a SCFG algorithm that produces a prior probability distribution of RNA structures. A maximum likelihood approach is used to estimate a phylogenetic tree for predicting the most likely structure for input sequences. A limitation of Pfold is that it does not run on alignments of more than 40 sequences and in some cases produces no structures due to under-fl ow errors. Maximum weighted matching (MWM), based on a graph-theoretical approach and developed by Cary and Stormo 18 and Tabaska et al, 19 is able to predict common secondary structures allowing pseudoknots. KNetFold 20 is a recently published machine learning method, implemented using a hierarchical network of k-nearest neighbor classifi ers that analyzes the base pairings of alignment columns in the input sequences through their mutual information, Watson-Crick base pairing rules and thermodynamic base pair propensity derived from RNAfold. 2 The method proposed in this paper, RSpredict, joins the many tools using the third strategy; it accepts a multiple alignment of RNA sequences as input data and predicts the consensus secondary structure for the input sequences by minimizing their pseudo-energy, which takes into account both the normalized free energy and covariance scores of the input sequences.
The rest of the paper is organized as follows. We fi rst describe the implementation and algorithms used by RSpredict, and analyze the time complexity of the algorithms. We then present experimental results of running the RSpredict tool as well as comparison with the existing tools. The experiments were performed on a variety of datasets. Finally we discuss some properties of RSpredict, possible ways to improve the tool and point out some directions for future research.

Methods
RSpredict, which can be freely downloaded from http://datalab.njit.edu/biology/RSpredict, was implemented in the Java programming language. The program accepts, as input data, a multiple sequence alignment in FASTA or ClustalW format and outputs the consensus secondary structure of the input sequences in both the Vienna style Dot Bracket format 16 and the Connectivity Table  format. 21 Below, we describe the normalized energy model adopted by RSpredict. We then present a dynamic programming algorithm for folding a single RNA sequence via normalized energy minimization. Next, we describe techniques for calculating covariance scores based on the input alignment. Finally we summarize the algorithms used by RSpredict, combining both the folding technique and the covariance scores obtained from the input alignment, and show its time complexity.

Folding of a single RNA sequence
We represent an RNA secondary structure as a fully decomposed set of loops. [6][7][8] In general, a loop L can be one of the following ( Fig. 1 We now introduce some terms and defi nitions. Let S be an RNA sequence consisting of nucleotides or bases A, U, C, G. S [i] denotes the base at position i of the sequence S and S [i, j] is the subsequence starting at position i and ending at position j in S. A base pair between nucleotides at positions i and j is denoted as (i, j) or (S [i], S [ j]), and its enclosed sequence is S [i, j]. Given a loop L in the secondary structure R of sequence S, the base pair (i*, j*) in L is called the exterior pair of L if S [i*] (S [ j*], respectively) is closest to the 5′ (3′, respectively) end of R among all nucleotides in L. All other non-exterior base pairs in L are called interior pairs of L. The length of L is the number of nucleotides in L. Note that two loops may overlap on a base pair. For example, the interior pair of a stack may be the exterior pair of another stack, or the exterior pair of a hairpin loop. Also note that a bulge or an internal loop has exactly one exterior pair and one interior pair.
We use the normalized energy concept as follows. Given a secondary structure R, every base pair (i, j) in R is the exterior pair of some loop L. We assign (i, j) and L a normalized energy, which is the free energy of loop L divided by the length of L. The set of free energy parameters for nonmultibranched loops used in our algorithm is acquired from Mathews et al. 8 The free energy of a multibranched loop is computed based on the approach adopted by mfold, 1 which is a linear function of the number of unpaired bases and the number of base pairs inside the loop, namely a b n c n + × + × 1 2 , where a, b, c are constants, n 1 is the number of unpaired bases and n 2 is the number of base pairs inside the multibranched loop. We adopt the loop decomposition scheme used in the nearest neighbor energy model developed by Mathews et al. 8 The secondary structure R contains multiple loop components and the normalized energies of the loop components are additive based on our defi nition of the normalized energy of a structure, as explained in the beginning of the Introduction section and illustrated in Figure 1. Our folding algorithm computes the total normalized energy of R by summing up all normalized energies of the loops in R. Thus, the RNA folding problem can be formalized as follows. Given an RNA sequence S, fi nd the set of base pairs (i, j) and loops with (i, j) as exterior pairs, such that the total normalized energy of the loops (or equivalently, the exterior pairs) is minimized. The set of base pairs constitutes the optimal secondary structure of S.
When generalizing the folding of a single sequence to the prediction of the consensus structure of a multiple sequence alignment, we introduce the notion of refi ned alignments. At times, an input alignment may have some columns each of which contains more than 75% gaps. Some tools including RSpredict delete these columns to get a refi ned alignment; 17 some tools simply use the original input alignment as the refi ned alignment. Suppose the original input alignment A o has N sequences and n o columns, and the refi ned alignment A has N sequences and n columns, n Յ n o . Formally, the consensus structure of the refi ned alignment A is a secondary structure R together with its sequence S such that each base pair (S [i], S [ j]), 1 Յ i Ͻ j Յ n, in R corresponds to the pair of columns i, j in the alignment A, and each base S [i], 1 Յ i Յ n, is the representative base of the ith column in the alignment A. There are several ways to choose the representative base. For example, S [i] could be the most frequently occurring nucleotide, excluding gaps, in the ith column of the alignment A. Furthermore, there is an energy measure value associated with each base pair (S [i], S [ j]) or more precisely its corresponding column pair (i, j), such that the total energy measure value of all the base pairs in R is minimized. The consensus secondary structure of the original input alignment A o is defi ned as the structure R o , obtained from R, as follows: (i) the base (base pair, respectively) for column C o (column pair (C o1 , C o2 ), respectively) in A o is identical to the base (base pair, respectively) for the corresponding column C (column pair (C 1 , C 2 ), respectively) in A if C o ((C o1 , C o2 ), respectively) is not deleted when getting A from A o ; (ii) unpaired gaps are inserted into R, such that each gap corresponds to a column that is deleted when getting A from A o (see Fig. 2).
In what follows, we fi rst present an algorithm for folding a single RNA sequence based on the normalized energy concept described here. We then generalize the algorithm to predict the consensus secondary structure for a set of aligned RNA sequences.
The denominator in Equation (2) is 4 because there are four nucleotides in the stack. In (4), n 1 is the number of unpaired bases and n 2 is the number of base pairs in the multibranched loop. Using Equations (1), (2), (3) and (4), the total normalized energy of all loops in R [i, j] where (i, j) is a base pair is computed by Equation (5): That is, the normalized energy NE p (i, j) is calculated by taking the minimum of the following four cases: i. (i, j) is the exterior pair of a hairpin, in which case the normalized energy NE P (i, j) equals E H (i, j), which is the normalized energy of the hairpin; ii. (i, j) is the exterior pair of a stack, in which case NE P (i, j) equals the normalized energy of the stack, i.e. E S (i, j), plus NE P (i + 1, j − 1); iii. (i, j) is the exterior pair of a bulge or an internal loop, in which case NE P (i, j) equals the minimum of the normalized energy of the bulge or internal loop loop, in which case NE P (i, j) equals the minimum of the normalized energy of the multi branched loop Equation (6) below shows the recurrence formula for calculating NE(i, j): , respectively) is the free energy (normalized energy, respectively) of the hairpin with exterior pair (i, j). iv. e s (i, j) (E S (i, j), respectively) is the free energy (normalized energy, respectively) of the stack with exterior pair (i, j) and interior pair , respectively) is the free energy (normalized energy, respectively) of the multibranched loop with exterior pair (i, j) and interior pairs ( , ),( , ), ,( , ).
, , , , , , ..., , ) That is, the normalized energy NE(i, j) is computed by taking the minimum of the following four cases: i. the total normalized energy of all loops in the optimal secondary structure R (Fig. 3a); ii. the total normalized energy of all loops in the optimal secondary structure R . 3b); iii. the total normalized energy of all loops in the optimal secondary structure R Note that case (iii) of Equation (6)

Complexity
In calculating the time complexity of the folding algorithm, there is a need to check for fi nding the optimal ′ ′ < ′ < ′ < , in case (iv), respectively) of Equation (5). Letting n be the number of nucleotides in the given sequence S, this checking would require O(n 4 ) time in case (iii) and O(n (k + 2) ) time in case (iv). This time complexity can be reduced by adopting the following implementation strategy.
We introduce two entries to be updated based on the cases in Equation (6): form a base pair (recall that an interior pair of some loop is the exterior pair of another loop). Figure 3. Illustration of the four cases in Equation (6) , , ) Then: Depending on which case in Equation (6) yields the value of NE(i, j), we have: With the two introduced entries, the case (iii) in Equation (5): and the case (iv) in Equation (5): can be combined and expressed by a single formula as follows: where n is the number of nucleotides in the given sequence S. The space complexity of the folding algorithm is O(n 2 ), since all energy values are stored in two dimensional tables. The normalized energy of the optimal secondary structure R for the sequence S equals NE(1, n).

Calculation of covariance scores
When applying the above folding algorithm to a multiple sequence alignment A o , we take into consideration the correlation between columns of the alignment. In many cases, the sequences in the alignment may have highly varying lengths. We refi ne the alignment A o by deleting columns Therefore, it takes linear time to compute containing more than 75% gaps to get a refi ned alignment A. 17 We will use this refi ned alignment throughout the rest of this subsection.

Covariance score
We use the covariance score introduced by RNAalifold 15 to quantify the relationship between two NE P (i, j) in Equation (5). Hence, the time complexity of the folding algorithm is O(n 3 ) since we need to columns in the refi ned alignment. Let f ij (XY) be the frequency of fi nding both base X in column i and base Y in column j, where X, Y are in the same row of the refi ned alignment. We exclude the occurrences of gaps in column i or column j when calculating f ij (XY). The covariation measure for columns i, j, denoted C ij , is calculated by Equation (9): Here, D ij (XY, X′Y' ) is the Hamming distance between the two base pairs (X, Y) and (X ′, Y ′) if both of the base pairs are standard base pairs, or 0 otherwise. The Hamming distance between (X, Y) and (X ′, Y ′) is calculated as follows: Observe that the information acquired from the two base pairs (X, Y ) and (X ′, Y ′) is the same as that from (X ′, Y ′) and (X, Y ). Thus, we divide the numerator in Equation (9) by 2 so as to obtain the non-redundant mutual information between column i and column j in the refi ned alignment.
For every pair of columns i, j in the refi ned alignment, the covariance score of the two columns i and j, denoted Cov ij , is calculated in Equation (12): Here, C ij is as defi ned in Equation (9), c 1 is a user-defi ned coeffi cient (in the study presented here, c 1 has a value of −1), and where N is the total number of sequences and NC ij is the total number of confl icting sequences in the refi ned alignment. A confl icting sequence is one that has a gap in column i or column j, or has a non-standard base pair in the columns i, j of the refi ned alignment. A sequence with gaps in both columns i, j is not confl icting.

Pairing threshold
We say that column i and column j in the refi ned alignment can possibly form a base pair if their covariance score is greater than or equal to a pairing threshold; otherwise, column i and column j are forbidden to form a base pair. The pairing threshold, η, used in RSpredict is calculated as follows.
It is known that, on average, 54% of the nucleotides in an RNA sequence S are involved in the base pairs of its secondary structure. 22 We use this information to calculate an alignmentdependent pairing threshold, observing that the base pairs in the consensus secondary structure of a sequence alignment represent the column pairs with the highest covariance scores. Given that different structures contain different numbers of base pairs, we consider two different percentages of columns, namely, 30% and 65%, in the sequence alignment. For each percentage p, there are at most T p possible base pairs, where and n is the number of columns in the sequence alignment. Now, we calculate the covariance scores of all pairs of columns in the given refi ned alignment, and sort the covariance scores in descending order. We then select the top T p largest covariance scores and store the covariance scores in the set ST p . Thus, the set ST 0.65 contains the top largest covariance scores that involve 65% of the columns in the refi ned alignment; the set ST 0.30 contains the top largest covariance scores that involve 30% of the columns in the refi ned alignment; and ST 0.65 \ST 0.30 is the set difference that contains covariance scores in ST 0.65 but not in ST 0.30 (see Fig. 4). The pairing threshold η used in RSpredict is calculated as the average of the  . . \ If the covariance score of columns i and j is greater than or equal to η, then column i and column j can possibly form a base pair, and we refer to (i, j) as a pairing column. If the covariance score of the columns i and j is less than η, we will check the covariance scores of the immediate neighboring column pairs of i, j to see if they are above a user-defi ned threshold 20 (in the study presented here, this threshold is set to 0). The immediate neighboring column pairs of i, j are i + 1, j −1 and i − 1, j + 1. If the covariance scores of both of the immediate neighboring column pairs of i, j are greater than or equal to max{η, 0}, then (i, j) is still considered as a paring column.

Algorithms for RSpredict
Given a refi ned multiple sequence alignment A with N sequences, let (i, j) be a pairing column in A. Let Here, c 2 is a user-defi ned coeffi cient (in the study presented here, c 2 = −1). Thus, every pairing column in the refi ned alignment A has a pseudoenergy value. We then apply the minimum normalized energy folding algorithm described in the beginning of this section to the refi ned alignment A, treating each pairing column in A as a possible base pair considered in the folding algorithm.
Notice that when calculating the normalized energy for the loop L, the sequence S is in the refi ned alignment A, which may have fewer columns than the original input alignment A o (cf. Fig. 2). RSpredict computes all normalized energies based on the refi ned alignment, and the program uses loop lengths from the refi ned alignment A rather than the original input alignment A o . Let R be the consensus secondary structure, computed by RSpredict, for the refi ned alignment A. We obtain the consensus structure R o of the original input alignment A o by inserting unpaired gaps to the positions in R whose corresponding columns are deleted when getting A from A o (cf. Fig. 2). The following summarizes the algorithms for RSpredict: 1. Input an alignment A o in FASTA or ClustalW format. 2. Delete the columns with more than 75% gaps from A o to obtain a refi ned alignment A. 3. Compute the pseudo-energy e ij for every pairing column (i, j) in A as in Equation (16). 4. Run the minimum normalized energy folding algorithm on A, using the pseudo-energy values obtained from Step (3) to produce the consensus secondary structure R of the refi ned alignment A. The base at position i of the consensus secondary structure R is the most frequently occurring nucleotide, excluding gaps, in the ith column of the refi ned alignment A. 5. Map the consensus structure R back to the original alignment A o by inserting unpaired gaps to the positions of R whose corresponding columns are deleted in Step (2).
Notice that Equation (6) is used to compute the NE values only. To generate the optimal structure R in Step (4), we maintain a stack of pointers that point to the substructures of loops with minimum normalized energy as we compute the NE values. Once all the NE values are calculated and the normalized energy of the optimal secondary structure R is obtained, we pop up the pointers from the stack to extract the optimal predicted structure. In Step (5), we map the bases (base pairs, respectively) for the columns (column pairs, respectively) in A to their corresponding columns (column pairs, respectively) in A o . For example, consider Figure 2 again. In the fi gure, the refi ned alignment A is obtained by deleting column 4 from the original input alignment A o . The bases for columns 1, 2, 3, 4 in A are mapped to columns 1, 2, 3, 5 in A o . The base pair between column 1 and column 9 in A becomes the base pair between column 1 and column 10 in A o ; the base pair between column 2 and column 8 in A becomes the base pair between column 2 and column 9 in A o . An unpaired gap is inserted to the position corresponding to the deleted column 4 in A o .
Let N be the number of sequences and n o be the number of columns in the input alignment A o .

Results
We conducted a series of experiments to evaluate the performance of RSpredict and compared it with three related tools including KNetFold, Pfold and RNAalifold. We tested these tools on Rfam 23 sequence alignments with different similarities. The Rfam sequence alignments come with consensus structures. For evaluation purposes, we used the Rfam consensus structures as reference structures and compared them against the consensus structures predicted by the four tools. The similarity of a sequence alignment is determined by the average pairwise sequence identity (APSI) of that alignment. In the study presented here, a sequence alignment is of high similarity if its APSI value is greater than 75%, is of medium similarity if its APSI value is between 55% and 75%, or is of low similarity if its APSI value is less than 55%. The data sets used in testing included 20 Rfam sequence alignments of high similarity and 36 Rfam sequence alignments of low and medium similarity. These data sets were chosen to form a collection of sequence alignments with different (low, medium and high) APSI values, different numbers of sequences, as well as different sequence alignment lengths. More specifi cally, the data sets contained sequence alignments that ranged in size from 2 to 160 sequences, in length from 33 to 262 nucleotides and had APSI values ranging from 42% to 99%. We also tested the tools on a multiple sequence alignment obtained from our study on the Drosophila X chromosome. 24,25 The Drosophila data set has a reference structure, obtained from biochemical and other methods that are different from the algorithms used in the tools under analysis. The performance measures used in our study include sensitivity (SN) and selectivity (SL), 20 where Here TP is the number of correctly predicted base pairs ("true positives"), FN is the number of base pairs in a reference structure that were not predicted ("false negatives") and FP is the number of incorrectly predicted base pairs ("false positives"). False positives are classifi ed as inconsistent, contradicting or compatible. 20 In predicting the consensus secondary structure for a multiple sequence alignment, a predicted base pair (i, j) is inconsistent if column i in the alignment is paired with column q, q ≠ j, or column j is paired with column p, p ≠ i, and p, q form a base pair in the reference structure of the alignment. A base pair (i, j) is contradicting if there exists a base pair (p, q) in the reference structure of the alignment, such that i Ͻ p Ͻ j Ͻ q or p Ͻ i Ͻ q Ͻ j. A base pair (i, j) is compatible if it is a false positive but is neither inconsistent nor contradicting. The ξ in SL represents the number of compatible base pairs, which are considered neutral with respect to algorithmic accuracy. Therefore ξ is subtracted from FP. Finally, we used the Matthews correlation coefficient (MCC) to combine the sensitivity and selectivity, where MCC is approximated to the geometric mean of the two measures, i.e. MCC SN SL ≈ × . The larger MCC, SN, SL values a tool has, the better performance that tool achieves and the more accurate that tool is.

Performance evaluation on Rfam alignments of high similarity
The fi rst data set consisted of seed alignments of high similarity taken from 20 families in Rfam.
The APSI values of these seed alignments ranged from 77% to 99%. The alignments ranged in size from 2 to 160 sequences and in length from 33 to 159 nucleotides. Table 1 presents details concerning the 20 families and their seed alignments.
All four tools including RSpredict, KNetFold, RNAalifold and Pfold were tested on this data set. Table 2 presents the values of Matthews correlation coeffi cient (MCC), sensitivity (SN) and selectivity (SL), as well as their mean and standard deviation for each of the four tools. Figure 5 compares the four tools based on these values. We use ROC (receiver operating characteristic) plots to simultaneously display both sensitivity and selectivity for each tool.
It can be seen from Table 2 and Figure 5 that RSpredict performed the best. The Pfold tool had good performance in selectivity but did not perform well in sensitivity and as a result in Matthews correlation coeffi cient. It also suffered from a size limitation (the Pfold web server can accept a multiple alignment of up to 40 sequences). Only 17 out of the 20 sequence alignments used in the experiment were accepted by the Pfold server; the other three alignments (RF00386, RF00041 and RF00389) had more than 40 sequences and therefore could not be run on the Pfold server. RSpredict had stable performance with the best mean 0.85 (standard deviation 0.16, respectively) in MCC, while the other methods' MCC values varied a lot and had means (standard deviations, respectively) ranging from 0.71 to 0.82 (0.24 to 0.28, respectively).

Performance evaluation on Rfam alignments of medium and low similarity
In the second experiment, we compared RSpredict with the other three methods on multiple sequence alignments of low and medium similarity. The test dataset included seed alignments of 36 families taken from Rfam. 23 The average pairwise sequence identity (APSI) values of the seed alignments ranged from 42% to 75%, the number of sequences in the alignments ranged from 3 to 114, and the alignment lengths ranged from 43 to 262 nucleotides. Table 3 presents details concerning the 36 families and their seed alignments. Table 4 presents the values of Matthews correlation coeffi cient (MCC), sensitivity (SN) and selectivity (SL), respectively, as well as their mean and standard deviation for each of the four tools. Figure 6 shows ROC (receiver operating characteristic) plots, which simultaneously display both sensitivity and selectivity for each tool.
Comparing Table 2 and Table 4, we see that the methods under analysis generally performed better on sequence alignments of medium and low similarity than on sequence alignments of high similarity. RSpredict outperformed the other three methods based on the three performance measures used in the experiment. The tool achieved a high mean value of 0.94 in MCC, better than those of KNetFold (0.86), Pfold (0.88) and RNAalifold (0.89). Similar results were observed for sensitivity and selectivity values. Furthermore, RSpredict exhibited stable performance across all the families tested in the experiment. The tool had an MCC, SN and SL standard deviation of 0.08, 0.09 and 0.08, respectively. These numbers were better than the standard deviation values obtained from the other three methods, which ranged from 0.12 to 0.25. Due to the restriction on the alignment size imposed by the Pfold server, only 27 alignments out of 36 could be run on Pfold. The other nine alignments had more than 40 sequences and hence could not be accepted by the Pfold server.

Performance evaluation on the Drosophila dataset
The male-specifi c lethal (MSL) complex, which includes two noncoding RNAs on X (roX1 and roX 2 RNAs), induces histone H4-Lys16 acetylation for twofold hypertranscription of the male X chromosome in Drosophila melanogaster. 24,25 We applied all four methods including RSpredict, KNetFold, Pfold and RNAalifold to predicting a common secondary structure of roX 2 RNA. Among the sequences of the roX 2 gene (1.4 kb) found from nine different The best MCC value for each family, and the best mean and standard deviation are highlighted in bold. Note that only 17 out of the 20 sequence alignments in this experiment were accepted as input to the Pfold server. The other three (RF00386, RF00041 and RF00389) had more than 40 sequences and therefore could not be run on the server. Note also that RSpredict has the best standard deviation in MCC, SN and SL respectively compared with the other three tools. This shows that RSpredict has stable performance across all the families tested in the experiment.
Drosophila species, an evolutionarily conserved functional domain (71 nt) was characterized and predicted as a stem-loop structure by RSpredict, KNetFold and RNAalifold respectively (Figs. 7a, 7b and 7d respectively). This result affi rmed what our lab experiments revealed. 24 Pfold predicted a less accurate structure with a shorter stem (Fig. 7c).

Discussion
In this paper we presented a software tool, called RSpredict, capable of predicting the consensus secondary structure for a set of aligned RNA sequences via pseudo-energy minimization. Our experimental results showed that RSpredict is competitive with some widely used tools including RNAalifold and Pfold on tested datasets, suggesting that RSpredict can be a choice when biologists need to predict RNA secondary structures of multiple sequence alignments. Notice that RSpredict differs from our previously developed KNetFold 20 in that KNetFold is a machine learning method that relies on pre-compiled training data derived from existing RNA secondary structures. RSpredict, on the other hand, is based on a dynamic programming algorithm for folding sequences and does not utilize training data. RSpredict adopts the normalized energy concept originated from Densityfold, whose web server is named alteRNA. 5 We described the difference between RSpredict and Densityfold in the beginning of the Introduction section. Unlike RSpredict that predicts consensus structures for multiple sequence alignments, alteRNA predicts secondary structures for single sequences. Thus, one cannot do a direct comparison of performance between the two tools. We implemented the idea in alteRNA in a program that uses the same covariance scores as RSpredict to predict consensus structures for multiple sequence alignments. The performance of this program is inferior to those of the four tools studied in the paper.
RSpredict contains two user-defi ned parameters, c 1 in Equation (12) and c 2 in Equation (16). In the study presented here, both c 1 and c 2 were fi xed at −1. Changing the value of c 1 or c 2 would affect the accuracy of the predicted structure. More specifi cally, when c 1 is smaller than -1, the accuracy degrades slowly. When c 1 is set to 0, the predicted  Table 1. We use ROC plots to simultaneously display the sensitivity and selectivity values achieved by each tool. The X-axis shows the selectivity values, whereas the Y-axis shows the sensitivity values. consensus structures are less accurate than when c 1 is set to −1. This shows the impact that confl icting sequences in an alignment have on consensus structure prediction, since when c 1 is 0, there is no weight for the confl icting sequences in covariance score calculation. When c 1 is set to a positive number, RSpredict generates less accurate consensus structures. The same behavior was observed for c 2 . In practice, if a few more base pairs are predicted or omitted, given that all the others are in the right places, the predicted structure would not be optimal, i.e. its pseudo-energy would not be minimized. A tool predicting fewer base pairs would have a lower sensitivity value whereas a tool predicting more base pairs might have a lower selectivity value. It was observed that the accuracy of the consensus structures produced by RSpredict varies with sequence lengths, though there is no clear trend between the accuracy and the lengths. For example, referring to Tables 3 and 4, the seed alignment of RF00230 (RF00559 and RF00468, respectively) has a length of 262 nt (81 nt and 66 nt, respectively) and MCC value of 0.83 (0.91 and 0.65, respectively), showing no direct relationship between the alignment length and the MCC.
Finally, we compared the computational time required by the four tools studied in the paper. Pfold and KNetFold are available as web servers. We submitted the same multiple alignments having a length of 100 nt, 200 nt and 300 nt, respectively to each of the four web servers. Results were reported back by KNetFold (Pfold, RNAalifold, RSpredict, respectively) in 605, 660, 1125 seconds (10, 15, 33 seconds, 9, 10, 11 seconds, 9, 19, 76 seconds, respectively). While the speed of a web server also depends on network traffi c and server workload, these timing data give an estimate of how fast each tool can run. The RSpredict program needs 0.3 MB RAM (1.3 MB and 2.9 MB, respectively) to perform the computation for the input alignment whose length is 100 nt (200 nt and 300 nt, respectively) measured on a PC with Intel(R) Core(TM)2 Duo CPU (2.19 GHz/ 2.00 GB RAM/Microsoft Windows XP). The RSpredict web server can accept multiple  Table 3. We use ROC plots to simultaneously display the sensitivity and selectivity values achieved by each tool. The X-axis shows the selectivity values, whereas the Y-axis shows the sensitivity values.  B) The consensus structure predicted by KNetFold. C) The consensus structure predicted by Pfold. D) The consensus structure predicted by RNAalifold. RSpredict, KNetFold and RNAalifold, all with MCC = 1, performed better than Pfold with MCC value of 0.88. Note that each gap in the consensus structure in (a) represents a column with more than 75% gaps in the roX2 RNA sequence alignment. All the columns with more than 75% gaps are deleted to obtain the refi ned alignment based on which RSpredict generates the consensus structure.
alignments with at most 200 sequences in size and at most 500 nucleotides in length, though the downloadable version does not have this restriction. All the four tools studied in this paper take multiple sequence alignments as input data. We are looking into algorithms for enhancing the quality of multiple sequence alignments so as to improve on secondary structure prediction. One approach is to take into consideration the effect of base-pair covariation in the alignment process, like the Murlet tool. 26 There are other approaches for obtaining better sequence alignments. For example, BlockMSA 27 uses a combination of a biclustering algorithm and a divide-and-conquer technique. Groups of similar sequences are found and subsequences within them are locally aligned. The fi nal alignment is obtained by dividing both the set of sequences and their contents. 27 Other ways to improve the prediction accuracy include the utilization of evolutionary information, more sophisticated models of covariance scoring, and training data for more accurate pairing thresholds. Finally, by integrating our previously developed RADAR server 6 with RSpredict, we plan to offer our software as a suite of tools to the user as done in some recent work. 28