Discovery of Novel ncRNA Sequences in Multiple Genome Alignments on the Basis of Conserved and Stable Secondary Structures

Recently, non-coding RNAs (ncRNAs) have been discovered with novel functions, and it has been appreciated that there is pervasive transcription of genomes. Moreover, many novel ncRNAs are not conserved on the primary sequence level. Therefore, de novo computational ncRNA detection that is accurate and efficient is desirable. The purpose of this study is to develop a ncRNA detection method based on conservation of structure in more than two genomes. A new method called Multifind, using Multilign, was developed. Multilign predicts the common secondary structure for multiple input sequences. Multifind then uses measures of structure conservation to estimate the probability that the input sequences are a conserved ncRNA using a classification support vector machine. Multilign is based on Dynalign, which folds and aligns two sequences simultaneously using a scoring scheme that does not include sequence identity; its structure prediction quality is therefore not affected by input sequence diversity. Additionally, ensemble defect was introduced to Multifind as an additional discriminating feature that quantifies the compactness of the folding space for a sequence. Benchmarks showed Multifind performs better than RNAz and LocARNATE+RNAz, a method that uses RNAz on structure alignments generated by LocARNATE, on testing sequences extracted from the Rfam database. For de novo ncRNA discovery in three genomes, Multifind and LocARNATE+RNAz had an advantage over RNAz in low similarity regions of genome alignments. Additionally, Multifind and LocARNATE+RNAz found different subsets of known ncRNA sequences, suggesting the two approaches are complementary.


Introduction
Traditionally, RNA was considered to simply be important in expressing proteins. The discovery of a wide range of RNA molecules that do not function as intermediates of protein translation has changed that view. RNA sequences are involved in important biological functions

Structure Determination
Multilign was used to determine common structures among multiple sequences [17]. Multilign uses Dynalign, a program that folds and aligns two sequences simultaneously to find their common structures. In Multilign, Dynalign progressively constructs the consensus structure for multiple sequences. Among the input sequences, one sequence is chosen as the index sequence to participate in pairwise Dynalign calculations with each other sequence. Base pairs are only allowed in the index sequence if they are contained in a set of low free energy structures predicted by Dynalign with each other sequence. In the final iteration of refinement, Multilign folds the index sequence, where its structure is well-determined, with each other sequence with Dynalign calculations to determine the common structure.
For single-sequence structure prediction, Fold [20] and MaxExpect [21] in the RNAstructure package [22] were used. The free energy changes were calculated using the most recent nearest neighbor parameter set [20,23,24], with the exception that the parameter for adding an additional helix to a multibranch loop was set to -0.6 kcal/mol to be consistent with the estimate based on optical melting experiments [25].

SVM implementation and usage
The SVM implementation LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used. LIBSVM implements SVM formulations both for classification and regression analysis [26]. Each of these implementations has a set of parameters that need to be optimized. In this study, ε-support vector regression (ε-SVR) using the radial basis function (RBF) kernel was used for regression analyses. This formulation has three parameters to optimize, C, ε and γ. Classification analyses used the c-support vector classification (c-SVC) with the RBF kernel that has two parameters (C and γ) to optimize. LIBSVM provides two python scripts (grid.py and gridregression.py) that were used to optimize the parameters by searching for their optimal values in user-specified grids. Parameter values were evaluated according to 5-fold cross validation on the training data sets.

Features for Distinguishing ncRNA
Multifind uses three features to distinguish ncRNA sequences from background sequences. These features are structural conservation index, average single sequence folding free energy Z score and average single sequence normalized ensemble defect Z score. Additionally, the average Shannon entropy for the sequence alignment provides context for the values of the three features that is important for classification accuracy.

Structure conservation index (SCI)
Structural conservation index (SCI) quantifies the structural conservation among RNA secondary structures [8]. It is defined as: where E c is the average of the folding free energies of the structures predicted by Multilign. E s is the average of the folding free energies of the structures predicted with Fold, a single sequence structure prediction tool in RNAstructure [22].
Average single sequence folding free energy Z score To quantify the significance of the thermodynamic stability of the structures predicted for single sequences, a Z score was used, i.e. the number of standard deviations the stability is different as compared to the mean of a suitable sample. To generate a sample to determine the background folding free energy change, the original sequence was shuffled, only maintaining the nucleotide frequency. The structures are predicted for each sequence and the Z score is then defined as: where E is the folding free energy change of an individual sequence predicted on single sequence, μ is the average folding free energy change of the shuffled sequences and σ is the standard deviation of the folding free energy changes of the shuffled sequences. Calculating the Z score by shuffling sequences, however, is computationally costly. SVMs were used to predict the Z scores for single sequences, as done previously [8]. First, 17,303 sequences were generated with length from 30 to 150 nucleotides, GC content from 25% to 75%, G/GC ratio from 25% to 75% and A/AU ratio from 25% to 75%. Then each sequence was shuffled 1,000 times to get the average and the standard deviation of the folding free energy changes of the shuffled sequences for each target sequence. Two separate regression SVMs were trained to predict average and standard deviation. The inputs to each SVM are GC content, G/GC ratio, A/AU ratio and sequence length.
Average single sequence normalized ensemble defect Z score Functional RNAs are not only constrained to fold into thermodynamically stable structures. To function, the RNA structural conformational space needs to be well constrained to one or at most a few dominant structures. Prior analysis showed natural occurring RNA sequences have well-constrained conformational spaces compared to random sequences with the same nucleotide content [27]. To describe compactness of folding space of RNAs, the distance, d, between two structures s 1 and s 2 of a sequence is defined as: where i and j are indexes of nucleotide position and N is the sequence length. S i,j (s 1 ) = 1 if base pair i-j is in structure s 1 , and is 0 otherwise. Similarly, S i,N+1 (s 1 ) = 1 if i is unpaired and 0 otherwise. It is clear d(s 1 , s 2 ) = N if every nucleotide in the sequence adopts a different conformation in two structures and d(s 1 ,s 2 ) = 0 if two structures are completely identical. This formulation of distance between structures is chosen for the convenience of calculating ensemble defect, as shown below. The distance of one structure, s, from its own thermodynamic ensemble therefore can be defined as: where O is the set of all the possible structures of the sequence. p(σ,O) is the probability of structure σ in O, which can be obtained by calculating the partition function. It can be shown that: where P(i,s) is the probability of nucleotide i adopting the specific conformation in structure s, i. e. the probability of nucleotide i being base paired with the specific nucleotide in structure s or the probability of nucleotide i being unpaired if i is unpaired in structure s. The term, n(s,O), is the ensemble defect of structure s, and n(s, O) /N is the normalized ensemble defect [28]. MaxExpect in RNAstructure predicts the lowest ensemble defect structure, which can be used to define s. For this structure, the MaxExpect score is n(s,O), when γ equals one [21].
To infer the significance of the normalized ensemble defect, SVMs were trained to predict the average normalized ensemble defect and normalized ensemble defect standard deviation of the shuffled sequences. The input sequences were the same sequence set used for determining average folding free energy change and its standard deviation for SVM training. The average ensemble defect Z score of all the sequences was included as a feature in the classification SVM.

Shannon entropy
Although SCI is useful for predicting sequences that are ncRNA, when sequences have high identity, SCI is close to one and therefore it cannot identify ncRNAs effectively. The folding free energy change Z score and ensemble defect Z score predicted on single sequences are then more meaningful to identify ncRNAs. For the SVM to put the correct emphasis on these features, a feature that can describe the diversity of the aligned sequences, average Shannon entropy, was included [29]: where S i is the Shannon entropy of one column, A is the set of all the columns in an alignment, N is the number of columns in an alignment.
where C is all the types of characters in a RNA alignment, C = {A,C,G,U,-} and p k is the frequency of character k in the column i. The higher the Shannon entropy, the more diverse the sequences in the alignment are.

Machine Training and Evaluation
The four features (structural conservation index, average single sequence folding free energy Z score, average single sequence normalized ensemble defect Z score and Shannon entropy) were included in ncRNA classification training. Training data were drawn from the Rfam database 10.1 [18,19]. All the Rfam sequence families with average lengths from 30 to 150, with over 25 members, and with conserved structures were chosen for obtaining sequences. This provided 164 families. In each family, an equal number of groups containing from 3 to 6 sequences were randomly selected with replacement. The number of sequence groups drawn from each family was proportional to the size of the family. In total, 22,308 sequence groups were drawn to constitute the positive training set. Then each group of sequences was aligned using ClustalW [30]. For each alignment generated, all the columns were randomly shuffled to build up the negative training set with the exact same size as the positive training set. The complete data set of positive and negative controls for training and testing is provided as Supporting Information (S1 and S2 Files). SVM training was run on these alignments using the '-b 1' option. This trained the SVM to output probability of an alignment being ncRNA, which provides the information needed for using a threshold for classification.
Genomic testing data were generated by cutting genome alignments into windows. Then ncRNA detection methods were run on the windows. Three genome alignments were used for benchmarking, which are (1) Escherichia coli (RefSeq Accesssion: NC_000913) aligned with Salmonella typhi (NC_004631), Salmonella paratyphi (NC_011147), Shigella boydii (NC_ 010658) and Klebsiella pneumonia (NC_011283), (2) Streptomyces coelicolor (NC_003888) aligned with Streptomyces avermitilis (NC_03155) and Streptomyces griseus (NC_010572) and (3) Saccharomyces cerevisiae (NC_001133) aligned with Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces kudriavzevii, Saccharomyces bayanus, Saccharomyces castellii and Saccharomyces kluyveri. The sequences for the E. coli and S. coelicolor alignments were downloaded from NCBI RefSeq database [31], and both alignments were generated using the "progressiveMauve" command in Mauve [32] with no extra options other than the input sequences. The Saccharomyces cerevisiae alignment was downloaded from the UCSC genome browser [33]. For all the alignments, only the alignment blocks that include all the input sequences and are in intergenic regions of the E. coli, the S. coelicolor or the S. cerevisiae genome were kept for subsequent processing and analysis. The intergenic regions' coordinates of the E. coli and the S. cerevisiae genomes were inferred from the coordinates of the genes included in the RefSeq files. The intergenic regions' coordinates of the S. coelicolor genome were provided by Vockenhuber et al. [13]. Then all the alignment blocks were cut into 100 nt windows with 50 nt step size. Known ncRNA locations in E. coli and S. coelicolor genomes were acquired from the Rfam database 10.1 [18,19]. Additional known ncRNAs in the S. coelicolor genome identified by deep sequencing experiments were also included [13]. ncRNA locations in S. cerevisae were acquired from the RefSeq file.

Scoring
For ncRNA prediction, there are two criteria to evaluate the prediction: First is the fraction of the real ncRNAs detected. This is the sensitivity: where TP is the true positives (ncRNAs correctly classified as ncRNA) and FN is the false negatives (ncRNA incorrectly classified as not being ncRNA). The second criterion is the fraction of the non-ncRNAs correctly classified as not ncRNA, the specificity: where TN is the true negatives (sequences that are not ncRNA and correctly classified as not being ncRNA) and FP is the false positive (sequences that are not ncRNA and correctly classified as being ncRNA). Because of the large lengths of genomes, a large number of false positives would be generated if the specificity is not high. Therefore, high specificity is critical.

Single Sequence Free Energy Z Score Estimation
Following previous practice, the single sequence folding free energy Z scores were used in Multifind to estimate folding stability [8]. The accuracy of the Z-score estimation by SVMs was evaluated by benchmarking on randomly generated sequences. 1,000 random sequences were generated with GC content, G/GC ratio and A/AU ratio randomly picked within the range of 25% to 75%, and each sequence was shuffled 1,000 times to generate a background set of sequences. The average folding free energy change and standard deviation in folding free energy change for the background sequences were calculated and were used to determine Z scores for the 1,000 random sequences. The SVMs were also used to predict the Z scores on these sequences. The predicted Z score was plotted against sampled Z score (S1A Fig) and was shown to be highly correlated by the linear correlation coefficients (R free energy z = 0.998). The correlation shows that these SVMs have high prediction accuracies for the folding free energy Z score.

Ensemble Defect
For an RNA sequence to be functional, it needs to be able to fold into stable structures. Additionally, the number of structures it can fold into needs to be limited [27]. The ensemble defect of a secondary structure describes how different the structure is from its alternatives, weighted by the ensemble probability [28]. By predicting the structure with the minimum ensemble defect of a RNA sequence, the compactness of its conformational space can be quantified. In this study, the mean predicted minimum ensemble defect Z score of all the input sequences was taken as an input for the SVM training. The accuracy of the Z score prediction, taken over the same data set and background sequences used to evaluate the prediction accuracy of the folding free energy change Z score, is illustrated in S1B Fig

ncRNA classification on Rfam dataset
The training dataset is composed of 164 families containing 22,308 real ncRNA alignments and the same number of negative control alignments, acquired by shuffling the columns of the ncRNA alignments. All the ncRNA alignments were acquired from Rfam seed alignments [18,19]. To test the classification method, a cross-validation approach was used, with four rounds. In each round, families were randomly chosen to form a testing test, containing roughly 10% of all the alignments in the dataset. Testing sets were chosen according to family identity, i.e. a family either appeared in the training or testing set but not both, which avoided homology between the training set and testing set. The four testing sets do not overlap, therefore they are completely independent. Receiver-operator characteristic curves (ROC curves) were plotted to demonstrate the quality of classification. These curves show the tradeoff in sensitivity (the fraction of ncRNA alignments correctly classified as ncRNA) and specificity (the fraction of shuffled alignments correctly classified as not being ncRNA) by plotting sensitivity as a function of false positive rate, i.e. 1-specificity. This is done by iterating over the probability threshold for which a sequence is classified as ncRNA. A perfect classifier would have a point in the upper-left-hand corner of the plot at 100% sensitivity and 100% specificity.
The three features, SCI, single sequence folding free energy Z score, and single sequence ensemble defect Z score provide information that can identify ncRNA as compared to sequences from shuffled genome alignments. ROC curves were plotted for each feature (Fig 1).
Four SVM classification machines were trained to output classification probability using four training sets and tested on the four testing sets. RNAz [8,14], LocARNATE+RNAz [14], Dynalign/SVM [10] and Multifind without ensemble defect Z score were also tested on the four testing sets. For the tests of Dynalign/SVM, two sequences from each alignment were randomly chosen because Dynalign/SVM is limited to two sequences as input. For RNAz calculations, the sequences were aligned using ClustalW, as done previously [8,14], and then this alignment was used as input. For LocARNATE+RNAz [14], multiple unaligned sequences are first taken as input to LocARNATE, which then outputs a structural alignment that can serve as input for RNAz. For each feature, cut offs were scanned to generate ROC curves. The hypotheses are that alignments with higher SCI, lower free energy Z scores and lower ensemble defect Z scores are more likely to be ncRNA. The ROC curves were generated using the entire Rfam dataset. ROC curves were generated for all four testing sets (Figs 2-5) and there was variability across the four sets. Multifind has higher sensitivity than RNAz and LocARNATE+RNAz at most specificities across the four testing sets. Because Dynalign/SVM can only take two sequences as input, its performance was not as good as Multifind, LocARNATE+RNAz or RNAz. For each testing set, plots were made for both all specificities and for the high-specificity regions (1-Specificity 0.10). For genome scans, the most important part of the ROC curve is the high-specificity region (Specificity > 0.98) because scans performed at low specificity would generate large numbers of false positives because of the relatively low prevalence of ncRNAs in genomes. In all sets, Multifind performed best in this high-specificity region, although RNAz performed similarly to Multifind on set two (Fig 3).
One hypothesis was that Multifind will perform better than RNAz on low similarity alignments because Multilign aligns and folds multiple sequences simultaneously. To test this hypothesis, each testing set was divided into two categories with Shannon entropy larger or smaller than 0.3, and the accuracies of the classifiers measured using ROC curves for each category (Figs 6-9). On testing sets with high entropy, Multifind has a distinct advantage over RNAz. The advantage of LocARNATE+RNAz over RNAz is also apparent because LocAR-NATE aligns and folds sequences simultaneously. In most high-entropy testing sets, Multifind also has higher sensitivity than LocARNATE+RNAz at all specificities. At the highest specificities (Specificity > 0.98), Multifind outperforms LocARNATE+RNAz in all four data sets. Figs 4 and 8 also show that Multifind has some advantage over Multifind without ensemble defect on the 3rd testing set, therefore ensemble defect can provide independent predictive power.
A considerable number of ncRNAs are known in these genomes. Known ncRNA locations in E. coli and S. coelicolor genomes were acquired from the Rfam database 10.1 [18,19]. Additional ncRNAs in the S. coelicolor genome identified by deep sequencing experiments were also included [13]. ncRNA locations in S. cerevisae were acquired from the NCBI database [42]. The distribution of the lengths of the ncRNAs acquired from the above mentioned databases are provided in S1 Table. A window that has over 30% of its nucleotides overlapping with any ncRNA or which contains over 50% of nucleotides of a ncRNA was identified as a ncRNA window. The distribution of the windows in all the genome alignments according to the percentage of nucleotides that overlap with a ncRNA is provided in S2 Table. Multifind, LocARNATE+RNAz and RNAz were applied on these windows. To evaluate the results of these three methods, instead of plotting ROC curves, plots of true positives as a function of total number of predicted candidates were used. This is because it is unknown whether unannotated regions are truly not ncRNA. The ratio between true positives and total candidates is a predicted success rate, assuming that most predicted ncRNAs not annotated as ncRNA are false positives.
The plots for the benchmarks on these three genomes do not show an advantage for Multifind or LocARNATE+RNAz when all windows are considered (Figs 10-12). Further analysis, however, showed that Multifind, LocARNATE+RNAz and RNAz discover different known ncRNAs (Fig 13). Table 1 also shows that the true positives in the most probable candidates predicted by Multifind, LocARNATE+RNAz and RNAz have different mean sequence similarities. Multifind and LocARNATE+RNAz tend to predict alignments with high Shannon entropy to be ncRNA. This suggests Multifind and LocARNATE+RNAz have an advantage for prediction on low similarity windows, which corresponds to the benchmarks on the Rfam sequences. To test this hypothesis, all the windows of the yeast alignment were divided into low similarity (S<0.3) and high similarity (S>0.3) categories. True candidates versus predicted candidate curves were plotted on these two sets of windows separately (Fig 14). Results showed, for high similarity windows, RNAz shows a clear advantage, but for low similarity windows, Multifind and LocARNATE+RNAz performed better.

Time Consumption
Multifind inherently scales O(N 6 M) for M sequences of length N. In spite of the use of heuristics to accelerate the calculation [10,43], its time consumption was higher than for RNAz, which scales O(N 3 ) and LocARNATE, which empirically scales O(N 4 M 2 ). To quantify the time usage of Multifind, LocARNATE and RNAz, two benchmarks were done on 100 randomlychosen alignments from the Rfam training set and 100 randomly-chosen alignments from the yeast data set. The results ( Table 2) showed Multifind consumes more time on the Rfam training set than LocARNATE+RNAz and RNAz. But on the yeast data set, Multifind consumes about as much time as LocARNATE+RNAz. The difference in time required by Multifind on two data sets of about the same size was because of the alignment envelope [43] that constrains Multifind's alignment space based on a Hidden Markov Model posterior alignment probability between sequence pairs. Sequence pairs in the yeast data set have much higher percentage identity, hence a much more concentrated posterior alignment probability and a narrower alignment envelope. Discussion A ncRNA detection method called Multifind, based on Multilign, was developed. The benchmarks on alignments extracted from Rfam show that Multifind performed better overall on Rfam testing sets than RNAz and LocARNATE+RNAz. Its advantage is more obvious on lowidentity alignments, where it performs better than RNAz and similarly to LocARNATE +RNAz. Benchmarks on genomes, however, showed that RNAz is more effective overall in detecting known ncRNAs in genome alignments. Further analysis showed Multifind and LocARNATE were more sensitive at discovering known ncRNAs in low-similarity genome alignment regions, and Multifind, LocARNATE+RNAz and RNAz independently predict a significant number of non-overlapping candidates. The latter point was also shown by the study of Vockenhuber et al. [13] where Dynalign and RNAz were compared.
The above results suggested there is no single best method for ncRNA discovery in genomes; different methods independently provide different correct candidates (Fig 13). Multifind, RNAz and LocARNATE+RNAz are applicable on genome-alignment regions with different sequence similarities. Multifind and LocARNATE+RNAz apply better on regions with low similarity, and RNAz applies better on regions with high similarity. The benchmark showed, for the yeast genome alignment, an average Shannon entropy of 0.25 would be a reasonable threshold for applying different methods. Interestingly, among the 74,484 windows of yeast alignment, 88% (65,886) are low similarity and therefore only 12% (8,597) are high similarity. But low-similarity windows include only 47% of all the known ncRNAs, showing an enrichment of known ncRNAs in high-similarity windows. Therefore, it can be argued that, because it is more convenient to search for functional elements in highly conserved regions in genomes, there are possibly more unknown ncRNAs in low-similarity genome-alignment regions. Also, discovering ncRNAs in low-similarity alignments presents a technical barrier that cannot be overcome without paying a computational price. Multifind is therefore a beneficial tool in For benchmarks on genomes, the three methods are all applied on sliding windows with the size of 100 nucleotides. Performing de novo ncRNA discovery on windows is a common practice to limit the computational cost for scanning [44,45]. This practice does not necessarily  In addition to using Multilign, Multifind introduces an additional discriminating feature that has not been previously used for ncRNA discovery, ensemble defect. This feature quantifies the compactness the folding space of a putative ncRNA. An SVM trained with this feature can outperform a method trained without this feature for some data sets (Figs 4 and 8). This Discovery of Novel ncRNA Sequences in Multiple Genome Alignments supports the hypothesis that ncRNA sequences will fold specifically to up to only a few structures.
Multifind is available as part of the RNAstructure [22] package (http://rna.urmc.rochester. edu/RNAstructure.html). It is provided under the GNU public license.

Conclusions
A new method, Multifind, that identifies conserved ncRNA from input unaligned sequences was developed. Benchmarks on Rfam datasets showed Multifind performs better than RNAz and LocARNATE+RNAz, especially on dissimilar sequences. Benchmarks on genomes also showed Multifind and LocARNATE+RNAz are more successful than RNAz on alignments of dissimilar sequences. Because each of the three methods finds a distinct subset of the known ncRNAs, a comprehensive search for ncRNAs would use all three tools.    Supporting Information S1 Fig. ΔG°Z Scores. (A) and ensemble defect Z Scores (B) predicted using the SVM compared with calculated from shuffling 1000 times on 1000 randomly generated sequences. The correlation between predicted ΔG°Z Score and sampled ΔG°Z Score was R = 0.998. The correlation between predicted ensemble defect Z Score and sampled ensemble defect Z Score was R = 0.999. (EPS) S1