Dynalign II: common secondary structure prediction for RNA homologs with domain insertions

Homologous non-coding RNAs frequently exhibit domain insertions, where a branch of secondary structure is inserted in a sequence with respect to its homologs. Dynamic programming algorithms for common secondary structure prediction of multiple RNA homologs, however, do not account for these domain insertions. This paper introduces a novel dynamic programming algorithm methodology that explicitly accounts for the possibility of inserted domains when predicting common RNA secondary structures. The algorithm is implemented as Dynalign II, an update to the Dynalign software package for predicting the common secondary structure of two RNA homologs. This update is accomplished with negligible increase in computational cost. Benchmarks on ncRNA families with domain insertions validate the method. Over base pairs occurring in inserted domains, Dynalign II improves accuracy over Dynalign, attaining 80.8% sensitivity (compared with 14.4% for Dynalign) and 91.4% positive predictive value (PPV) for tRNA; 66.5% sensitivity (compared with 38.9% for Dynalign) and 57.0% PPV for RNase P RNA; and 50.1% sensitivity (compared with 24.3% for Dynalign) and 58.5% PPV for SRP RNA. Compared with Dynalign, Dynalign II also exhibits statistically significant improvements in overall sensitivity and PPV. Dynalign II is available as a component of RNAstructure, which can be downloaded from http://rna.urmc.rochester.edu/RNAstructure.html.

This document provides the Supplementary Materials for the paper (1). The first section provides the complete and detailed recursions for Dynalign II and an overview of the algorithm. The second section gives an example comparing the prediction results of Dynalign and Dynalign II on a tRNA sequence pair demonstrating the improvement brought by Dynalign II. The third section summarizes results from the grid search performed for selection of the affine gap penalty parameters for inserted domains. The fourth section provides accuracy results under an exact base pair match requirement instead of the criterion used for the results in the main manuscript. The fifth section provides sensitivity and PPV values for sequences stratified by percent identity, and the final section reports the p-values for the statistical significance of the improvements in Dynalign II over Dynalign and Dynalign II over Fold.

The Complete Recursions for Dynalign II
The dynamic programming recursions for Dynalign II operate on two four dimensional arrays: V(i, j, k, l) and W(i, j, k, l) and six two dimensional arrays W3(i, k), W5(i, k), W1 single (i, j), W2 single (k, l), WE1 single (i, j), and WE2 single (k, l) that were defined in the Methods section in the main manuscript. Among these, prior to the main Dynalign II recursions, the arrays, W1 single (i, j), W2 single (k, l), WE1 single (i, j) and WE2 single (k, l) are initialized by single sequence structure prediction algorithms, specifically the minimum G methods programmed in the RNAstructure package (2). V(i,j,k,l) and W(i,j,k,l) are filled for j up to 2N 1 -1 and l up to 2N 2 -1. V and W array members with index j and l bigger than N 1 and N 2 represent fragments including both the 5' and the 3' ends of the two sequences, called exterior fragments. Array members with all the indices smaller than N 1 and N 2 represent consecutive nucleotides joined by phosphodiester bonds, called interior fragments. The exterior fragment arrays are filled to facilitate the determination of suboptimal structures and energy dot plots.

Detailed Recursions for V(i, j, k, l):
where penalty(i, j) is the penalty term at the end of a helix that applies to AU or GU base pairs (3,4). If either the base pairs i-j or k-l are forbidden, V(i, j, k, l) is set to "infinity", i.e., a large positive value. In the above two equations, V hairpin (i,j,k,l) is for closing hairpin loops and V exterior (i,j,k,l) is for closing exterior loops. V hairpin (i,j,k,l) is used in interior fragments and V exterior (i,j,k,l) is used in exterior fragments.
V hairpin (i, j, k, l) considers two hairpins closed by base pairs i-j and k-l: where ∆G°h airpin (i, j) is the G of the hairpin closed by base pair i-j. V internal/stack (i, j, k, l) considers conserved internal loops/bulge loops/helix extensions in both sequences closed by base pair i-j and base pair k-l: The search over the indices a, b, c and d that determine the length of the conserved internal/bulge is constrained to an interval of 20 possibilities each to limit the overall computational complexity to O(N 6 ), while maintaining coverage of most biologically encountered situations. G motif (m, n, p, q) is the G of the motif closed by base pair m-p and n-q. When n=m+1 and q=p-1, the motif is a helix extension, meaning these two base pairs are stacking neighbors, which can be also represented as G stack (m, n, p, q), when n>m+1 and q<p-1, the motif is an internal loop, and when n>m+1, q=p-1 or n=m+1, q<p-1, it is a bulge loop.
V internal/stackII (i, j, k, l) handles two structural variations in Dynalign II and is defined in terms of: 1) V internal/stackII1 (i, j, k, l) and V internal/stackII2 (i, j, k, l), which handle an internal loop aligned with a consecutive set of stacking base pairs, and 2) V internal/stackII3 (i, j, k, l) and V internal/stackII4 (i, j, k, l), which handle inserted stacking base pairs/internal loops/bulge loops. The recursions for these are defined as follows where a set of c or d consecutive base pairs in sequence 1 or 2, respectively, are aligned with an internal loop.
where the motif closed by base pair k-l and (k+c)-(l-d) or by i-j and (i+c)-(j-d) is inserted with the penalty term added.

V multibranch (i, j, k, l) considers two multiple branch loops(MBL) formed in the two sequences closed by base pair i-j and base pair k-l:
where a, b, c and d enumerate all the combinations of dangling ends on base pair i-j and k-l. The G sum associated with these dangling ends is ∆G°d angle (a, b, c, d), where an index of 1 indicates no dangling end and 2 indicates a dangling end. x is the number of unpaired nucleotides in the multibranch loop and y is the number of gaps, both created by the dangling ends. i' and k' are the nucleotides separating two fragments to guarantee at least two multibranch loops are formed in the two sequences.

V domain_insertion (i, j, k, l) considers two multibranch loops formed in two sequences closed by base pairs i-j and k-l with an extra domain inserted in one sequence:
The four terms consider the four possibilities for the location of the inserted domain, viz., the 3' side of sequence 2, the 3' side of sequence 1, the 5' side of sequence 1, or the 5' side of sequence 2. Recursions for these are given as: where W1 single (p, q) is the minimum ∆G° for the fragment [p-q] in sequence 1 that will be an inserted domain inside a multibranch loop (W2 single (m, n) is similarly defined for sequence 2). The indices a, b, c, d, x and y have the same meanings as in the calculation of V multibranch (i, j, k, l) . The parameters ∆G°d omain_opening and ∆G°d omain_elongation are the initiation and elongation G parameters, respectively, for the affine gap penalty associated with inserted domains.
where a, b, c and d indicate dangling ends and have the same meanings as in the V multibranch (i, j, k, l) calculation. W5(m, n) is the minimum sum of the G of fragments from the 5' end to nucleotide m in sequence 1 and from the 5' end to nucleotide n in sequence 2. W3(p, q) is the minimum sum of the Gof fragments from nucleotides p to N 1 in sequence 1and q to N 2 in sequence 2. W(i, j, k, l):

Detailed Recursions for
. 17 The four terms over which the minimum is evaluated consider four different structural conformations and recursions for these terms follow.

W extend (i, j, k, l) considers unpaired nucleotide addition to a smaller fragment:
where a, b, c and d enumerate all the combinations of adding unpaired nucleotides at the four ends of the two fragments, where 1 indicates an added nucleotide and 0 indicates no added nucleotide. |a-b|+|b-d| is the number of gaps generated, a+b+c+d is the number of unpaired nucleotides generated, both caused by the deletions/insertions this step models.

W branch (i, j, k, l) considers the formation of a helical branch:
where a, b, c and d enumerate all the combinations of dangling ends on base pairs (i+a)-(j-b) and (k+c)-(l-d). The G sum associated with these dangling ends is ∆G°d angle (a, b, c, d). x and y have the same meanings as in the V multibranch (i, j, k, l) calculation. penalty(i+a, j-b) and penalty(k+c, l-d) accounts for the penalty term for AU or GU pairs at the ends of helix.

W domain_insertion (i, j, k, l) considers a domain inserted in one of the sequences:
. 21 These four terms consider the four possibilities identical to those considered in the computation of V domain_insertion (i, j, k, l) : The rationale for these recursions is similar to that for the calculation of V domain_insertion (i, j, k, l). W5(i,k):

27
where a, b, c and d enumerate all the combinations of dangling ends on base pairs (i'+a)-(i-b) and (k'+c)-(k-d). The G sum associated with these dangling ends is ∆G°d angle (a, b, c, d). y is the number of gaps, created by the dangling ends.

W5 domain_insertion (i, k) considers an extra domain inserted in one sequence:
. 28 These two terms in the minimization consider different positions for the insertion of the extra domain. In the calculation of W5 domain_insertion (i, k), the extra domain can only be inserted at the 3' end of either sequence. Therefore there are only two possibilities to be considered: where WE1 single( m, n) is the minimum G of fragment from nucleotide m to n in sequence 1 with m and n being nucleotides in an exterior loop. WE2 single (m, n) corresponds to the analogous fragment in sequence 2.

Grid Search for Optimal Domain Insertion Parameters G domain_opening and G domain_elongation
A two-dimensional grid search for optimal G domain_opening and G domain_elongation was performed on 66 sequence pairs selected from twelve group I Intron IC1 subgroup sequences. The values for sensitivity and PPV over the grid of parameter values considered in the search are shown in the following tables.

Prediction Accuracy Evaluated Under an Exact Base Pair Matching Criterion
The statistics summarizing the accuracy of structure predictions provided in the main manuscript (and in the preceding section) allow one nucleotide index in base pairs to differ by +/-1 when comparing predictions with known secondary structures. This accounts for the fact that comparative analysis often cannot resolve the exact pair, and because pairs can be dynamic.
Here, corresponding statistics for prediction accuracy under an exact base pair matching requirement for the comparison between predictions and known structures are provided, i.e. a predicted base pair i-j is deemed correct if and only if the base pair i-j is in the known structure.  RNase P RNA 1×10 -6 6×10 -17