DCJ-RNA - double cut and join for RNA secondary structures

Background: Genome rearrangements are essential processes for evolution and are responsible for the existing variety of genome architectures. Many studies have sought an algorithm that identifies the minimum number of inversions necessary to transform one genome into another; for genomes represented as sequences, this can be done in polynomial time. Studies have not been conducted on the topic of rearranging a genome when it is represented as a secondary structure. Unlike sequences, the secondary structure preserves the functionality of the genome: sequences can differ and still share the same structure and, therefore, the same functionality.
Results: This paper proposes a double cut and join for RNA secondary structures (DCJ-RNA) algorithm. This algorithm allows for the description of evolutionary scenarios that are based on secondary structures rather than sequences. The main aim of this paper is to suggest an efficient algorithm that can help researchers compare two ribonucleic acid (RNA) secondary structures based on rearrangement operations. The results, which are based on real datasets, show that the algorithm is able to count the minimum number of rearrangement operations and to report an optimum scenario that increases the similarity between the two structures.
Conclusion: The algorithm calculates the distance between two structures and reports a scenario based on the minimum number of rearrangement operations required to make the given structure similar to the other. This can help identify the common functionalities between different species.


Background
DNA is the biological blueprint that a living organism must have to exist and remain functional. RNA holds the guidelines for this blueprint: it is responsible for transferring the genetic code from the nucleus to the ribosome to build proteins. An RNA molecule is written as a series of letters over the bases {A, C, G, U}. RNA's secondary structure is required to define the functionality of RNA molecules. In contrast to representing the genome as a sequence, representing it as a secondary structure provides more insight into the genome's function. In this paper, RNA's secondary structure is presented using a component-based representation, which was proposed in 2011 [1]. In contrast to similarity between gene orders, identifying the functional similarity between two structures has a greater impact on comparing species. Comparing two species based on their secondary structures provides more information and reveals more accurate evolutionary scenarios [2]. Such a comparison can also be combined with existing sequence-based algorithms to enhance their efficiency [3]. This helps create more accurate phylogenies [4].
The paper is organized as follows. First, the RNA secondary structure is presented using a component-based representation. We then describe the measures that are used to determine the similarity between components of the given structures. Genome rearrangement in terms of sequences, including its operations, sorting scenarios, and distance measures, is summarized. We then propose the DCJ-RNA rearrangement algorithm and explain it in detail. Two case studies using real data are presented, illustrating the detection and application of the proposed rearrangement operations for real RNA secondary structures. The results demonstrate that the proposed algorithm provides one evolutionary scenario that shows how to alter one structure to make it similar or identical to the other. Preliminary work has been presented as a poster [5].

RNA secondary structure component-based representation
Badr and Turcotte [1] propose a component-based representation that can be used to define interacting and non-interacting patterns for RNA secondary structures. A pattern P = {p_1, p_2, ..., p_m} is defined by its sub-patterns p_i, 1 ≤ i ≤ m. Each sub-pattern is defined by its length and its intermolecular (INTERM) and intramolecular (INTRAM) components; non-interacting patterns have no INTERM components. Each component is defined by its opening bracket (OB), closing bracket (CB), length, and relative location within the sub-patterns. In an INTERM component, OB and CB are located in two different sub-patterns, so at least two sub-patterns are required: OB is located in p_i, and CB is located in another sub-pattern p_j, where i < j ≤ m. OB and CB are defined by their lengths and locations relative to the beginning of p_i. Thus, INTERM = {OB, CB, j, len}. In an INTRAM component, OB and CB are located in the same sub-pattern p_i, where 1 ≤ i ≤ m, so at least one sub-pattern is required. OB and CB are both defined by their location and length. Therefore, INTRAM = {OB, CB, len}. Figure 1 shows an example of a non-interacting pattern.
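The component-based representation described above can be pictured as a small set of record types. The following Python sketch is illustrative only: the class and field names (`Intram`, `Interm`, `SubPattern`, `Pattern`, `validate`) are hypothetical and are not taken from [1]; only the sets INTRAM = {OB, CB, len} and INTERM = {OB, CB, j, len} and the constraint i < j ≤ m come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Intram:
    """INTRAM = {OB, CB, len}: both brackets lie in the same sub-pattern."""
    ob: int      # location of the opening bracket within the sub-pattern
    cb: int      # location of the closing bracket within the sub-pattern
    length: int  # len


@dataclass(frozen=True)
class Interm:
    """INTERM = {OB, CB, j, len}: the closing bracket lies in sub-pattern p_j."""
    ob: int
    cb: int
    j: int       # 1-based index of the sub-pattern that holds CB
    length: int


@dataclass
class SubPattern:
    length: int
    intram: List[Intram] = field(default_factory=list)
    interm: List[Interm] = field(default_factory=list)


@dataclass
class Pattern:
    sub_patterns: List[SubPattern]

    def is_interacting(self) -> bool:
        # A non-interacting pattern has no INTERM components at all.
        return any(sp.interm for sp in self.sub_patterns)

    def validate(self) -> None:
        # For an INTERM component opened in p_i, CB must sit in p_j with i < j <= m.
        m = len(self.sub_patterns)
        for i, sp in enumerate(self.sub_patterns, start=1):
            for comp in sp.interm:
                if not i < comp.j <= m:
                    raise ValueError(f"INTERM in p_{i} points to invalid p_{comp.j}")
```

For example, `Pattern([SubPattern(length=20, intram=[Intram(ob=2, cb=15, length=4)])])` would describe a non-interacting pattern with a single sub-pattern and a single INTRAM component; the numeric values are made up for illustration.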
Fig. 1 An example of a component-based representation

Similarities between two RNA secondary structures (alignment distance)
Badr and AlTurki [6] propose a similarity measure based on aligning two secondary structures that are presented using a component-based representation. The algorithm extracts the features of each component, which are OB, CB, and length. The similarity between two structures depends on the component's position, full length, and stem length. These measures are used in the new proposed algorithm. The equations that are applied to calculate the similarity between two components, a_i in structure A and b_j in structure B, d(f_ai, f_bj), can be found in [6]. The similarity measure between two components is used to fill the dynamic programming matrix using the method proposed by Needleman and Wunsch [7]. The alignment score between two structures is calculated using Eq. 1, while the percentage of similarity between two structures is calculated using Eq. 2 [6]. In Eq. 2, the score percentage of structures a and b is expressed relative to Max(a, b), where Max(a, b) = max{Score(a, a), Score(b, b)}.
RSmatch [8], which uses another alignment distance, is a tool for aligning RNA secondary structures and is also used for motif detection. Starting from structures determined with widely used RNA folding algorithms, it decomposes the secondary structure of RNA into a set of atomic structural components. These components are further organized using a tree model to capture the structural particularities. RSmatch can find the optimal global or local alignment between two RNA secondary structures using two scoring matrices, one for single-stranded regions and the other for double-stranded regions. Jiang et al. [9] define the alignment of trees as a measure of similarity between two secondary structures in tree representation.
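To make the alignment-based score more concrete, the following Python sketch aligns two lists of components with the Needleman-Wunsch recurrence and then normalizes the result by Max(a, b). Several parts are assumptions rather than the definitions of [6]: the component similarity `sim` is supplied by the caller (the toy one below compares component lengths only), the gap penalty `gap` is hypothetical, and the percentage is assumed to be 100 × Score(a, b) / Max(a, b).

```python
def nw_score(a, b, sim, gap=0.0):
    """Global alignment score of component lists a and b (Needleman-Wunsch)."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align a_i with b_j
                dp[i - 1][j] + gap,                          # a_i left unaligned
                dp[i][j - 1] + gap,                          # b_j left unaligned
            )
    return dp[m][n]


def score_percentage(a, b, sim, gap=0.0):
    """Assumed normalization: 100 * Score(a, b) / max(Score(a, a), Score(b, b))."""
    best = max(nw_score(a, a, sim, gap), nw_score(b, b, sim, gap))
    return 100.0 * nw_score(a, b, sim, gap) / best if best else 0.0


def length_similarity(x, y):
    """Toy similarity on component lengths only (hypothetical, not from [6])."""
    return 1.0 - abs(x - y) / max(x, y)
```

With components reduced to their lengths, `score_percentage([10, 20, 30], [10, 30], length_similarity)` aligns the two matching components and leaves the middle one unaligned, which gives a score of 2 out of a maximum of 3.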

Sequence-based genome rearrangements
Genomes can be modeled using permutations. Each gene occurs once in the genome and is assigned a unique number. A gene is modeled by a signed integer when the gene strand is known to biologists [10,11].

Rearrangement operations
Two genomes can have the same number of genes but may have different orders. A sequence of operations can be applied to change one genome into another. The most common rearrangement events or operations are as follows [12,13]:
Inversion - reverses the orientation of a gene (or a group of genes).
Transposition - changes the order of a gene (or a group of genes); in other words, a gene located at one index is moved to another index.
Gain - adds a gene (or a group of genes) to a genome.
Loss - removes a gene (or a group of genes) from a genome.
Duplication - duplicates a specific gene (or a group of genes) within a genome.

Distance measures
The distance between two genomes is the minimum number of events or operations that are required to transform one genome into the other. Yancopoulos et al. [14] first proposed the double cut and join (DCJ) operation. A DCJ operation consists of cutting a genome at two distinct positions and joining the four resulting open ends in a different way. Since a gene (e.g., a) has an orientation, its two ends, namely the extremities, can be distinguished and are denoted a_t (tail) and a_h (head). An adjacency in a genome is a pair of consecutive gene extremities in one of its chromosomes, and a telomere is the extremity of a gene that lies at the end of a linear chromosome.
The DCJ model builds on two elementary operations: cut, which splits an adjacency into two telomeres, and join, which connects two telomeres to form an adjacency. Any operation that consists of two cuts followed by two joins on the extremities is considered a DCJ operation [15]. DCJ allows for multichromosomal genomes with both circular and linear chromosomes.
DCJ distance can be easily calculated with the assistance of an adjacency graph, which is a bipartite multigraph in which each partition corresponds to the set of adjacencies and telomeres of one of the two input genomes. An edge connects two vertices, one from each genome, for every gene extremity that they share. In other words, a one-to-one correspondence exists between the set of edges in an adjacency graph and the set of gene extremities. Vertices have degree one or two. Therefore, an adjacency graph is a collection of paths and cycles. DCJ distance can be defined as follows: d_DCJ(G_1, G_2) = n - (c(G_1, G_2) + p(G_1, G_2)/2). In this equation, n is the number of genes, c(G_1, G_2) is the number of cycles, and p(G_1, G_2) is the number of odd paths in the adjacency graph.
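The adjacency-graph computation of the DCJ distance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: genomes are lists of linear chromosomes, each chromosome is a list of signed integers, both genomes contain the same genes, and the function names (`vertices`, `dcj_distance`) are hypothetical. The distance is taken as the number of genes minus the number of cycles minus half the number of odd paths.

```python
from collections import defaultdict


def vertices(genome):
    """Adjacencies and telomeres of a genome made of linear chromosomes."""
    verts = []
    for chrom in genome:
        if not chrom:
            continue
        ends = []
        for g in chrom:
            tail, head = (abs(g), "t"), (abs(g), "h")
            # A positive gene is read tail-to-head, a negative one head-to-tail.
            ends.append((tail, head) if g > 0 else (head, tail))
        verts.append((ends[0][0],))                 # left telomere
        for left, right in zip(ends, ends[1:]):
            verts.append((left[1], right[0]))       # adjacency
        verts.append((ends[-1][1],))                # right telomere
    return verts


def dcj_distance(genome_a, genome_b):
    va, vb = vertices(genome_a), vertices(genome_b)
    where_a = {x: i for i, v in enumerate(va) for x in v}
    where_b = {x: i for i, v in enumerate(vb) for x in v}
    if where_a.keys() != where_b.keys():
        raise ValueError("both genomes must contain the same genes")
    n = len(where_a) // 2                           # two extremities per gene

    # One edge per extremity, linking the vertex that holds it in each genome.
    graph = defaultdict(list)
    for x in where_a:
        a, b = ("A", where_a[x]), ("B", where_b[x])
        graph[a].append(b)
        graph[b].append(a)

    seen, cycles, odd_paths = set(), 0, 0
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, nodes, degree_sum = [start], 0, 0
        while stack:
            node = stack.pop()
            nodes += 1
            degree_sum += len(graph[node])
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        edges = degree_sum // 2
        if edges == nodes:        # every vertex has degree two: a cycle
            cycles += 1
        elif edges % 2 == 1:      # a path with an odd number of edges
            odd_paths += 1
    return n - cycles - odd_paths // 2
```

For instance, `dcj_distance([[1, -2, 3]], [[1, 2, 3]])` evaluates to 1 in this sketch, since the adjacency graph has one cycle and two odd paths for three genes, which corresponds to a single inversion of gene 2.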

Sorting scenario
One related issue is identifying a sorting scenario for the given distance, that is, the operations themselves. One or several possible sorting sequences may exist.
Bergeron et al. [11] provide an algorithm to obtain the DCJ operations in O(n) time (Algorithm 1). Mathematically, sorting using DCJ operations is simple. A DCJ operation takes up to two adjacencies or telomeres, cuts them, and creates new adjacencies or telomeres. There are several types of DCJ operation. A DCJ operation may create two adjacencies by cutting two adjacencies. It may also create an adjacency and a telomere by cutting an adjacency and removing a telomere. In addition, it can form two telomeres by cutting an adjacency. Finally, it may create an adjacency by removing two telomeres.
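The operation types listed above can be seen at work in a greedy sorting sketch. The Python code below follows the general greedy idea for DCJ sorting (create each adjacency of the target genome, then each of its telomeres) and is an illustration under assumptions, not a transcription of Algorithm 1: genomes are given directly as collections of adjacencies (two extremities) and telomeres (one extremity), extremities are plain strings such as "2h", and the names `greedy_dcj_sort` and `_find` are hypothetical. For clarity, the lookup is a linear scan, so this version does not run in O(n) time.

```python
def _find(vertex_list, extremity):
    """Return the adjacency or telomere that currently holds the extremity."""
    for v in vertex_list:
        if extremity in v:
            return v
    raise KeyError(extremity)


def greedy_dcj_sort(genome_a, genome_b):
    """Transform genome_a into genome_b; return (steps, final genome)."""
    current = [set(v) for v in genome_a]
    steps = []

    # Pass 1: create every adjacency {p, q} of the target genome.
    for target in genome_b:
        if len(target) != 2:
            continue
        p, q = sorted(target)
        u, v = _find(current, p), _find(current, q)
        if u is v:
            continue                      # adjacency already present
        rest = (u - {p}) | (v - {q})      # leftover ends are joined together
        current = [w for w in current if w is not u and w is not v]
        current.append({p, q})
        if rest:
            current.append(rest)
        steps.append((frozenset(u), frozenset(v)))

    # Pass 2: create every telomere {p} of the target genome.
    for target in genome_b:
        if len(target) != 1:
            continue
        (p,) = target
        u = _find(current, p)
        if len(u) == 2:                   # cut the adjacency into two telomeres
            current = [w for w in current if w is not u]
            current.extend([{p}, u - {p}])
            steps.append((frozenset(u),))

    return steps, {frozenset(v) for v in current}
```

Each entry of `steps` records the vertices that were cut by one DCJ operation, so the length of `steps` is the length of the scenario found for the two input genomes.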

Method: DCJ-RNA algorithm
The RNA component-based rearrangement algorithm uses a component-based representation [2] that allows for the unique description of any RNA pattern and shows the main features of the pattern efficiently. The proposed algorithm also uses the DCJ algorithm to describe rearrangement operations. DCJ models the classical operations (inversions, translocations, fissions, fusions, transpositions, and block interchanges) with a single operation and supports multichromosomal genomes. The DCJ-RNA algorithm (Algorithm 2) is described next.
The DCJ-RNA algorithm consists of three main steps.
Step 1 - Alignment of similar components based on their component lengths and stem lengths.
In this step, the similarity between components is calculated in terms of their component lengths and stem lengths, using the similarity measure defined in [6]. Similar components are assigned together, beginning with those with the greatest similarity. A matrix (m × n) is built whose entries are the component similarities in terms of component length and stem length. The rows represent the components of the first structure, and the columns represent the components of the second structure. We then search (greedily) for the maximum entry in the matrix. If it is greater than the threshold enhancement (ε), which is the minimum similarity score required between two components, the two components are assigned together, and the corresponding row and column are deleted. If the maximum similarity appears in more than one entry, the position similarity is compared between those components only, and the components with the greatest similarity in position are assigned together. Table 1 shows the matrix structure.
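The greedy assignment in step 1 can be sketched as follows. This Python fragment is a simplified illustration, not the authors' implementation: the similarity matrix is assumed to be precomputed, indices are 0-based, the threshold test is taken as greater than or equal to ε (as in the small-pair experiment), and ties are broken by the smallest index difference |i - j| as a stand-in for the position similarity of [6]. The name `greedy_assign` is hypothetical.

```python
def greedy_assign(similarity, epsilon):
    """Greedily pair rows (first structure) with columns (second structure).

    Returns (pairs, deleted_rows, inserted_cols) with 0-based indices.
    """
    rows = list(range(len(similarity)))
    cols = list(range(len(similarity[0]))) if similarity else []
    pairs = []
    while rows and cols:
        # Largest remaining entry; ties go to the pair closest in position.
        score, i, j = max(
            ((similarity[i][j], i, j) for i in rows for j in cols),
            key=lambda t: (t[0], -abs(t[1] - t[2])),
        )
        if score < epsilon:
            break                 # nothing left above the threshold
        pairs.append((i, j))
        rows.remove(i)            # delete the matched row ...
        cols.remove(j)            # ... and the matched column
    # Unmatched rows are components to delete, unmatched columns to insert.
    return pairs, rows, cols
```

With the made-up matrix `[[0.39, 0.2, 0.1], [0.3, 0.83, 0.4], [0.1, 0.5, 1.0]]` and ε = 0.5, the sketch pairs the entries 1.0 and 0.83 and then stops at 0.39, leaving the first row to be deleted and the first column to be inserted.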
Step 2 - Permutation generation
In this step, a corresponding permutation is generated for each of the two structures. This is completed by determining the components to be inserted or deleted, as well as the order of the similar components, using the alignment generated in step 1. A two-dimensional array of size 3 × (N_max + 1), where N_max is the maximum number of components in A or B, is constructed and identified as SortArray. The first row contains the desired structure, the second row contains the components deleted from the actual structure, and the third row contains the components inserted from the desired structure. Index zero of each row is reserved for a count: the number of components in the actual structure in the first row, the number of deleted components in the second row, and the number of inserted components in the third row. Table 2 shows the SortArray structure.
The component numbers are used to determine the permutations in the DCJ algorithm [16]. Two permutations are provided. The first is for the given or actual permutation, and the second permutation is for the desired one.
Each permutation has two chromosomes:
For the first permutation, the first chromosome is the actual structure of the components, and the second chromosome consists of the inserted components.
For the second permutation, the first chromosome is the desired structure, and the second chromosome consists of the deleted components.
Each permutation is represented by its adjacencies and telomeres. Finally, the DCJ algorithm is applied to the first and second permutations as input.
The DCJ algorithm [17] is modified so that it sorts only toward the first chromosome of the second permutation, which changes the first chromosome of the first permutation. The second chromosome of the second permutation consists of the deleted components, which do not need to be sorted.

Example
In order to clarify the steps of the algorithm, real RNA secondary structures from the Genomic tRNA Database [18] are used as examples. The first structure is for E. The measure weights are equal to one, and threshold enhancement (ε) is equal to 0.5.
Step 1 - Alignment of similar components based on their component lengths and stem lengths.
In this step, the similarity between components is calculated in terms of their component lengths and stem lengths. Similar components are assigned together, beginning with those with the greatest similarity (greedy).
In this example, the similarity between components is shown in the matrix in Table 3. First, the maximum value in the matrix is one. Here, d_1(a_3, b_3) and d_1(a_3, b_4) have the same value, so the components that are nearest in terms of their position (a_3 and b_3) are assigned together, and the row and column are removed. The same applies to d_1(a_5, b_3) and d_1(a_5, b_4). The maximum value, which is now 0.83, is searched for once again. Then, a_2 and b_2 are assigned, and the row and column are deleted. The next value is 0.39, which is less than the threshold enhancement (ε) value, so b_1 must be inserted and a_1 must be deleted. Finally, a_4 is deleted because no other components remain from the second structure.
Step 2 - Permutation generation
In this step, similar components are mapped according to the process outlined in the previous step. The inserted components and deleted components are then identified (Table 4).
Step 3 - Applying the DCJ algorithm
Each genome is represented by its adjacencies and telomeres so that the DCJ algorithm can be applied to the first and second permutations. Figure 3 shows the given structures following each rearrangement operation, as well as the similarity score with the original structure after applying each rearrangement operation. It also shows the final desired operation.
To demonstrate the effect of the DCJ-RNA on increasing the similarity between the structures, the CompPSA algorithm [6] is used to calculate the similarity between the structures before and after applying the algorithm. The similarity between the structures is 42% before applying any changes and increases to 94% after applying the DCJ-RNA algorithm (Fig. 4).

Results and discussion
To test and validate the DCJ-RNA algorithm, three experiments are conducted on three different datasets.

Datasets
There are three different datasets: the adjust dataset, the accuracy dataset, and the scalability dataset. In this section, each dataset is described in detail.

Adjust dataset
This dataset consists of three real RNA structures, named A, B, and C (Fig. 5), which were selected from the NCBI GenBank [16]. It is used to determine the best threshold enhancement (ε) value. There are two cases of RNA similarity. For the first case, dissimilar sequences with exactly or approximately similar structures, structures A and B are used. For the second case, dissimilar structures with exactly or approximately similar sequences, structures A and C are used.

Accuracy dataset
The accuracy dataset is used to calculate the performance and accuracy of the DCJ-RNA algorithm using different RNA structure sizes. This dataset consists of three pairs of RNA structures that are chosen from GenBank [19] and the Rfam database [20] and differ in size. The first pair consists of two small RNA structures, named D and E, as shown in Fig. 6. The second pair consists of two medium RNA structures, named F and G, as shown in Fig. 7. The third pair consists of two large RNA structures, named H and I, as shown in Fig. 8.

Scalability dataset
The scalability dataset is used to measure how the time and memory performance of the DCJ-RNA algorithm scales with different RNA structure sizes. This dataset consists of 11 RNA structures. The first is based on the first RNA structure, A, in the adjust dataset; the second structure is a duplicate of the first one, the third structure is a duplicate of the second one, and so on. The RNA structures' numbers, names, sizes, and numbers of components are shown in Table 5. The first six RNA structures (J, K, L, M, N, and O) are shown in Fig. 9.

Experiments
Three experiments are conducted: threshold adjustment, performance accuracy, and time and memory performance. The experiments use real and simulated data from [19].

Threshold adjustment experiment
Threshold adjustment experiments are conducted to determine the best threshold enhancement (ε) value that gives the minimum number of rearrangement operations to make the RNA structures exactly the same or approximately similar.

Experiment setup
The adjust dataset is used, and the fixed parameters are W_P equal to 0 and W_cl and W_sl equal to 1. Experiments are conducted for 11 values of the threshold enhancement (ε), from 0 to 1 in steps of 0.1.

Experiment results
We vary the threshold enhancement (ε) over the values 0.0, 0.1, 0.2, ..., 1.0 and obtain the results shown in Table 6 for both cases: similar structures with dissimilar sequences, and dissimilar structures with similar sequences. As illustrated in Table 7, when the threshold enhancement (ε) equals 1.0, the resulting RNA structures are identical, but the number of rearrangement operations is greater than for the other values. At the other extreme, when the threshold enhancement (ε) equals 0.0 and the desired structure has no more components than the given structure, only the order of the components is changed, and no components are added or deleted.
From the results, it can be seen that when the structures are similar, the best threshold enhancement (ε) equals 0.6, because both the resulting similarity between the structures and the number of rearrangement operations are reasonable; the structures after sorting for each threshold enhancement value are shown in Fig. 10. For the same reason, when the structures are dissimilar, the best threshold enhancement (ε) equals 0.8.

Performance accuracy experiment
The performance accuracy experiment is conducted to show the accuracy of the DCJ-RNA algorithm with different RNA sizes. To test the effect of the DCJ-RNA algorithm and calculate the similarity between structures, the CompPSA algorithm [6] is used.

Experiment setup
The accuracy dataset is used. Since all three pairs of RNA structures are similar in their structures and dissimilar in their sequences, the threshold enhancement (ε) is set to 0.6, and the fixed parameters are W_P equal to 0 and W_cl and W_sl equal to 1.

Experiment results
DCJ-RNA is applied to three pairs of RNA structures: small, medium, and large. Each experiment is discussed in detail in the following.

Small pairs of RNA structures
Step 1 - Alignment of Similar Components Based on Component Lengths and Stem Lengths
Calculate the similarity between components as shown in Table 8. Then assign similar components together whenever the similarity between them is greater than or equal to the threshold enhancement (ε), which is 0.6. Here, assign D_1 with E_1, D_3 with E_4, and D_2 with E_2, and add E_3.
Step 2 - Permutation Generation
Construct SortArray and fill it as shown in Table 9. After that, construct the permutations to apply the DCJ algorithm.
Step 3 - Apply the Double Cut and Join Algorithm
Construct the permutations to apply the DCJ algorithm. The first permutation is (chr_1 = {1,2,3} and chr_2 = {6}). (Note: a permutation is represented as a sequence of numbers. To differentiate between the components of the first structure and those of the second structure, the second structure's component i is represented as i + N, where N equals the number of components in the first structure.) The second permutation is (chr_1 = {1,2,6,3} and chr_2 = {}). Each genome is then represented by its adjacencies and telomeres so that the DCJ algorithm can be applied. The similarity between the given structures D and E is 58% before applying any changes, while it increases to 85% after applying the DCJ-RNA algorithm; see Fig. 11.
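The two permutations of this small-pair example can be reproduced from the step 1 assignment with a short Python sketch. The function name `build_permutations` and its argument layout are hypothetical; the numbering rule (component i of the second structure becomes i + N) and the chromosome roles are those stated in the text.

```python
def build_permutations(n_actual, n_desired, pairs):
    """Build the two permutations used as DCJ input.

    pairs maps a 1-based component index of the desired (second) structure
    to the 1-based index of its assigned component in the actual (first) one.
    """
    matched_actual = set(pairs.values())
    # Unassigned desired components are inserted and renumbered as i + N.
    inserted = [j + n_actual for j in range(1, n_desired + 1) if j not in pairs]
    # Unassigned actual components are deleted.
    deleted = [i for i in range(1, n_actual + 1) if i not in matched_actual]

    # First permutation: actual structure, plus a chromosome of inserted components.
    first = (list(range(1, n_actual + 1)), inserted)
    # Second permutation: desired order, plus a chromosome of deleted components.
    desired = [pairs[j] if j in pairs else j + n_actual for j in range(1, n_desired + 1)]
    second = (desired, deleted)
    return first, second


# Small pair: D has 3 components, E has 4; E_1-D_1, E_2-D_2, E_4-D_3, E_3 is added.
first, second = build_permutations(3, 4, {1: 1, 2: 2, 4: 3})
print(first)   # ([1, 2, 3], [6])
print(second)  # ([1, 2, 6, 3], [])
```

The printed chromosomes match the permutations given above for structures D and E: the added component E_3 is renumbered as 3 + 3 = 6 and appears between components 2 and 3 in the desired order.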

Medium pairs of RNA structures
Step 1 - Alignment of Similar Components Based on Component Lengths and Stem Lengths
Calculate the similarity between components as shown in Table 10. Then assign F_7 with G_6, F_6 with G_5, F_4 with G_3, F_3 with G_2, and F_5 with G_1; delete F_1 and F_2; and add G_4.
Step 2 - Permutation Generation
Construct SortArray and fill it as shown in Table 11. After that, construct the permutations to apply the DCJ algorithm.

Large pairs of RNA structures
Step 1 - Alignment of Similar Components Based on Component Lengths and Stem Lengths
Calculate the similarity between components as shown in Table 4.7. Then assign H_1 with I_2, H_2 with I_3, H_3 with I_4, H_4 with I_5, H_5 with I_6, H_6 with I_7, H_7 with I_8, H_8 with I_9, H_9 with I_10, H_10 with I_11, and H_11 with I_12, and insert I_1.
Step 2 - Permutation Generation
Construct SortArray and fill it as shown in Table 12. After that, construct the permutations to apply the DCJ algorithm.

Time & Memory performance experiment
The time and memory performance experiment is conducted to test the performance of the DCJ-RNA algorithm using different RNA structure sizes.

Experiment setup
The scalability dataset is used, and the fixed parameters are W_P equal to 0 and W_cl and W_sl equal to 1. The threshold enhancement (ε) is set to 0.6 since the structures are similar. The two structures in each experiment are identical, which means that the similarity between them is 100%.

Experiment results
Let N be the maximum number of components. The time complexity of step 1 is O(N log N) in the worst case: each time, the maximum value is searched for, and the row and column related to that maximum are discarded, so the next search is applied to (N-1) components, and so on. The time complexity of step 2 is O(N), since this step determines the inserted and deleted components by moving through the entries only once to fill SortArray, whose rows are all of size N. The time complexity of step 3 is O(N), since the DCJ algorithm is used. Therefore, the worst-case time for the entire algorithm is O(N log N). The space requirement for step 1 is O(N^2) when both structures have the same number of components. Step 2 requires O(3N) memory for SortArray, and step 3 requires O(2N). Hence, the total space requirement for the DCJ-RNA algorithm is O(N^2). Table 13 and Fig. 14 show the time and memory performance results obtained with the scalability dataset, which confirm the time performance analysis empirically.

Conclusion
The DCJ-RNA algorithm is proposed and is able to describe evolutionary scenarios that are based on rearrangements of secondary structures rather than sequences. The DCJ-RNA algorithm is optimal. Since RNA secondary structures reveal more functionality, this algorithm can help in comparing the functionality of structures. Real data is used to illustrate the details of the proposed algorithm. It demonstrates that the algorithm is able to detect the minimum number of rearrangement operations needed to make one structure more similar to the other. A rearrangement scenario increases the similarity between the first structure and any other structure. This creates an ideal framework for applying rearrangement operations to secondary structures rather than sequences. The algorithm is applied to non-interacting patterns only. Therefore, future work should extend the algorithm to consider interacting RNA patterns. In addition, we would like to explore other well-defined structures, such as chemical structures, and investigate the application of a similar approach that can define a scenario for changing one structure into another. Using the DCJ-RNA approach, we would also like to develop a tool that can help biologists compare RNA structures to folded RNA structures that are based on the corresponding RNA sequence. Such a tool, which is currently unavailable, would be ideal for biologists, as suggested at the RECOMB-CG conference in 2014.