A layout framework for genome-wide multiple sequence alignment graphs

Sequence alignments are often used to analyze genomic data. However, such alignments are often only calculated and compared on small sequence intervals for analysis purposes. When comparing longer sequences, these are usually divided into shorter sequence intervals for better alignment results. This usually means that the order context of the original sequence is lost. To prevent this, it is possible to use a graph structure to represent the order of the original sequence on the alignment blocks. The visualization of these graph structures can provide insights into the structural variations of genomes in a semi-global context. In this paper, we propose a new graph drawing framework for representing gMSA data. We produce a hierarchical graph layout that supports the comparative analysis of genomes. Based on a reference, the differences and similarities of the different genome orders are visualized. In this work, we present a complete graph drawing framework for gMSA graphs together with the respective algorithms for each of the steps. Additionally, we provide a prototype and an example data set for analyzing gMSA graphs. Based on this data set, we demonstrate the functionalities of the framework using two examples.


Example Cycle Removal
The working of our cycle removal algorithm for gMSA graphs is illustrated for the example shown in the paper.Therefore, the results for the sets V p and E DAG as well as for the mappings σ, τ , and ϵ are shown after each vertex sequence was added.The abbreviations "GS" "CS1" and "CS2" denote the guide sequence, the comparative sequence 1, and the comparative sequence 2, respectively.

Adding the Comparative Sequence 1
After processing the comparative sequence 1 (CS1), we obtain the following results:

Adding the Comparative Sequence 2
After processing the comparative sequence 2 (CS2), we obtain the final results (the changes are highlighted in green):

Distinction from Ear Decomposition
Figure S2.In this example we will show that there is not always an ear decomposition for our graphs.
(A) The vertex sequences of the selected sequences.This is the data basis for the graph created in this example.(B) The graph representation of the data from (A). (C) The directed graph of (B) without multiple edges between the same vertices on which the ear decomposition is to be applied.The first path in the ear decomposition must be a cycle.Therefore, the first ear can only be P 0 = {v 5 , v 6 , v 5 }, because there is only this one trivial cycle in the graph.Moreover, the graph cannot be decomposed into more ears and therefore an ear decomposition is not possible on this graph.
Although the algorithm described is a decomposition into individual paths, it is not an ear decomposition.The definitions we use are based on Bang-Jensen and Gutin (2002) using the term "edge" instead of the term "arc".An ear decomposition is used to decompose a directed multi-graph D = (V, E) with at least 2 vertices into a sequence of t ears E = {P 0 , P 1 , P 2 , ..., P t }.Thereby the first ear P 0 is a cycle and each P i is a path or a cycle with the following three characteristics: • P i and P j are edge-disjoint for i ̸ = j.
• For each i = 1, .., t: If P i is a cycle, then it has exactly one vertex with V (D i−1 ) in common.Otherwise the start-and the end-vertex of P i are distinct vertices of V (D i−1 ) and no other vertex of P i belongs to V (D i−1 ).Note: Here, D i is a sub-graph of D with all vertices of the ears P 0 to P i : V (P j ) • The edges of all ears P 0 to P t form the set of edges of D: E = t j=0 E(P j ) A directed multi-graph D with at least two vertices is strongly connected if and only if it has an ear decomposition.If D is strongly connected then every cycle in D can be used as start-cycle/-ear P 0 for the ear decomposition.
However, in our case the graph does not need to be strongly connected.In fact, our decomposition can also applied to graphs without any cycles at all.This is shown using an example in Figure S2.
Moreover, the aim of the two decompositions is totally different.We want to remove cycles and thereby prepare the graph for the layering step in such a way that our requirements are fulfilled, while the ear decomposition is used to check for graphs being strongly connected.

Complexity Analysis
Let v = |V | denote the number of vertices and let s = |S| be the number of vertex sequences in the gMSA graph.The overall time complexity of our newly developed cycle removal algorithm is O(s • v 3 ).The Tables S1 & S2 show the details of our analysis.The function updateConnectionF unctions has a time complexity of O(v 2 ) and dominates the time complexity of the Algorithms 2-4.Together with the surrounding loop, this gives the overall time complexity of O(v 3 ) of each of these algorithms.

LAYER ASSIGNMENT
The longest-path algorithm assigns all vertices without outgoing edges (sinks) to the layer 1.If all immediate successors of a vertex have been assigned to a layer then that vertex is assigned the lowest of the layers of its immediate successors plus one (Healy and Nikolov, 2013).Thus, gradually all nodes are assigned to one level.Algorithm 5 shows a possible implementation of the longest path algorithm (Healy and Nikolov, 2013).U is the set of all vertices already assigned to a layer and Z is the set of all vertices assigned to a layer below the current layer.A new vertex v to be assigned to the current layer is picked among the vertices which have not been already assigned to a layer, i.e., v ∈ V \ U , and which have all their immediate successors already assigned to the layers below the current one, i.e., N + (v) ⊆ Z (Healy and Nikolov, 2013).
Algorithm 5: longestPath Healy and Nikolov ( 2013) This algorithm has a linear time complexity (Mehlhorn, 1984) and is simple to implement.Further, the resulting layering has a minimum height (Healy and Nikolov, 2013).
The following formal explanations are based on Bachmaier et al. (2010) be the incoming neighbors and the outgoing neighbors of vertex v ∈ V DAG , respectively.In an ordered properly layered graph, the sets N + ( . ) and N − ( . ) are sorted according to the order of the vertices in the respective layers.The layers on which the vertices of the block B are placed are retrieved by layers(B).The method for global sifting is described by Algorithm 6.During r sifting rounds, all block sets are processed successively (Lines 3-5).The current sifting block-set BS is processed by the function sif tingStep (Line 5; described in detail in Section 4.1.2,Algorithm 7).After the r sifting rounds, the sequence of the block-sets is returned that has a heuristically calculated minimum of edge crossings while considering our requirements.

Sifting step
Algorithm 7: siftingStep Input: Ordered list of all block-sets BS of G DAG , block-set BS ∈ BS to sift Result: Ordered list of all block-sets BS with updated odering The sifting step proposed by Bachmaier et al. (2010) (Algorithm 3) was adapted for being able to handle block-sets.During the sifting step (Algorithm 7), the optimal position-the position with the minimum number of edge crossings-of a sifting block-set BS within its block-set list BS is determined.The updated block-set list BS ′ with the sifting block-set BS placed at its optimal position is returned.
Before sifting, the variables are initialized (Lines 2-5).The sifting block-set BS is placed at the first position of the temporary block-set list BS ′ (Line 2).Using the function sortAdjacencies (Line 3; described in detail in Section 4.1.3,Algorithm 8), the vertical positions of the block-sets and the adjacency data structures of the graph are initialized.Then, the variables x (current number of edge crossings), x * (best number of edge crossings), and p * (best block-set position) are initialized (Lines 4 and 5).Since the edge crossing calculation is performed relative to the start position of the sifting block-set BS, the current number of edge crossings is always set to zero at the beginning, regardless of the absolute number of crossings.
The sifting block-set BS is consecutively swapped with all other block-sets from the block-set list BS by the function sif tingSwap (Line 7; described in detail in Section 4.1.4,Algorithm 9).Thereby, the change in the number of crossings x is calculated and updated.Moreover, the best number of edge crossings x * and the best block-set position p * are updated, if a smaller number of edge crossings was achieved using the current position of BS (Lines 8 and 9).2010) for accelerating the sifting swap (Section 4.1.4).Again, we extended this function such that it handles block-sets.The index arrays are created by the function sortAdjacencies (Algorithm 8) as part of the preparation step of the function sif tingStep (Algorithm 7).The position of a block B depends on its position within the block-set BS and the position of the block-set BS in the block-set list BS.Therefore, the block-sets are treated in the order given by the block-set list BS (Line 3, Algorithm 8) and the blocks are treated in the order given by the current block-set (Line 4, Algorithm 8).
First, a unique vertical position is set for every block B by setting the position of block B to the current position pos (Line 5), the adjacency data structures are cleared (Line 6), and the current position pos is increased by 1 ( Line 7).
Then, the adjacency data structures of the graph are initialized (Lines 11-22) similarly to the approach of Bachmaier et al. (2010), the only difference being the additional outer loop taking block-sets (Line 9) in addition to blocks (Line 10) into account.
Figure S3.The blue rectangles in the graphs represent vertices and the black circles represent dummy vertices.The black frames around single vertices or sequences of dummy vertices illustrate blocks.Each block has a unique vertical position and is associated with a block-set.In this figure, there are two block-sets that are illustrated by the green and the red rectangles.The vertical swap of the green block-set with the red block-set and the corresponding crossing calculation is illustrated in two steps.In doing so, each block of the green block-set is swapped with each block of the red block-set corresponding to their order in the block-sets.The blocks in a block-set are sorted by their lowest layer.The green block-set is the sifting block-set BS s and the block order is processed backwards.The red block-set is the next block-set BS n and the block order is processed forward.Therefore, the last block of the green block-set is first swapped with the first block of the red block-set.The crossing number is determined relative to the initial situation, where the count is set to zero.The intermediate step shows the situation after swapping the green block-set containing the dummy vertices first with the red block-set containing v 2 and then with the red block-set containing v 3 .As one intersection has been replaced by another by these swaps, the relative crossing number remains zero.After swapping the second green block-set containing v 1 first with the red block-set containing v 2 and then with the red block-set containing v 3 , the crossing number is minus two, i.e., the number of edge crossings was reduced.After positioning the block-sets, the vertical space of the block-sets was minimized and all block-sets are as close as possible to the guide sequence.At this point, some vertical positions of the block-sets might lead to a layout that does not reflect the connection of vertices.Such a case is shown in Figure S4 (B), where the dark gray block-set is close to the guide sequence, but not connected to any of its vertices.On the other hand, it is connected to the uppermost block-set, but separated from it.

Optimization for Special Cases
This special case can be improved in a post-processing step.It is only discussed for the block-sets above the guide sequence as the situation for block-sets below the guide sequence is symmetrical and thus can be handled accordingly.In this particular special case, the following two conditions must both be met, to move a block-set BS one position up: 1.The vertical positions of all block-sets connected to the block-set BS are above the vertical position of the block-set BS.
2. One vertical position above the block-set BS, none of the horizontal positions (layers) of the block-set BS may be occupied by another block-set.
The vertical positions are processed from the uppermost position to the position above the guide sequence.If there is a block-set BS at the current vertical position which meets both conditions, this block-set BS is moved up until at least one of the two conditions is no longer fulfilled.This procedure allows us to move nested special cases of this type vertically upwards one after the other.In the example of the dark highlighted block-set from Figure S4 (B) & (C), the two conditions are met.Condition 1 & 2 allows shifting the block-set at most one positions vertically upwards.
As there are countless possibilities for such improvements and detecting some of them is very complex, only one simple and important case is improved (Figure S4 (B) & (C)).

Complexity Analysis
Let v = |V DAG | denote the number of vertices in the DAG and let bs = |BS| be the number of block-sets in the DAG.To put this in perspective, it should be noted that bs ≤ v but usually bs is much smaller than v.The complexity is analyzed in the order in which the algorithm is described in the paper.
To find the guide sequence in the block-set list BS the complexity is O(bs) since in worst case every block-set has to be checked.Every vertex of a block-set has to be processed to stack the block-set at the next free vertical position.This is done for every block-set in BS and every vertex is unambiguously assigned to a block-set, therefore the complexity for this step is O(v).
For the optimization of the special case, every block-set in BS is processed by checking the two conditions described in Section 5.1.To check Condition 1, all connected block-sets of the current block-set have to be processed which leads to a complexity of O(bs).To check Condition 2 (for one block-set) the complexity is O(v • bs) because in the worst case every block-set has a unique vertical position that must be checked (except the one of the guide sequence and the current block-set) by processing all vertices of this block-set.Therefore, the overall complexity for the optimization of the special case is O(v • bs 2 ).
The overall time complexity of our newly developed vertical position algorithm is also O(v • bs 2 ) since the complexity of the special case handling dominates this algorithm.

EDGE ROUTING
Figure S5.The blue boxes represent the vertices, whereby every vertex consists of 3 areas: the up-, the straight-, and the down-space.The dotted line in the straight-edge space of (A) illustrates the vertical center of the vertex.The gray boxes represent the edge positions."FS" is the abbreviation for free space and represents the amount of free space between two edges."BS" stands for border space and represents the amount of free space between the border of a vertex and the vertical line of an edge.

Preprocessing
During the preprocessing step, first, for each layer, the vertices are sorted according to their vertical position.Then, the right and the left sides of every vertex are processed consecutively.All edges starting or ending at the right side of a vertex are categorized according to their vertical orientation to create an up-, a straight-, and a down-edge list.In the straight-edge list, there can be at most one edge in forward direction and one edge in backward direction.The edges of the up-and down-edge lists are sorted by their vertical difference.Thus, the edges in two different directions between the same vertices, which are associated to the same edge in the DAG, will be next to each other.Simultaneously in this step, the maximally needed space for the straight-edges (independent of being needed on the right or on the left border of the vertex) is calculated, to reserve this space for the straight-edges in the middle of every vertex, only (Figure S5 (A)).Therefore, the vertical space needed of the potentially existing edge in forward direction is summed with the vertical space needed of the potentially existing edge in backward direction and the corresponding free space between.The left side of a vertex is processed symmetrically.

Calculating relative edge-vertex connection coordinates
Calculating the relative edge-vertex connection coordinates is performed for each vertex.We describe this calculation only for the right side of a vertex, since the left side is processed symmetrically.

Vertices without Dummy Vertices
Only those vertices that are not dummy vertices (v ∈ V \ V D ) are processed.First, the edges from the up-edge list are placed at the up space of the vertex.Those edges are placed successively according to their vertical differences, where the edge with the smallest difference is placed closest to the straight space and the edge with the highest difference closest to the top of the vertex (Figure S5 (A)).For edges in two different directions between the same vertices (associated to the same edge in the DAG), the edge in forward direction is always placed above the edge in backward direction.A certain amount of free space "FS" separates the edges from each other.Afterwards, the straight-edges are placed in the straight-space, where the vertical center of the required straight-space is placed on the vertical center of the vertex.Finally, the edges from the down-edge list are placed at the down-space of the vertex.Like the up-edges, the down-edges are placed successively according to their vertical difference, where the edge with the smallest difference is placed closest to the straight-space and the edge with the highest difference closest to the bottom of the vertex (Figure S5 (A)).Again, the edge in forward direction is always placed above the edge in backward direction, if these edges are associated to the same edge in the DAG.
For two horizontally adjacent vertices on the same y-position having both up-or down-edges those edges in the same direction might (partially) overlap in their horizontal lines.To avoid this, first the edges from the left vertex and afterwards the edges from the right vertex are positioned at the respective vertices.The space used for the edges of the left vertex is blocked and can not be used at the right vertex (Figure S5 (C)).
Simultaneously, the height needed for every vertex (needed vertex height, independent whether needed on the right or the left border of the vertex) is calculated as being twice the maximally needed up-or down-space of the vertex (whichever is larger) plus the maximally needed straight-space.This information will be used during the fourth step (Section 7.1.4).

Dummy Vertices
Since the dummy vertices are not drawn in the final layout, dummy edges are handled separately for calculating the relative edge-vertex connection coordinates.This is done in order to avoid vertical differences between the left and the right edge of the dummy vertex, which would produce an offset in the long edge drawn.All connection points between the dummy vertices and the dummy edges are processed like straight edges.This leads to the edges touching each other and thus, every long edge looks like one continuous edge in the final drawing, although it consists of several parts.
There is one special case for which this procedure does not work.When two dummy vertices are on the same vertical coordinate adjacent to each other and the vertical orientation of the connected edges of both of the dummy vertices are up, the edges may overlap (Figure S6 (A)).To avoid overlapping edges in this special case, the connected edges of one dummy vertex are shifted a bit downwards (Figure S6 (B)).In other constellations with dummy vertices, overlapping edges are avoided by other mechanisms of this algorithm (e.g., Figure S6 (C)).

Horizontal edge layering in the inter layer space
For determining the horizontal position of the vertical lines of the edges in the inter layer space, only the edges on the right side of the left layer are considered.The two edge lists lef t and right are necessary for this step.The vertices of a layer are processed top down.For each vertex, its up-edge list E up and its downedge list E down are used.The up-edge list is sorted in descending order from largest to smallest vertical difference.The down-edge list is sorted in ascending order from smallest to largest vertical difference.For every vertex v the two lists lef t and right are filled in the following way:

Figure S4 .
Figure S4.The blue rectangles in the graphs represent the vertices, while the gray rectangles represent block-sets.GS is the abbreviation for guide sequence.(A) Each block-set has a unique vertical position which represents the index in the block-set list BS (initial situation after vertex ordering.(B) The vertical positions of the block-sets after the compaction.The block-sets are as close as possible to the guide sequence.The vertical position of the darker block-set is a special aesthetically unpleasing case that can be improved by the post-processing step.(C) The vertical position of the darker block-set was improved in a post-processing step by shifting the block-set one position upwards.Thus, this block-set is now closer to the block-sets it is connected to, vertical long edges are avoided, and the readability is improved.
FigureS5.The blue boxes represent the vertices, whereby every vertex consists of 3 areas: the up-, the straight-, and the down-space.The dotted line in the straight-edge space of (A) illustrates the vertical center of the vertex.The gray boxes represent the edge positions."FS" is the abbreviation for free space and represents the amount of free space between two edges."BS" stands for border space and represents the amount of free space between the border of a vertex and the vertical line of an edge.(A) A detailed view of placing edges at the right border of a vertex is shown.(B) The final edge layout between two layers created by our approach is shown.One of the vertices is shown in (A) in detail.(C) The detailed view of the down-space of two opposite vertices from (B) is shown.To avoid (partial) overlaps, first the down-edges from the left vertex and afterwards the down-edges from the right vertex are placed.The space used for the down-edges of the left vertex is blocked and can not be used at the right vertex.(D) The detailed view of the horizontal layering of the vertical lines of all edges from (B) in the inter layer space is shown.Every vertical line of an edge has a unique horizontal position in the layout to avoid additional intersections.

Figure S6 .
Figure S6.The blue boxes represent vertices and the gray dots represent the position of the dummy vertices.Dummy vertices are not drawn but are only important for the position of the long edges.(A) In the case, that two dummy vertices are on the same vertical coordinate adjacent to each other and the vertical orientation of the connected edges of both of the dummy vertices are up, the edges may overlap.(B) To avoid such overlapping, the connected edges of one dummy vertex are shifted a bit down.(C) In the case, that two dummy vertices are on the same vertical coordinate adjacent to each other and the vertical orientation of the connected edges of both of the dummy vertices are down, an overlapping of edges is not possible.