Linear-Time Superbubble Identification Algorithm

DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological consequences. Whole-genome sequencing techniques currently produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats in order to filter them out; one such complex subgraph class is superbubbles. In this paper, we propose an O(n+m)-time algorithm to report all superbubbles in a directed acyclic graph with n nodes and m edges.


Introduction
Since the publication of the first draft of the human genome [1,2], the field of genomics has changed dramatically.Recent developments in sequencing technologies (see [3], for example) have made it possible to sequence new genomes at a fraction of the time and cost required only a few years ago.With applications including sequencing the genome of a new species, an individual within a population, RNA molecules from a particular sample, and using DNA sequencing as a readout platform in molecular biology techniques such as nuclear run-on assays, sequencing remains at the core of genomics.
Whole-genome sequencing creates masses of data (tens of gigabytes) in the form of short sequences (reads); genome assembly involves piecing together these reads to form a set of contiguous sequences (contigs) representing the DNA sequence in the sample.Traditional assembling algorithms rely on the overlaplayout-consensus approach [4], representing each read as a node in an overlap graph, and each detected overlap as an edge between the appropriate nodes.
These methods have proved their use through numerous de novo genome assemblies [5].
Recently, a fundamentally different approach based on de Bruijn graphs [6] was proposed in [7], where representation of data elements is organised around words of k nucleotides, or k-mers (not reads).In a de Bruijn graph, each node represents a series of overlapping k-mers.Adjacent k-mers overlap by k − 1 nucleotides.The marginal information contained by a k-mer is its last nucleotide.The sequence of those final nucleotides is called the sequence of the node.In a de Bruijn graph, fragment assembly is reduced to finding a trail or Eulerian path that visits each edge in the graph exactly once.
However, experimental errors and repeats significantly complicate the de Bruijn graph by adding false edges to it.Efficient and robust filtering methods were proposed to simplify the graph by filtering out motifs; such as tips, bubbles, and cross links, which proved to be caused by sequencing errors [8].Specifically, a bubble consists of multiple edges (with the same direction) between a pair of vertices commonly caused by a small number of errors in the centre of reads.Although these types of motifs are simple and can easily be identified and filtered out, more complicated motifs prove to be challenging.
Recently, a complex generalization of a bubble -a superbubble, was proposed as an important subgraph class for analyzing assembly graphs [9].A superbubble is defined as a minimal subgraph H in the de Bruijn graph with exactly one start node s and one end node t such that: (1) H is acyclic, (2) there is no edge from a node not in H going to a node in H\{s} and (3) there is no edge from a node in H\{t} going to a node not in H.It is clear that many superbubbles are formed as a result of sequencing errors, inexact repeats, diploid/polyploid genomes, or frequent mutations.Thus, efficient detection of superbubbles is essential for the application of genome assembly [9].
Onodera et al. gave an O(nm)-time algorithm to detect superbubbles, where n is the number of nodes and m is the number of edges in the graph [9].Very recently, an O(m log m)-time algorithm to solve this problem for a graph with m edges was proposed [10].Note that this time bound is dominated by the cost of computing superbubbles in a directed acyclic graph; the paper proposed a linear-time partitioning of the given graph to a set of directly acyclic subgraphs, in each of which superbubbles are then detected.In this paper, we propose an O(n+m)-time algorithm to compute all superbubbles in a directed acyclic graph.

Properties
The concept of superbubbles was introduced and formally defined in [9] as: Definition 1.Let G = (V, E) be a directed graph.For any ordered pair of distinct vertices s and t, s, t is called a superbubble if it satisfies the following: -reachability: t is reachable from s; -matching: the set of vertices reachable from s without passing through t is equal to the set of vertices from which t is reachable without passing through s; -acyclicity: the subgraph induced by U is acyclic, where U is the set of vertices satisfying the matching criterion; -minimality: no node in U other than t forms a pair with s that satisfies the conditions above; Nodes s and t, and U \{s, t} used in the above definition are the superbubbles entrance, exit and interior, respectively.
Next, we state a few properties of superbubbles which clarify the situation and motivate linear-time enumeration of superbubbles.
Note that Lemma 1 does not exclude the possibility that a node is the entrance of a superbubble and the exit of another superbubble.The lemma is proved in [9].

Lemma 2 ([10]
).Let G be a directed acyclic graph.We have the following two observations.1) Suppose (p, c) is an edge in G, where p has one child and c has one parent, then p, c is a superbubble in G.
2) For any superbubble s, t in G, there must exist some parent p of t such that p has exactly one child t.
Lemma 2 is proved in [10].In a similar way, we will prove the following: Lemma 3.For any superbubble s, t in a directed acyclic graph G, there must exist some child c of s such that c has exactly one parent s.
Proof.Assume that all the children of s have more than one parent.Then, there must be some cycle or some child c which has a parent that does not belong to the superbubble s, t .This is a contradiction.

Finding a Superbubble in a Directed Acyclic Graph
The main outcome of the paper is an algorithm, SuperBubble, that reports all of the superbubbles in a directed acyclic graph.The algorithm expects the given graph to have exactly one root (the node with in-degree 0) and one terminal (node with out-degree 0).The algorithm starts by topologically ordering the nodes of the graph and then identifying all the candidates of the possible entrances and exits of superbubbles according to Lemmas 2 and 3.The aim of this algorithm is accomplished with the help of algorithm ValidateSuperBubble, explained in the following section, which checks whether a given candidate s, t is a superbubble or not; if it is not, the algorithm returns a possible entrance for a superbubble that ends at t (if any).Algorithm SuperBubble relies on the topological sort of a directed acyclic graph G defined as follows: A topological ordering, ordD, of a directed acyclic graph G = (V, E) maps each node to a priority value such that ordD(x) < ordD(y) holds for all edges x → y ∈ E. There exist well-known linear-time algorithms for computing the topological order of a directed acyclic graph [11, ?].Here, we are expecting G to have one root and one terminal node.If G has more than one root then a node r ′ is added to V and an edge from r ′ to each existing root is added to E. The same can be done if G has more than one terminal, where a new terminal node t ′ is added to V and an edge from each existing terminal to t ′ is added to E. Furthermore, the algorithm can report only those superbubbles which do not start at r ′ or do not end at t ′ , with no extra cost.For example, a topological sort of graph G in Figure 1 is given in Figure 2. Note that a topological sort of a graph may not be unique.The algorithm also checks each node in V , in topological order, to identify whether it is an exit or an entrance candidate (or both).According to Lemmas 2 and 3, a node v is an exit candidate if it has at least one parent with exactly one child (out-degree 1) and an entrance candidate if it has at least one child with exactly one parent (in-degree 1).The size of the candidates-list is at most 2n, thus the cost of constructing a doubly-linked list of all the candidates is clearly linear on the size of G.The elements of the candidates-list are expected to be ordered according to ordD, and each candidate in the list should be labelled as an exit or an entrance candidate.Note that if node v is both an exit and an entrance candidate, then v appears twice in the candidates-list; first as an exit and then as an entrance candidate (see Figure 3).Additionally, each node v is preprocessed to point in O(1)-time to its nearest previous entrance candidate as follows: PreviousEntrance(v) returns an entrance candidate s such that ordD[s] ≥ ord[v], and ordD[s] is as large as possible.Note that in case v is an entrance candidate, PreviousEntrance(v) will return v itself.For example, PreviousEntrance(v 6 ) = v 5 , and PreviousEntrance(v 13 ) = v 13 .After the candidates-list of G has been computed, SuperBubble processes it in decreasing topological order (backwards ℓ be the list of candidates, SuperBubble examines the candidates in a decreasing order: j is an exit candidate, then the algorithm calls ReportSuperBubble to find and report the superbubble ending at v ′ i , that is superbubble v ′ i , v ′ j , for some entrance candidate v ′ i .Algorithm ReportSuperBubble also recursively finds and reports all the nested superbubbles between v ′ i and v ′ j . Note that ReportSuperBubble is called for each exit candidate in a decreasing order either by algorithm SuperBubble or through a recursive call to identify a nested superbubble.Each call to ReportSuperBubble(start, exit) checks the possible entrance candidates between start and exit starting with the nearest entrance candidate (to exit) in topological order.For the graph in Figure 1, the algorithm detects and reports five superbubbles v 8 , v 14 , v 3 , v 8 , v 5 , v 7 , v 11 , v 12 and v 1 , v 3 .Here both v 5 , v 7 and v 11 , v 12 are nested superbubbles.

SuperBubble(G)
1 Remark 2. It is also possible to design the algorithm so as to move forward in topological order instead of backwards.
In this section, we describe the ValidateSuperBubble subroutine.The ability to validate a candidate superbubble depends on the following result related to the Range Minimum Query problem: The Range Minimum Query problem, RMQ for short, is to preprocess a given sequence A[1..n] for subsequent queries of the form: "Given indices i, j, what is the minimum value of A[i..j]?The problem has been studied intensively for decades and several O(n), O(1) -RMQ data structures have been proposed, many of which depend on the equivalence between the range minimum query and the lowest common ancestor (LCA) problems [12][13][14].
In order to check whether a superbubble candidate s, t is a superbubble or not, we propose to utilize the range min/max query problem as follows: -For a given graph G = (V, E) and for each node v ∈ V , calculate two nodes u 1 and u 2 which are furthest (topologically) from v, such that u 1 and u 2 are a parent and a child of v respectively.
-For a given superbubble candidate s, t , where s and t are an entrance and an exit candidate respectively (satisfying Lemmas 1 & 2), if s, t is a superbubble then the following two conditions are valid For example, Figure 4 represents both OutP arent and OutChild arrays computed for graph G in Figure 1.Furthermore, a candidate v 5 , v 8 is not a superbubble as RangeMin(OutP arent, ordD It should be clear that after O(m + n) processing, validating a superbubble requires O(1) time which is the cost for range max/min query.Algorithm ValidateSuperBubble(start, end) is designed to return a suitable entrance candidate for a superbubble ending at endN ode as follows: ValidateSuperBubble(startNode,endNode)

Algorithm Analysis
In this section, we analyse the correctness and the running time requirement of our algorithms.
Proposition 1.For a given exit candidate e, let s be the first entrance candidate which has been returned by the subroutineValidateSuperBubble(s, e).Then s, e satisfies the minimality criterion for the superbubble starting at s and ending at t.In the first case, ordD[e] = ordD[e ′ ] + 1 implies e is the only child of e ′ which not only makes e ′ a start candidate but also e ′ , e a superbubble by definition.If that had been the case, ValidateSuperBubble would have been called for (e ′ , e) before (s, e), and e ′ would have been the first entrance candidate returned (not s).
For the second case, each node v, such that ordD[e ′ ] < ordD[v] < ordD[e], is reachable from s through e ′ .This is because s, e is a valid superbubble and it is also true for all the children of e ′ .However, at least one child of e ′ has exactly one parent, otherwise there must be some cycle .Thus, the node e ′ is also a start candidate but also e ′ , e a superbubble by definition.If that had been the case, ValidateSuperBubble would have been called for (e ′ , e) before (s, e), and e ′ would have been the first entrance candidate returned (not s).
In both situations, s cannot be the nearest start candidate to the left of e returned by ValidateSuperBubble(s, e) which completes the proof.
⊓ ⊔ Proposition 2. Given s and t, the candidates for an entrance and an exit of a superbubble in G, respectively, ValidateSuperBubble subroutine correctly checks whether s, t is a superbubble or not.
Proof.Let start and end be two integers such that ordD[s] = start and ordD[t] = end.The directed acyclic graph G = (V, E) as defined, has a single root r and a single terminal r ′ ; this implies that any node v ∈ V is reachable from r and, at the same time, can reach r ′ .This is also true for s, t and any node v such that ordD First, we will show that t is reachable from s. Recall that t is an exit candidate, so, it has a parent p with out-degree = 1.Assume that t is not reachable from s, then there must be a path from r t which does not involve s.This implies that either OutP arent[end] < start or there exists a node v such that start < ordD[v] < end, OutP arent[v] < start and there exists a path r v t which is a contradiction.Similarly, we can show that s can reach t.Consequently, any node v such that start < ordD[v] < end satisfies the matching criterion of the superbubble.
The acyclicity criterion is guaranteed by acyclicity of G and the minimality is satisfied by the design of Algorithm ReportSuperbubble which assigns each exit of a superbubble to the nearest entrance and by the correctness of Proposition 1. Proof.We consider a run of SuperBubble.Let s 1 , t 1 , s 2 , t 2 , • • • , s k , t k be the successive superbubbles reported just after the execution of Line 15 of ReportSuperBubble, where ordD(t 1 ) > ordD(t 2 ) > ... > ordD(t k ).First, we show that each s i , t j reported by the algorithm in Line 15 is a superbubble.This is proved by the correctness of Proposition 2. Also, no superbubble is missed out by the algorithm as all exit candidates are considered exactly once and later deleted from the candidates-list.Secondly, we have already explained that the running time cost of the topological sorting of G and computing the candidates-list is O(n+ m).Furthermore, all the list's operations cost constant time each, and sum up to a linear cost O(n), as there are at most 2n candidates in the list.Finally, the total number of times ValidateSuperBubble subroutine is called is O(m).This is achieved by marking each entrance candidate (Line 11) to avoid repeated checks.Each call for ValidateSuperBubble costs O(1).
Therefore, for a directed acyclic graph G = (V, E), the total running time for reporting all superbubbles is O(n + m), where n = |V | and m = |E|.⊓ ⊔

Final Remarks
We presented an O(m + n)-time algorithm to compute all superbubbles in a directed acyclic graph, where n is the number of nodes and m is the number of edges.Our next goal is to practically evaluate our algorithm and compare its behaviour with an earlier result [9].It would also be interesting to investigate other superbubble-like structures in assembly graphs, such as complex bulges [15].

⊓ ⊔ Theorem 1 .
Given a directed acyclic graph G = (V, E), SuperBubble detects and reports all superbubbles in G in decreasing order of the topological ordering of their exit nodes in O(m + n) time, where n = |V | and m = |E|.
for each node v in topological order do We use mark to keep track of the entrance candidates which we have already checked prior to the current exit position being considered.It allows us to avoid checking the same path of entrance candidates repeatedly.