Linking indexing data structures to de Bruijn graphs: Construction and update

DNA sequencing technologies have tremendously increased their throughput, and hence complicated DNA assembly. Numerous assembly programs use de Bruijn graphs (dBG) built from short reads to merge these into contigs, which represent putative DNA segments. In a dBG of order k, nodes are substrings of length k of reads (or k-mers), while arcs are their (k + 1)-mers. As analysing reads often requires indexing all their substrings, it is interesting to exhibit algorithms that directly build a dBG from a pre-existing index, and especially a contracted dBG, where non-branching paths are condensed into single nodes. Here, we exhibit linear time algorithms for constructing the full or contracted dBGs from suffix trees, suffix arrays, and truncated suffix trees. With the latter, the construction uses space that is linear in the size of the dBG. Finally, we also provide algorithms to dynamically update the order of the graph without reconstructing it.


Introduction
In life sciences, determining the sequence of bio-molecules is an essential step towards the understanding of their functions and interactions within an organism. Powerful sequencing technologies produce huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (dBG) dedicated to genome assembly, a version which differs from the combinatorial structure invented by N.G. de Bruijn [1]. Given a set S = {s_1, ..., s_n} of n reads and an integer k, an assembly de Bruijn Graph, or for short simply de Bruijn Graph, stores each k-mer (k-long substring) occurring in the reads as a node and has an arc joining two k-mers if they appear as successive (and hence overlapping) k-mers in at least one read.
The dBG is then traversed to extract long paths, which will form the contigs, i.e., the sequence of sub-regions of the molecule. In non-repetitive regions, the layout of the reads dictates a simple path of k-mers without bifurcations. Any simple path between an in-branching node and the next out-branching node can then be contracted into a single arc without losing any information on the graph structure. The sequences of such simple paths are called unitigs (a contraction of unique and contigs). The version of the dBG where each such "non-branching" path is condensed into a single arc is termed the Contracted dBG (CdBG).
Sequencing technologies of the second generation can yield hundreds of millions of reads. Compared to the overlap graph or to the string graph, which were used with previous technologies, the dBG has a number of nodes that is not proportional to the number of reads: it depends on a user-controlled parameter k, termed the order of the dBG. Its memory usage can be fine-tuned through this parameter.
In bioinformatics, dBGs are heavily exploited for genome assembly [2], but for other purposes as well. Some programs mine the dBG to seek graph patterns representing mutations, large insertions/deletions, or chromosomal rearrangements [3]. Others use it to correct sequencing errors in long reads [4].
The de Bruijn Graph is usually built directly from the set of reads, which is time and space consuming. Several compact data structures for storing dBGs have been developed [5,6], including probabilistic ones [7]. The emphasis is placed on the practical space needed to store the dBGs in memory. Moreover, some recent assembly algorithms put forward the advantage of using, for the same input, multiple dBGs with increasing orders [8], thereby emphasising the need for dynamically updating the dBG. In all cases, the construction algorithms need to scan through the whole set of reads.
Several genome assembly programs use hash tables to store the k-mers of the reads and allow navigating through the arcs of the dBG, but these solutions suffer from several limitations regarding, e.g., functionalities and flexibility. With hash functions, it is often not possible to add extra information to the nodes, for instance the number of times a k-mer is observed in the read set, which is used as a confidence measure. Hash tables make it difficult to compute the contracted dBG or to change the value of k. The main advantage of sophisticated hash functions is their memory footprint. For instance, Minia [9] offers a very space-efficient storage to handle the dBG based on cascading Bloom filters, which are probabilistic structures built on hash functions. This hash-based solution was used for long read error correction and also proves efficient in that context [4].
In studies involving the analysis of sequencing reads, distinct tasks require indexing either all substrings or the k-mers of the reads. For instance, fast dBG assembly programs first count the k-mers before building the dBG to estimate the memory needed [7]. As another example, some error correction software builds a suffix tree of all short reads to correct them [10]. Hence, before the assembly starts, the read set has already been scanned through and indexed. It can thus be efficient to enable the construction of the dBG for the subsequent assembly directly from the index rather than from scratch. For these reasons, we set out to find algorithms that transform usual indexes into a dBG or a contracted dBG. It is also of theoretical interest to build bridges between well-studied indexes and this graph on words. Despite recent results [11,12], formal methods for constructing dBGs from suffix trees remain an open question. In comparison, Simpson and Durbin have proposed an algorithm to build the String Graph from an FM-index [13].
Here, we present algorithms to build the CdBG directly from a Generalised Suffix Tree or from a Generalised Suffix Array of the reads [14][15][16][17]. These algorithms take space and time that are linear in the input size. These well-known data structures index all substrings of the reads, and not only their k-mers. This results in one drawback and in one advantage.
The drawback is their space occupancy. We therefore consider an indexing data structure that reduces the set of indexed substrings: the truncated suffix tree [18,19]. We introduce the reduced truncated suffix tree (TST) and then show how to construct with this index both the dBG and the CdBG in time and space that are linear in the size of the final dBG, rather than in the cumulated length of the reads. By size of the dBG we mean the number of nodes plus the number of arcs. This algorithm achieves an optimal time and space complexity.
The advantage is the counterpart: as substrings of all lengths are indexed, it becomes possible to update the order of the graph, that is, to change dynamically the value of k without reconstructing the dBG. We provide efficient algorithms for increasing or decreasing the value of k. Of course, if one uses the truncated suffix tree instead of the full suffix tree, only some updates remain possible. Our results nevertheless remain applicable to the truncated suffix tree, where the order can be dynamically decreased.
This article includes results that appeared in [20,21].

Indexing data structures
Suffix trees are well-known indexing data structures that make it possible to store and retrieve all the factors of a given string.
The suffix tree of a string y of length s can be built in O(s) time and space on a constant size alphabet [14,22]. Then, it is possible to check if a pattern x of length m is a factor of y in time O(m). Counting the number of occurrences of x in y can also be done in time O(m), while enumerating the positions where x occurs in y can be performed in time O(m + occ), where occ denotes the number of occurrences of x in y. Suffix trees can be adapted to a finite set of strings and are then called Generalised Suffix Trees (GSTs). Thus, given a set S of n strings of total length ‖S‖ on a constant size alphabet, the generalised suffix tree for S can be built in time and space O(‖S‖). For a detailed exposition of properties of suffix trees we refer the reader to [17]. Suffix trees have been widely studied and used in a large number of applications (see [15] and [17]). In practice, they consume too much space and are often replaced by the more economical suffix arrays [16], which have the same properties [23].
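To make the query operations above concrete, here is a minimal Python sketch that answers them with a suffix array rather than a suffix tree. It is only an illustration under simplifying assumptions: the array is built by naive sorting (not in linear time), each query uses binary search and so costs O(m log s) comparisons rather than O(m), and the function names `build_suffix_array` and `occurrences` are hypothetical.

```python
def build_suffix_array(y: str) -> list[int]:
    """Naive suffix array: start positions of the suffixes of y in lexicographic order."""
    return sorted(range(len(y)), key=lambda i: y[i:])


def occurrences(y: str, sa: list[int], x: str) -> list[int]:
    """Return the sorted 0-based positions at which x occurs in y."""
    m = len(x)
    # Lower bound: first rank whose suffix, truncated to m letters, is >= x.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if y[sa[mid]:sa[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first rank whose truncated suffix is > x.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if y[sa[mid]:sa[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes prefixed by x are contiguous in the array: ranks start..lo-1.
    return sorted(sa[start:lo])
```

Testing whether x is a factor of y amounts to checking that the returned list is non-empty, and counting occurrences to taking its length.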

Notation about strings
Here we introduce some notation and basic definitions.
• Suff_k(S) is the set of suffixes of length k of words of S.

Classical definition of de Bruijn graph
All definitions below refer to the set S; however, as S is clear from the context, we simply omit the "in S" in the notation.
For a word w of F(S),
• Support(w) is the set of pairs (i, j) such that w is the substring of s_i of length |w| starting at position j, i.e., w = (s_i)_{|w|,j}. Support(w) is called the support of w in S.
• RC(w) (resp. LC(w)) is the set of right contexts (resp. left contexts) of the word w in S, i.e., the set of words w' such that ww' ∈ F(S) (resp. w'w ∈ F(S)).
• ⌈w⌉ is the word ww' where w' is the longest word of RC(w) such that Support(w) = Support(ww'). In other words, such that w and ww' have exactly the same support in S.
• ⌊w⌋ is the word w' where w' is the longest prefix of w such that Support(w') ≠ Support(w).
In other words, ⌈w⌉ is the longest extension of w having the same support as w in S, while ⌊w⌋ is the shortest reduction of w with a support different from that of w in S. These definitions are illustrated in a running example presented in Fig. 1.
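The notions of support, right context and longest extension with the same support can be illustrated by the following naive Python sketch. It scans the reads directly instead of using an index, restricts the right context to single letters, and follows our reading of the definitions above; the function names are hypothetical and positions are 1-based, as in the text.

```python
def support(reads, w):
    """Pairs (i, j), both 1-based, such that w occurs in reads[i-1] at position j."""
    pairs = set()
    for i, s in enumerate(reads, start=1):
        j = s.find(w)
        while j != -1:
            pairs.add((i, j + 1))      # store a 1-based start position
            j = s.find(w, j + 1)       # look for the next (possibly overlapping) occurrence
    return pairs


def right_letters(reads, w):
    """Single letters a such that w + a is a factor of some read."""
    letters = set()
    for i, j in support(reads, w):
        s = reads[i - 1]
        end = j - 1 + len(w)           # 0-based index just after this occurrence
        if end < len(s):
            letters.add(s[end])
    return letters


def longest_same_support_extension(reads, w):
    """Extend w to the right while the support (as a set of pairs) is unchanged."""
    while True:
        letters = right_letters(reads, w)
        if len(letters) != 1:
            return w
        (a,) = letters
        if support(reads, w + a) != support(reads, w):
            return w                   # some occurrence of w ends a read
        w += a
```

For the reads "abab" and "bab", every occurrence of "ba" is followed by "b", so "ba" extends to "bab", whereas "ab" cannot be extended because two of its three occurrences end a read.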
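We give the definition of a de Bruijn graph for assembly (dBG for short), which differs from the original definition of a complete graph over all possible words of length k stated by de Bruijn [1].
Definition 1. Let k be a positive integer. The de Bruijn graph of order k for S is the directed graph DBG_k^+ := (V_k^+, E_k^+), where V_k^+ is the set of k-mers of S (the words of F(S) of length k) and E_k^+ := {(u, v) ∈ V_k^+ × V_k^+ | last_{k−1}(u) = first_{k−1}(v) and v[k] ∈ RC(u)}.
An equivalent definition of E_k^+ can be stated using the left instead of the right context: E_k^+ = {(u, v) ∈ V_k^+ × V_k^+ | last_{k−1}(u) = first_{k−1}(v) and u[1] ∈ LC(v)}. Examples of arcs are displayed in Fig. 2. The size of DBG_k^+ is defined as size(DBG_k^+) := |V_k^+| + |E_k^+|.
Note that another, simpler definition of the arcs in the de Bruijn graph coexists with that of Definition 1. There, an arc links u to v if and only if u overlaps v by k − 1 symbols. This graph is denoted by DBG_k^- := (V_k^-, E_k^-), where V_k^- := V_k^+ and E_k^- := {(u, v) ∈ V_k^- × V_k^- | last_{k−1}(u) = first_{k−1}(v)}. The arcs of E_k^- satisfy fewer constraints than those of E_k^+. Both definitions are illustrated in Fig. 3. Some assembly programs use DBG_k^- [9]. All the algorithmic results that we obtain for DBG_k^+ remain valid for DBG_k^-. In the sequel, we focus only on DBG_k^+.
Let us now introduce the notions of extensibility for a substring of S and that of a Contracted dBG (CdBG for short).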
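Before turning to extensibility, the difference between the two arc definitions above can be made concrete with a small Python sketch built directly from the reads. It assumes, as stated in the abstract, that the arcs of the first graph are exactly the (k + 1)-mers of the reads, and that the second graph links any two k-mers overlapping by k − 1 symbols; the function names `dbg_plus` and `dbg_minus` are hypothetical, and the quadratic loop in `dbg_minus` is for illustration only.

```python
def kmers(reads, k):
    """Set of all substrings of length k occurring in the reads."""
    return {s[i:i + k] for s in reads for i in range(len(s) - k + 1)}


def dbg_plus(reads, k):
    """Nodes are k-mers; there is one arc per (k+1)-mer occurring in some read."""
    nodes = kmers(reads, k)
    arcs = {(x[:k], x[1:]) for x in kmers(reads, k + 1)}
    return nodes, arcs


def dbg_minus(reads, k):
    """Nodes are k-mers; there is an arc whenever u overlaps v by k - 1 symbols."""
    nodes = kmers(reads, k)
    arcs = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, arcs
```

With the reads "abc" and "bcd" and k = 3, the overlap-based graph contains the arc (abc, bcd) although no read contains "abcd", so the first graph has no arc at all; with the single read "abcd" both graphs coincide.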

Definition 2 (Extensibility).
Let w be a word of F (S).
Let w be a word of length k'. The word w is said to be a unique k'-mer of S if and only if k' ≥ k and for all i ∈ [1..k' − k + 1], (w)_{k,i} ∈ F(S) and for all j ∈ [1..k' − k], (w)_{k,j} is right extensible and (w)_{k,j+1} is left extensible.
The contracted de Bruijn graph of order k, CDBG_k^+ := (V_{k,c}^+, E_{k,c}^+), is a directed graph where: Examples of CDBG_k^+ are displayed in Fig. 4. Note that in the previous definition, an element w in V_{k,c}^+ does not necessarily belong to F(S), since w may only exist as a substring of the agglomeration of two words of S. Thus, let w be a unique k'-mer, maximal by substring, with k' ≥ k: With this argument, we have both following propositions.
According to Propositions 1 and 2, CDBG_k^+ is the graph DBG_k^+ where the arcs (u, v) are contracted if and only if u is right extensible and v is left extensible.

Constructive characterisation of the de Bruijn graph
Let k be a positive integer. We define the following three subsets of F(S).
A word of InitExact_k is either only the suffix of some s_i or has at least two right extensions, while the first k-mer of a word in Init_k \ InitExact_k has only one right extension.
For w an element of Init_k, first_k(w) is a k-mer of S. Given two distinct words w_1 and w_2 of Init_k, first_k(w_1) and first_k(w_2) are distinct k-mers of S. Furthermore, for each k-mer w' of S, there exists a word w of Init_k such that first_k(w) = w'. From this, we get the following proposition. To define the arcs between the words of Init_k, which correspond to arcs of DBG_k^+, we need the following proposition, which states that each single letter that is a right extension of w gives rise to a single arc. Proposition 5. For w ∈ InitExact_k and a letter a of RC(w), there exists a unique w' ∈ Init_k such that last_{k−1}(w)a is a prefix of w'.

Proof. Let w be a word of InitExact_k and a a letter of RC(w). By definition of right context, last
The set Init_k represents the nodes of DBG_k^+. Let us now build the set of arcs that is isomorphic to E_k^+. Let w be a word of Init_k and Succ_k(w) denote the set of successors of first_k(w): We know that for each letter a in RC(w), there exists an arc from We consider two cases depending on the length of w: According to Proposition 3, w ∈ InitExact_k and hence last_{k−1}(w) ∈ SubInit_k. Therefore, the outgoing arcs of w in DBG_k^+ are the arcs from w to w' satisfying the condition of Proposition 5. Then, For simplicity, from now on, we identify the graph we build with DBG_k^+.

Constructive characterisation of the contracted de Bruijn graph
To do the same with CDBG_k^+, we first explain the algorithm that we use to build this graph, and then characterise the concepts of right and left extensibility in terms of word properties.

Our algorithm to build CDBG_k^+.
We present a generic algorithm to build CDBG_k^+ incrementally. It is explained in terms of words, and does not depend on any indexing data structure. In the following sections, we will use this generic algorithm and explain how it can be performed efficiently using a specified indexing structure.
The main algorithm (Algorithm 2) explores DBG_k^+ to find the nodes kept in CDBG_k^+ and sets all single arcs that represent whole non-branching paths of DBG_k^+ that are properly contracted. The key point is to find all starting nodes of simple paths and explore these paths from them; the exploration is done by Algorithm 1.
A more detailed explanation. First, note that to build DBG_k^+ it suffices to know the set Succ_k(.) for each node. The algorithm below simulates a traversal of DBG_k^+ without building it, and stores only one node per unique maximal k'-mer of DBG_k^+. For such a k'-mer, say m, we choose to represent it by the node v such that first_k(v) is a prefix of m. In DBG_k^+, m is represented by a simple (i.e., non-branching) path and v is its first node. In the traversal algorithm, for a current starting node v_c in Init_k, we traverse the simple path until we arrive at a node u having several successors or such that its only successor is not left extensible (i.e., has several predecessors). In other words, until we find u such that u is not right extensible or next(u) is not left extensible. In DBG_k^+, there exists a simple path between v_c and u, and this path must form a single node in CDBG_k^+. To contract this path, we choose to keep v_c, and for any successor w of u, we insert an arc between u and w, as this arc cannot be contracted. Note that w necessarily starts a chain (having at least a single node). If w is not yet in CDBG_k^+, we launch a new path exploration starting from w: first_k(w) is then the prefix of a node of CDBG_k^+, and thus w can appropriately represent the path. Now, if w already belongs to CDBG_k^+, the case is trickier. If v_f stores the first v_c called by the procedure, it may not be the starting node of a path, but be anywhere inside a path. Two cases arise. If v_f is considered during the while loop, then it is not at the start of a simple path: hence we must update V by exchanging v_f with v_c and terminate the exploration. Otherwise, v_f is traversed during the for loop (as the value of w), then it is a successor of u and the beginning of a simple path: we just add an arc linking v_c to w and stop. Finally, if w already belongs to V but w ≠ v_f, we also add an arc linking v_c to w and stop.
Algorithm 1. Input: the partial contracted graph CDBG_k^+ as (V, E), and two nodes v_f and v_c, where v_f is the initial starting node and v_c the current starting node. Output: the updated contracted graph (V', E'), which now contains all paths starting from v_c.
Algorithm 2. Input: a set of words S.
The process performed by Algorithm 1 augments the partial graph CDBG_k^+ restrained to the nodes visited when exploring the path starting from v_c. It suffices now to ensure that all arcs of DBG_k^+ are examined, which Algorithm 2 does. More precisely, it starts by visiting the simple paths starting at nodes having no predecessors (otherwise these nodes would not be visited). Once this is done, one must explore all nodes not yet marked and continue until all nodes have been visited/marked.
From the above discussion, we obtain the following theorem.
Theorem 2. Assume one can determine in constant time for an arc (u, v) of E_{k,c}^+ whether u is right extensible and whether v is left extensible. Then, with the sets Init_k, InitExact_k and SubInit_k, Algorithm 2 builds a graph that is isomorphic to CDBG_k^+ in time linear in the size of these sets.
Remark. Executing Algorithm 2 does not require building DBG_k^+, since the set of successors Succ_k(u) of any node u is computed in constant time.
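The contraction principle discussed above can be sketched in Python on an explicit graph. This is not the authors' Algorithms 1 and 2, which work on the sets Init_k, InitExact_k and SubInit_k without materialising the dBG; it is a simplified illustration that assumes the graph is given as sets of k-mer nodes and arcs, and that approximates extensibility by degrees: an arc (u, v) is contracted when u has exactly one successor and v exactly one predecessor. The function name `contract` and its helpers are hypothetical.

```python
from collections import defaultdict


def contract(nodes, arcs):
    """Merge every maximal non-branching path of a dBG into one node (unitig)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)

    def contractible(u, v):
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def starts_path(v):
        # v starts a path unless its unique incoming arc is contractible.
        return not (len(pred[v]) == 1 and contractible(pred[v][0], v))

    def walk(v):
        path = [v]
        while len(succ[path[-1]]) == 1:
            nxt = succ[path[-1]][0]
            if not contractible(path[-1], nxt) or nxt == v:
                break                      # branching reached, or cycle closed
            path.append(nxt)
        return path

    paths = {v: walk(v) for v in sorted(nodes) if starts_path(v)}
    visited = {x for p in paths.values() for x in p}
    for v in sorted(nodes):                # remaining nodes lie on isolated cycles
        if v not in visited:
            paths[v] = walk(v)
            visited.update(paths[v])

    # Spell each path: first k-mer, then the last letter of every following k-mer.
    label = {v: p[0] + "".join(x[-1] for x in p[1:]) for v, p in paths.items()}
    c_arcs = {(label[v], label[w]) for v, p in paths.items() for w in succ[p[-1]]}
    return set(label.values()), c_arcs
```

As in the text, the exploration first starts from nodes that begin a simple path, and a second pass handles the nodes that were never reached, which in this simplified setting can only lie on isolated cycles.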

Characterisation of the concepts of right and left extensibility.
By the construction of DBG_k^+, we get the following properties, which will prove useful for the construction of the CdBG from specific indexes (Sections 3 and 4).
Proposition 7. Let w be a word of Init_k such that first_k(w) is right extensible. Let the letter a be the unique element of
Proof. Let (i, j) be a pair of Support(first_k(w)). We have Hence (b
In summary, this section gives a formulation of the dBG of S in terms of words. Now assume that the substrings of the words are indexed in a data structure, e.g. a generalised suffix array. How can we build the dBG or the contracted graph directly from this structure? To achieve this, it suffices to compute the three sets Init_k, InitExact_k, SubInit_k, as well as the sets Support(.) and Succ_k(.) for some appropriate substrings. In the following sections, we exhibit algorithms to compute DBG_k^+ and CDBG_k^+ for two important indexing structures and for a home-made truncated data structure.

From a generalised suffix tree
Suffix Trees (ST) are among the most studied indexing data structures. A generalised ST can index the substrings of a set of words. Generally, for this purpose, all words are concatenated and separated by a special symbol not occurring elsewhere. However, this trick is not compulsory, and an alternative is to keep the indication of a terminating node within each node.

The suffix tree and its properties
The Generalised Suffix Tree of a set of words S is the suffix tree of S, where each word of S does not necessarily end with a letter of unique occurrence. Hence, for each node v of the Generalised Suffix Tree of S, we keep in memory the set, denoted by Suff(v), of pairs (i, j) such that the word represented by v is the suffix of s_i starting at position j. Let us denote by T the generalised suffix tree of S (from now on, we simply say the tree) and by V_T its set of nodes. For v ∈ V_T, Children(v) denotes its set of children and f(v) its parent. See Fig. 5 for an example of GST. Some nodes of T may have just one child. The size of the union of Suff(v) over all nodes v of T equals the number of leaves in the generalised suffix tree when the words end with a terminating symbol. Hence, the space to store T and the sets Suff(.) is linear in ‖S‖. For simplicity, for a node v of T, the word represented by v is identified with v. For each node v of T, v ∈ F(S). As not all elements of F(S) are necessarily represented by a node of T, we give the following proposition.

Proposition 8. The set of nodes of T is exactly the set of words w of F (S) such that d(w
We recall the notion of a suffix link (SL) for any node v of T (leaves included). We consider the same two cases as for the construction of E_k^+ above, but in the case of a tree. Let v ∈ Init_k (see Fig. 6).
We have that sl(v) is a node of V_T. As |v| > k, |sl(v)| ≥ k. Thus, there exists an element of Init_k between the root and sl(v). We associate v with this node, i.e. ⌈first_k(sl(v))⌉.
We illustrate these two cases in Fig. 5: Case 1. Case where v is 6,6 , sl(v) is 7,7 , the unique child u of v is 3 , and sl(u ) is 4 , which is in Init k .
Case 2. Case where v is 1 , sl(v) is 2 , and f irst k (sl(v)) is .
In both cases, building the arcs of E_k^+ requires following the SL of some node. The node, say u, pointed at by a SL may not be initial. Hence, the initial node representing the associated first k-mer of u is the only ancestral initial node of u. We equip each such node u with a pointer p(u) that points to the only initial node on its path from the root. In other words, for any u ∉ Init_k such that |u| > k, one has p(u) := ⌈first_k(u)⌉.
The algorithm to build DBG_k^+ is as follows. An initial depth-first traversal of T collects the nodes of Init_k and, for each such node, sets the pointer p(.) of all its descendants in the tree. Finally, to build E_k^+, one scans through Init_k and for each node v one adds Succ_k(v) to E_k^+ using the formula given above. Altogether this algorithm takes time linear in the size of T. Moreover, the number of arcs in E_k^+ is linear in the total number of children of initial nodes. This gives us the following result. For the left extensibility of the single successor of a node, one only needs the size of the support of some nodes (Proposition 7). Let us see first how to compute |Support(.)| on the tree, and then how to apply Proposition 7.
Proposition 10. Let v be a word of F(S) and let V_T(⌈v⌉) denote the set of nodes of the subtree rooted in ⌈v⌉.
Along a traversal of the tree, we can compute and store |Support(v)| and |Support(v) ∩ {(i, and by Proposition 7, first_k(sl(u)) is left extensible.
is left extensible takes constant time. To conclude, as for any initial node v we can compute in O(1) time its set of successors Succ_k(v), its right extensibility, and the left extensibility of its single successor, we can readily apply Algorithm 2 to build CDBG_k^+ and we obtain a complexity that is linear in the size of DBG_k^+, since each successor is accessed only once. This yields Theorem 4.

Theorem 4. For a set of words S, building the Contracted de Bruijn Graph of order k, CDBG_k^+, takes linear time and space in |T|, i.e., in ‖S‖.

From a generalised suffix array
In the previous subsections we have shown how to build de Bruijn graphs from suffix trees. Suffix trees are very elegant data structures but they are too space-consuming in practice. In many applications they have been replaced by suffix arrays, which are equivalent data structures and are more space economical. We will now show how to build de Bruijn graphs from suffix arrays.
Let SA and LCP be the generalised enhanced suffix array of S: LCP[i] is the length of the longest common prefix between the suffixes stored in SA[i − 1] and in SA[i]. Let us recall the definition of an lcp-interval.

Definition 4 ([23]). An interval [i, j], 1 ≤ i < j ≤ ‖S‖, is called an lcp-interval of value ℓ, also denoted by ℓ-[i, j], iff:
1. LCP[i] < ℓ,
2. LCP[h] ≥ ℓ for all h with i + 1 ≤ h ≤ j,
3. LCP[h] = ℓ for at least one h with i + 1 ≤ h ≤ j,
4. LCP[j + 1] < ℓ.
Let us now recall the definitions of the previous and next smaller values (PSV and NSV) arrays.
Definition 5 ([23]). For 2 ≤ i ≤ ‖S‖: PSV[i] := max{j | 1 ≤ j < i and LCP[j] < LCP[i]} and NSV[i] := min{j | i < j ≤ ‖S‖ + 1 and LCP[j] < LCP[i]}.
The direct inclusion among lcp-intervals defines a tree relationship called the lcp-interval tree (see [23, Def. 4.4.3, p. 87]). Given an lcp-interval ℓ-[i, j], its parent lcp-interval ℓ'-[i', j'] can be easily computed in constant time using the arrays LCP, PSV and NSV. Then: Actually, the lcp-interval tree does not need to be explicitly built, and the sets can be computed by a single scan of the SA and LCP arrays.
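As an illustration of the PSV and NSV arrays recalled above, the following Python sketch computes both in linear time with the classical stack technique. It is a generic sketch rather than the construction used in [23]: indices are 0-based, and the sentinels −1 and n (hypothetical conventions chosen here) stand for "no smaller value on that side".

```python
def psv_nsv(lcp):
    """Previous and next smaller value arrays of lcp (0-based indices).

    psv[i] is the largest j < i with lcp[j] < lcp[i], or -1 if none exists.
    nsv[i] is the smallest j > i with lcp[j] < lcp[i], or len(lcp) if none exists.
    """
    n = len(lcp)
    psv, nsv = [-1] * n, [n] * n
    stack = []                              # indices with strictly increasing lcp values
    for i in range(n):
        while stack and lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        psv[i] = stack[-1] if stack else -1
        stack.append(i)
    stack = []
    for i in range(n - 1, -1, -1):          # same scan from right to left
        while stack and lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return psv, nsv
```

Each index is pushed and popped at most once per scan, hence the linear running time.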
For an lcp-interval -[i, j] ∈ Init k we have (Support(s

Theorem 5. The contracted de Bruijn graph of order k, CDBG_k^+, for a set of words S can be built in time and space that are linear in ‖S‖ using the generalised suffix array of S.
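The single scan of the SA and LCP arrays mentioned in this section can be illustrated by the Python sketch below, which groups the suffixes of the reads by their first k letters and thus recovers the nodes of the dBG together with their occurrences. It is a simplified sketch, not the authors' algorithm: the generalised suffix array is built by naive sorting, positions are 0-based pairs (read index, offset), and the function names are hypothetical.

```python
def generalised_sa_lcp(reads):
    """Naive generalised suffix array of (read, offset) pairs, with its LCP array."""
    def suffix(p):
        return reads[p[0]][p[1]:]

    sa = sorted(((i, j) for i, s in enumerate(reads) for j in range(len(s))),
                key=lambda p: (suffix(p), p))
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = suffix(sa[r - 1]), suffix(sa[r])
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[r] = length
    return sa, lcp


def kmer_intervals(reads, k):
    """Map each k-mer to the suffix positions it prefixes, by one scan of SA and LCP."""
    sa, lcp = generalised_sa_lcp(reads)
    groups, current = {}, None
    for r, (i, j) in enumerate(sa):
        if len(reads[i]) - j < k:
            continue                        # suffix too short to contain a k-mer
        if current is None or lcp[r] < k:   # an LCP value below k starts a new k-mer
            current = reads[i][j:j + k]
            groups[current] = []
        groups[current].append((i, j))
    return groups
```

The scan relies on the fact that all suffixes sharing the same first k letters are contiguous in the suffix array, and that a suffix shorter than k always has an LCP value below k with its successor.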

Transition from a truncated structure to de Bruijn graphs
This section is organised as follows. In Section 4.1, we define a simple condition that a set of input strings must satisfy to allow building a generalised index and sketch a modification of McCreight's algorithm [14] for doing so. In Section 4.2, we introduce the reduced truncated suffix tree and specialise the previous algorithm for constructing it efficiently. Finally, in Section 4.3 we show how to construct both the de Bruijn Graph and its contracted version in optimal time from the reduced truncated suffix tree.

Set of chains of suffix-dependant strings and tree
Here, we introduce the notion of suffix dependence between strings, and the notion of chain of suffix-dependant strings in order to define a unified index that generalises both the suffix tree [14] and the truncated suffix tree [18]. First, let us define the concept of suffix-dependant strings and of chains of suffix-dependant strings. Definition 6.

1. A string x is said to be suffix-dependant of another string y if x[2..|x|] is a prefix of y.
2. Let w be a string and m be a positive integer smaller than |w| −
Let R = {C_1, ..., C_n} be a set of tuples such that for each i ∈ [1, n], C_i is a chain of suffix-dependant strings of the string s_i. For i ∈ [1, n] With R and S, we can easily compute R . In the sequel, we use R to demonstrate our results, and R to state the complexities of algorithms. Indeed, in the case where C_i is the tuple of each suffix of s_i, the size of Let w be a string; w may occur in distinct tuples of R. Thus, we define N(w) as the set of pairs (i, j) such that w = C_i[j]. In other words, N(w) is the set of coordinates of the elements of R that are equal to w.
We define a contracted version of the well-known Aho-Corasick tree [17].In fact, we apply nearly the same contraction process that turns a trie of a word into its compact Suffix Tree [17].Consider the Aho-Corasick tree of S, in which each node represents a prefix of words in S. We contract the non-branching parts of the branches except that we keep all nodes representing a word that belongs to a tuple in R .From now on, let T (R ) denote this contracted version of the Aho-Corasick tree of S.
N and L denote respectively the set of nodes and the set of leaves of T(R). Furthermore, we define for each node v of T(R) two weights:
• s(v) is the number of times that an element of a tuple of R is equal to the word represented by v (i.e., s(v) := |N(v)|);
• t(v) is the number of times that the first element of a tuple of R is equal to the word represented by v (i.e., t(v) := |{(i, j) ∈ N(v) | j = 1}|).
Let w be a string, we put Succ(w) = {(i, j) | (i, j − 1) ∈ N(w) and j ≤ |C_i|}. We define H as the subset of L such that: Below we present an algorithm that constructs T(R), and computes for each node v in N the weights s(v) and t(v) and a possible link P_0.
Construction of T(R). Now, we give an algorithm to construct T(R). We use the version of McCreight's algorithm given by Na et al. [18] on our input and we build for each leaf v, s(v), t(v) and P_0(v). For building T(R), we start with a tree that contains only the root. Then, for each word w in every chain C, we create or update (if it exists) the node w as follows.
Assume that we keep in memory the node v that has been processed just before w.
If w is the first word of C, we go down from the root by comparing w to the labels of the tree. If we create the node w, s(w) and t(w) are initialised to 1, and P_0(w) to nil. If w already exists in the tree, we increment s(w) and t(w) by 1.
If w is not the first word of C, we start from v, and as in McCreight's algorithm, we create or arrive at the node representing w. If we need to create this node, s(w) is initialised to 1, t(w) to 0, and P_0(w) to nil. Otherwise, we add 1 to s(w). We set P_0(v) = w.
The loop continues with the next word until the end, and we obtain T(R).
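The bookkeeping of the weights s(.) and t(.) and of the link P_0(.) described above can be sketched in Python as follows. This is a deliberately simplified illustration: it stores the words in dictionaries instead of building the contracted tree, so it does not reproduce the McCreight-style traversal or its linear running time; the function name `build_weights` is hypothetical.

```python
def build_weights(chains):
    """Compute s, t and P_0 for every word occurring in the given chains.

    chains: iterable of tuples of strings, one tuple per chain C.
    Returns three dictionaries keyed by word: s, t and p0 (None stands for nil).
    """
    s, t, p0 = {}, {}, {}
    for chain in chains:
        prev = None                        # word processed just before the current one
        for idx, w in enumerate(chain):
            if w not in s:                 # "create the node w"
                s[w], t[w], p0[w] = 0, 0, None
            s[w] += 1                      # w is an element of a tuple
            if idx == 0:
                t[w] += 1                  # w is the first element of a tuple
            else:
                p0[prev] = w               # link from the previous word to w
            prev = w
    return s, t, p0
```

For the two chains ("abc", "bc") and ("bc", "c"), the word "bc" is counted twice in s, once in t, and is linked to "c".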
Theorem 6. For a set of chains of suffix-dependant strings R, we can construct T(R) in O(‖S‖) time and space.
Proof. To begin with, let us prove that T(R) is in O(‖S‖) space. Its number of leaves equals Σ_{C∈R} |C|. Hence, its number of nodes is at most 2 Σ_{C∈R} |C| − 1 ≤ 2‖S‖, and its number of edges is at most 2‖S‖. Thus the size of T(R) is in O(‖S‖).
Clearly, the construction algorithm of T(R) computes both weights s(.) and t(.), and the possible link P_0(.) correctly.
For the complexity, for each chain of suffix-dependant strings C_i of R, the length of the path traversed on the tree is equal to |w_i|, thanks to the use of the suffix links. Thus, as in McCreight's algorithm, the complexity is in O(‖S‖).
Now, we are equipped with an algorithm that builds T(R) for any set of chains of suffix-dependant strings. Let us review some instances of sets S, for which T(R) is in fact a well-known tree.
• If C := ∪_{w∈S} {tuple of suffixes of w}, then T(C) is the Generalised Suffix Tree of S (see Fig. 7a). We have that the restrained mapping sl(.) is an example of a possible link.
• If B_k := ∪_{w∈S} {tuple of k-mers of w and suffixes of length k' < k of w}, then T(B_k) is the generalised k-truncated suffix tree of S, as defined in [19] (which generalises the k-truncated suffix tree of Na et al. [18]).
• If A_k := ∪_{w∈S} {tuple of (k + 1)-mers of w and suffixes of length k of w}, then T(A_k) is the truncated suffix tree that we define below in Section 4.2 (see Fig. 7b).
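The third instance above can be checked on small inputs with the following Python sketch. It builds, for one word w, the tuple made of its (k + 1)-mers followed by its suffix of length k, and verifies suffix dependence between consecutive elements. Reading a chain as a tuple whose consecutive elements are suffix-dependant is our assumption, since only the pairwise definition is stated explicitly above; the function names are hypothetical.

```python
def is_suffix_dependant(x, y):
    """x is suffix-dependant of y if x without its first letter is a prefix of y."""
    return y.startswith(x[1:])


def chain_a(w, k):
    """Tuple of the (k+1)-mers of w in order, followed by the suffix of length k of w."""
    if len(w) < k:
        return ()
    return tuple(w[j:j + k + 1] for j in range(len(w) - k)) + (w[len(w) - k:],)


def is_chain(chain):
    """Assumed reading of a chain: each element is suffix-dependant of the next one."""
    return all(is_suffix_dependant(x, y) for x, y in zip(chain, chain[1:]))
```

For w = "abcd" and k = 2 the tuple is ("abc", "bcd", "cd"): dropping the first letter of each element yields a prefix of the next one, which is the property that lets a McCreight-style construction move from one word of the chain to the next.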

Our truncated suffix tree
First, we define the following notation.
Definition 7. For all i ∈ [1, |S|] and j ∈ [1, |s_i| − k + 1], A_{k,i} denotes the tuple such that its jth component is the (k + 1)-mer of s_i starting at position j if j ≤ |s_i| − k, and the suffix of length k of s_i if j = |s_i| − k + 1.
1. A_{k,i} is a chain of suffix-dependant strings of s_i.
2. For all
For the second point
By applying the algorithm described in Section 4.1 to the set A_k (Definition 7), and by using Theorem 6, we get the following result.
Corollary 1. We can construct T(A_k) in O(‖S‖) time and space.

Experimental results
We tested the two data structures GST and TST on real biological data. We considered a set of 2249632 Illumina reads of yeast of length 101 and performed tests for subsets of size 100, 1000, 10000, 100000, 1000000 and for the whole set. We counted the number of nodes of the GST and of the TST for various values of k (5, 10, 20 and 40). We used the gsuffix1 of [19]. It should be noted that their implementation of the TST stores all the suffixes shorter than k, thus producing more nodes than our TST. Fig. 8 displays the results. For small sets, TSTs do not save many nodes compared to the GST, except for very small values of k. For large sets, TSTs save many nodes for small values of k: they save more than two thirds of the nodes for k = 20 and almost half of the nodes for k = 40. We also performed experiments with longer reads from Pacific Biosciences technology (not shown here). In this case, as expected, TSTs save fewer nodes than for Illumina reads.

De Bruijn graph via the truncated suffix tree
Here, we describe an algorithm that builds the de Bruijn Graph of S starting from the generalised truncated suffix tree of S.

De Bruijn graph
Proposition 12 states that there does not exist any leaf in T (A k ) representing a word strictly shorter than k.

Proposition 12. Let v be a leaf of T (A
For a possible link P_0, we define the mapping P from H to N. H, N and L have the same definition as before, but applied to T(A_k). H can be seen in this case as the set of leaves of length k + 1 of T(A_k). We define the mapping P as follows: The mapping P can be constructed in linear time, in O(‖S‖). In fact, for each v ∈ H, P(v) can be constructed in O(1) time because in this case, P_0 According to the definitions of a possible link P, and of A_k, for any node v in L, P(v) is the shortest node of T(A_k) such that v is a prefix of P(v).
Proof. We begin by building T(A_k). With T(A_k), we can build Init_k, SubInit_k and InitExact_k as we do on the generalised suffix tree of S. By using P as the suffix link, we can build the graph (V, E) satisfying Let us note that two cases of arcs arise depending on whether the starting node v represents a word of length k or of length k + 1. These cases correspond to the two terms in the union above.
This graph is isomorphic to D BG + k .
Let b be the application from

Fig. 9 shows an example of a de Bruijn graph of order 2 built from $T(A_2)$.

Let $z$ be a node of $SubInit_{k-1}$. Assume $z$ is a parent of some node of $Init_{k-1}$. If the latter is a source node, then $z$ is inserted on line 12; if the latter is a target node, $z$ is inserted on line 21. Otherwise, $z$ must be pointed to by a suffix link of some node $v$. As the suffix link removes the first letter, the word lengths of $v$ and $z$ differ by exactly one. Hence, $v$ must be a node representing exactly a $(k - 1)$-mer and must belong to $Init_{k-1}$. This case is detected on line 5 for a source node (line 14 for a target), and $z$ is properly inserted on line 7 (resp. on line 16). This ends the correctness proof of Algorithm 3.

The updates of nodes in Case 1 (see Fig. 6 on p. 9) are illustrated in Fig. 10. Looking at the tree rooted in $sl(s)$ and whose leaves are the $Succ_k(s)$, one can determine whether one faces the case illustrated in Fig. 10a when changing $k$ to $k - 1$.
Clearly, the two nested loops of Algorithm 3 scan over $E^+_k$. The instructions inside can all be performed in constant time. The complexity of Algorithm 3 is thus linear in the number of arcs in $E^+_k$. Moreover, since it outputs what it needs as input, one can iterate this algorithm over any interval of values of $k$. Finally, the construction algorithm that starts directly from the suffix tree is asymptotically optimal and has the same time complexity as the dynamic update of the dBG order.
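The idea that an update "outputs what it needs as input" can be illustrated at the level of plain sets. The Python sketch below is not Algorithm 3, which operates on suffix tree nodes; it only relies on the observation that each k-mer is an arc between two successive (k − 1)-mers, so the node set of order k yields the whole graph of order k − 1. It assumes that every word has length at least k, and the names `dbg_plus` and `decrease_order` are ours.

```python
def dbg_plus(words, k):
    """Direct construction: nodes are k-mers, arcs join successive k-mers."""
    nodes = {s[i:i + k] for s in words for i in range(len(s) - k + 1)}
    arcs = {(s[i:i + k], s[i + 1:i + k + 1])
            for s in words for i in range(len(s) - k)}
    return nodes, arcs


def decrease_order(nodes_k):
    """From the k-mers alone, derive the nodes and arcs of order k - 1.

    Assumption: every input word has length >= k, so that each (k-1)-mer
    is a prefix or a suffix of some k-mer.
    """
    nodes, arcs = set(), set()
    for w in nodes_k:
        u, v = w[:-1], w[1:]  # the two successive (k-1)-mers inside w
        nodes.add(u)
        nodes.add(v)
        arcs.add((u, v))
    return nodes, arcs


# Word set taken from the caption of Fig. 7 (shortest word has length 5).
WORDS = ["bacbab", "bbacbaa", "bcaacb", "cbaac", "cbabcaa"]

if __name__ == "__main__":
    nodes, _ = dbg_plus(WORDS, 5)
    for k in range(5, 2, -1):
        # The output node set is fed back as the input of the next step.
        nodes, arcs = decrease_order(nodes)
        assert (nodes, arcs) == dbg_plus(WORDS, k - 1)
        print(f"order {k - 1}: {len(nodes)} nodes, {len(arcs)} arcs")
```

Each step touches every k-mer once, which mirrors, in a much simpler setting, the linear behaviour claimed for the update.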

Conclusion and perspectives
De Bruijn Graphs (dBG) are intricate structures that are intensively exploited for assembling large genomes from short sequences. Understanding their complexity can help improve their representations or traversal algorithms. We investigate algorithms that transform indexing data structures of the input words into a dBG of those words, and propose linear time algorithms that directly build a contracted dBG when starting from Suffix Trees and Suffix Arrays. Although the algorithms need a slight adaptation, all results obtained are valid for both definitions of the dBG: $DBG^+_k$ and $DBG^-_k$. Moreover, we show that this approach provides a way to dynamically update the graph when one changes its order $k$. Algorithms enabling a dynamic update represent a theoretical challenge as well as an exciting avenue for improving genome assembly methods [24,8]. Other topics for future research include transforming compressed indexes, such as an FM-index [23], into a dBG, and implementing a practical contracted dBG representation for DNA, based on these algorithms, that takes into account k-mers and their reverse complements.

Fig. 3. With solid arcs only, the graphs correspond to $DBG^+_2$ (a) and $DBG^+_3$ (b) for our running example. With both solid and dotted arcs, they represent $DBG^-_2$ (a) and $DBG^-_3$ (b).

Fig. 4. The graphs correspond to $CDBG^+_2$ (a) and $CDBG^+_3$ (b) for our running example.

Definition 3. A contracted de Bruijn graph of order k, denoted by $CDBG^+_k = (V^+_{k,c}, E^+_{k,c})$, is a directed graph where:

Proposition 4. There exists a bijection between $Init_k$ and the set of the k-mers of S.

According to Definition 1 and Proposition 4, each vertex of $DBG^+_k$ can be identified with a unique element of $Init_k$. As the vertices of $DBG^-_k$ are identical to those of $DBG^+_k$, there also exists a bijection between $Init_k$ and the set of vertices of $DBG^-_k$.

begin
    u := v_c; mark u
    // search the node ending the chain that goes through v_c
    while u is right extensible and next(u) is left extensible do

    u := next(u); mark u
    // now explore the path starting in the successor of u

Fig. 5. The generalised suffix tree for our running example and the constructed de Bruijn graph for k := 2. Square nodes represent words that occur as a suffix of some $s_i$; circle nodes are the other nodes of $T$. Nodes in grey are those used to represent the nodes of the dBG. Each square node stores its positions of occurrences in S; for simplicity, we display the starting position as a number and the word of S in which it occurs as its colour, instead of showing the list of pairs (i, j). The solid curved arrows are the edges of the de Bruijn graph for k := 2; those coloured in red correspond to Case 1 and those in blue to Case 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Theorem 3. For a set of words S, building the de Bruijn Graph of order k, $DBG^+_k$, takes linear time and space in $|T|$, i.e., in $\|S\|$.

3.1.3. Construction of $CDBG^+_k$
In Section 2.3, we have seen an algorithm that computes $CDBG^+_k$ directly, provided that one can determine whether a node $v$ is right extensible and whether $next(v)$ is left extensible, where $next(v)$ denotes the only successor of $v$. Let us see how to compute the extensibility in the case of a Suffix Tree. By applying Proposition 6 in the case of a tree, for an element $v$ of $Init_k$, $first_k(v)$ is right extensible if and only if $|v| > k$ or $\#(Children(v)) = 1$. Thus checking the right extensibility of a node takes constant time.
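The contraction driven by these two extensibility tests can be sketched in Python on an explicit graph rather than on a suffix tree. This is an illustration under simplifying assumptions, not the algorithm of Section 2.3: a node is taken to be right extensible when it has exactly one successor and left extensible when it has exactly one predecessor, end-of-word effects are ignored, and only the unitig strings are returned (arcs between unitigs are omitted). The name `unitigs_of_dbg` is ours.

```python
from collections import defaultdict


def unitigs_of_dbg(words, k):
    """Return the words spelled by maximal non-branching paths of the dBG."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for s in words:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
        for i in range(len(s) - k):
            u, v = s[i:i + k], s[i + 1:i + k + 1]
            succ[u].add(v)
            pred[v].add(u)

    def can_merge(u):
        # u is right extensible and next(u) is left extensible (no self-loop)
        if len(succ[u]) != 1:
            return False
        (v,) = succ[u]
        return v != u and len(pred[v]) == 1

    def continues_chain(v):
        # v is reached through a mergeable arc, so it cannot start a chain
        if len(pred[v]) != 1:
            return False
        (p,) = pred[v]
        return can_merge(p)

    marked, unitigs = set(), []

    def walk(start):
        path, u = [start], start
        marked.add(start)
        while can_merge(u):
            (v,) = succ[u]
            if v in marked:  # the chain closed a cycle
                break
            path.append(v)
            marked.add(v)
            u = v
        unitigs.append(path[0] + "".join(x[-1] for x in path[1:]))

    for v in sorted(nodes):  # first, chains with a well-defined start
        if not continues_chain(v):
            walk(v)
    for v in sorted(nodes):  # then, nodes lying on isolated cycles
        if v not in marked:
            walk(v)
    return unitigs


# Word set taken from the caption of Fig. 7.
WORDS = ["bacbab", "bbacbaa", "bcaacb", "cbaac", "cbabcaa"]

if __name__ == "__main__":
    print(sorted(unitigs_of_dbg(WORDS, 2)))
```

Each extensibility test here is a set-size check; the point made above is that the suffix tree offers a comparable constant-time test for right extensibility without materialising the successor sets.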

Fig. 7. (a) The generalised suffix tree for the set of words {bacbab, bbacbaa, bcaacb, cbaac, cbabcaa}. The part above the green line corresponds to the TST $T(A_2)$, which is shown in (b). (b) The truncated suffix tree $T(A_2)$ for the same set of words. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

otherwise 2 .
and $A_k$ is the set of these tuples: $A_k := \bigcup_{i=1}^{n} A_{k,i}$.

Proposition 11.

Fig. 8. Number of nodes of the GST vs the TST for k = 5, 10, 20, 40 and the percentage compared to the GST for Illumina reads of length 101.

Fig. 9. The de Bruijn graph of order 2 built on $T(A_2)$. The solid curved arrows are the edges corresponding to the first part of the definition of $E^+_k$, while those in blue correspond to the second part. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Illustrating an update of k, the dBG order: (a) from k to k − 1 and (b) from k to k + 1. This figure shows only the cases of a type 1 node that is not right extensible. At order k, the square node in dotted line has two possible right extensions: two (red) arcs leaving it. In (a), at order k − 1 its parent belongs to $Init_{k-1}$ and it becomes right extensible. One sees that the tree structure helps in determining this. In (b), the counterpart situation occurs, where the children of the square node in dotted line belong to $Init_{k+1}$ and they both become right extensible. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

An alphabet $\Sigma$ is a finite set of letters. A finite sequence of elements of $\Sigma$ is called a word or a string. The set of all words over $\Sigma$ is denoted by $\Sigma^*$, and $\varepsilon$ denotes the empty word. For a word $x$, $|x|$ denotes the length of $x$. Given two words $x$ and $y$, we denote by $x \cdot y$ or simply $xy$ the concatenation of $x$ and $y$. For every $1 \le i \le j \le |x|$, $x[i]$ denotes the $i$-th letter of $x$, and $x[i .. j]$ denotes the substring or factor $x[i]x[i+1] \ldots x[j]$. Let $k$ be a positive integer. If $|x| \ge k$, $first_k(x)$ is the prefix of length $k$ of $x$ and $last_k(x)$ is the suffix of length $k$ of $x$. Then a substring of length $k$ of $x$ is called a k-mer of $x$. For $i$ such that $1 \le i \le |x| - k + 1$, $(x)_{k,i}$ is the k-mer of $x$ starting in position $i$, i.e., $(x)_{k,i} = x[i .. i + k - 1]$. Thus we have $first_k(x) = (x)_{k,1}$ and $last_k(x) = (x)_{k,|x|-k+1}$. We denote by $\#(\Lambda)$ the cardinality of any finite set $\Lambda$.

Let $S = \{s_1, \ldots, s_n\}$ be a finite set of words. It is our running instance for all the following. Let us denote the sum of the lengths of the input strings by $\|S\| := \sum_{i=1}^{n} |s_i|$. We denote by:
• $F(S)$ the set of factors of S, i.e., $F(S) = \{w \in \Sigma^* \mid \exists\, u, v \in \Sigma^*,\ 1 \le i \le n,\ s_i = uwv\}$.
• $F_k(S)$ the set of factors of length $k$ of S where $k$ is a positive integer, i.e., $F_k(S) = F(S) \cap \Sigma^k$.
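The notation above translates almost literally into code. The following Python sketch is only meant as a reading aid: the function names are ours, positions are 1-based as in the text, and the example word comes from the set shown in Fig. 7.

```python
def first_k(x, k):
    """Prefix of length k of x (requires |x| >= k)."""
    if len(x) < k:
        raise ValueError("word shorter than k")
    return x[:k]


def last_k(x, k):
    """Suffix of length k of x (requires |x| >= k)."""
    if len(x) < k:
        raise ValueError("word shorter than k")
    return x[len(x) - k:]


def kmer_at(x, k, i):
    """The k-mer (x)_{k,i} = x[i .. i+k-1], with 1 <= i <= |x| - k + 1."""
    if not 1 <= i <= len(x) - k + 1:
        raise ValueError("position out of range")
    return x[i - 1:i - 1 + k]


def factors_k(words, k):
    """F_k(S): the set of factors of length k of the words of S."""
    return {s[i:i + k] for s in words for i in range(len(s) - k + 1)}


if __name__ == "__main__":
    x = "bacbab"
    print(first_k(x, 2), last_k(x, 2), kmer_at(x, 2, 3))
    print(sorted(factors_k([x], 2)))
```

The identities $first_k(x) = (x)_{k,1}$ and $last_k(x) = (x)_{k,|x|-k+1}$ can be checked directly with these helpers.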

Definition 1. Let $k$ be a positive integer. The de Bruijn graph of order $k$ for S, denoted by $DBG^+_k = (V^+_k, E^+_k)$, is a directed graph whose vertices are the k-mers of words of S and where an arc links $u$ to $v$ if and only if $u$ and $v$ are two successive k-mers of a word of S, i.e.:

Fig. 2. Examples of arcs from $DBG^+_k$. (a) shows letters in the right context of ba, and (b) the successors of node ba in $DBG^+_2$; one for each letter in $RC(w) \cap \Sigma$. (c) shows letters in the left context of ba, and (d) the predecessors of node ba in $DBG^+_2$; one for each letter in $LC(w) \cap \Sigma$.
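Definition 1 can be applied naively by scanning the words, as in the Python sketch below. It is the direct quadratic-free but index-free construction, not one of the index-based algorithms of this paper; the names `build_dbg_plus`, `successors` and `predecessors` are ours, and the word set is the one given in the caption of Fig. 7.

```python
def build_dbg_plus(words, k):
    """Nodes are the k-mers of the words; an arc links successive k-mers."""
    nodes, arcs = set(), set()
    for s in words:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
        for i in range(len(s) - k):
            # s[i .. i+k-1] and s[i+1 .. i+k] are successive k-mers of s
            arcs.add((s[i:i + k], s[i + 1:i + k + 1]))
    return nodes, arcs


def successors(arcs, u):
    return {v for (a, v) in arcs if a == u}


def predecessors(arcs, v):
    return {u for (u, b) in arcs if b == v}


# Word set taken from the caption of Fig. 7.
WORDS = ["bacbab", "bbacbaa", "bcaacb", "cbaac", "cbabcaa"]

if __name__ == "__main__":
    nodes, arcs = build_dbg_plus(WORDS, 2)
    print(sorted(nodes))
    print("successors of ba:", sorted(successors(arcs, "ba")))
    print("predecessors of ba:", sorted(predecessors(arcs, "ba")))
```

For this word set and k = 2, node ba has one successor per letter that follows an occurrence of ba, and one predecessor per letter that precedes one, in line with the right and left contexts described for Fig. 2.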
and there exists a unique outgoing arc of $w$: that from $w$ to $w[2 .. k]$. Indeed, by definition of $Init_k$, $w[2 .. k] \in Init_k$, and thus $Succ_k(w) = \{ w[2 .. k] \}$. Now, we can build $DBG^+_k$ in full or, more exactly, a graph isomorphic to $DBG^+_k$.

Theorem 1. With the sets $Init_k$, $InitExact_k$ and $SubInit_k$, we can build a graph isomorphic to $DBG^+_k$ in time linear in the size of these sets.
// for any node v of DBG^+_k without predecessors
// and build CDBG^+_k from v
for v ∈ Init_k do
    if there exists no w such that v ∈ Succ_k(w) then
        (V, E) := (V, E) BuildAuxCDBG(V ∪ {v}, E, v, v)
// explore DBG^+_k from any node not yet visited
for v_c an unmarked node of Init_k do