BIGMAC : breaking inaccurate genomes and merging assembled contigs for long read metagenomic assembly

Background The problem of de-novo assembly for metagenomes using only long reads is gaining attention. We study whether post-processing metagenomic assemblies with the original input long reads can result in quality improvement. Previous approaches have focused on pre-processing reads and optimizing assemblers. BIGMAC takes an alternative perspective to focus on the post-processing step. Results Using both the assembled contigs and original long reads as input, BIGMAC first breaks the contigs at potentially mis-assembled locations and subsequently scaffolds contigs. Our experiments on metagenomes assembled from long reads show that BIGMAC can improve assembly quality by reducing the number of mis-assemblies while maintaining or increasing N50 and N75. Moreover, BIGMAC shows the largest N75 to number of mis-assemblies ratio on all tested datasets when compared to other post-processing tools. Conclusions BIGMAC demonstrates the effectiveness of the post-processing approach in improving the quality of metagenomic assemblies. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1288-y) contains supplementary material, which is available to authorized users.


Introduction
De-novo assembly is a fundamental yet difficult [1] computational problem in metagenomics. Indeed, there is currently an open challenge for metagenomic assembly using short reads, titled "Critical Assessment of Metagenomic Interpretation (CAMI [2]). " On the other hand, emerging sequencing technologies can provide extra information and make the computational problem more tractable. For example, long reads are increasingly being shown to be valuable in the de-novo assembly of single genomes [3]. Therefore, it is natural to study metagenomic assembly using long reads. Current assembly approaches for long reads focus on developing more optimized assemblers to leverage the power of the data. However, significant engineering effort is usually involved to build an end-to-end assembler. We take a different perspective, focusing the design effort on a post-processor that improves assembled contigs using the original long read data (Fig. 1). The main question is whether we can achieve quality improvement with this approach using typical long reads. This post-pocessing approach is attractive because it leverages existing software. Consequently, we can focus design effort and computational resources to specifically address thorny issues arising from the nature of new data, complex repeats, varying abundances and noise. Moreover, since the long reads are part of the input, the post-processor can bring back information missed upstream. In single genome assembly, FinisherSC [4] has demonstrated the effectiveness of this approach. In this paper, we investigate the effectiveness of this post-processing approach for metagenomic assembly.
BIGMAC is a post-processor to improve metagenomic assemblies with long read only data. It first breaks contigs at potentially mis-assembled locations and subsequently scaffolds contigs. In our experiments, BIGMAC  (left). Improvement in assembly quality using post-processor BIGMAC on three different datasets (right). BIGMAC shows the effectiveness of the post-processing approach for long read metagenomic assembly demonstrates promising results on several mock communities using data from the Pacific Biosciences long read sequencer. Inputs to BIGMAC include assembled contigs from HGAP [5] and the original raw reads with adapters removed. After assembly and post-processing, we use QUAST [6] to evaluate and compare the assembly quality of contigs before and after using BIGMAC. As shown in Fig. 1, BIGMAC improves the quality of the contigs by reducing the number of mis-assemblies while maintaining/increasing N50 and N75. This demonstrates the effectiveness of the post-processing approach for metagenomic assembly with long reads.

A top-down design of BIGMAC
We use a hypothetical yet representative set of input data to illustrate the design of BIGMAC in a top-down manner. Let g 1 , g 2 be two genomes of abundances λ 1 , λ 2 respectively. Assume that they share a long repeat in the middle, that is, g 1 = x 1 ry 1 , g 2 = x 2 ry 2 . Unfortunately, an upstream assembler mis-assembles the reads and produces two contigs c 1 , c 2 with incorrect joins at the repeat. That is, c 1 = x 1 ry 2 , c 2 = x 2 ry 1 . As an assembly post-processor, BIG-MAC takes the mis-assembled contigs c 1 , c 2 and original reads as input. Its goal is to reproduce g 1 , g 2 . To achieve this, we immediately recognize that there should be components for fixing mis-assemblies and scaffolding contigs. In BIGMAC, they are respectively Breaker and Merger. An illustration is given in Fig. 2.
Breaker is further divided into two subcomponents: Signal Detector and Signal Aggregator. Signal Detector collects signals that indicates mis-assemblies and Signal Aggregator subsequently makes decisions on breaking contigs based on the collected signals. In our example, the coverage fluctuation between λ 1 , λ 2 along the contigs c 1 , c 2 and the presence of long repeat r are useful signals that Signal Detector collects. After aggregating these signals, Signal Aggregator decides on breaking both the contigs c 1 and c 2 at the starting points of the repeat r. Therefore, the output contigs of Breaker are x 1 , x 2 , ry 1 , ry 2 .
Merger is also divided into two subcomponents: Graph Operator and Contig Extender. With information from the original reads, Graph Operator establishes connectivity of the input contigs using string graphs. Then, based on the evidence from spanning reads and contig coverage, Contig Extender extends input contigs into longer contigs. In our example, the input contigs to Merger are x 1 , x 2 , ry 1 , ry 2 . Graph Operator forms a string graph with edges x 1 → ry 1 , x 1 → ry 2 , x 2 → ry 1 and x 2 → ry 2 . Contig Extender analyzes the coverage depth of the related contigs and decides to merge contigs to form x 1 ry 1 and x 2 ry 2 , thus reproducing the correct genomes.
Finally, we remark that BIGMAC uses overlap information between reads and contigs or among contigs. In our implementation, the overlap information is provided by running MUMmer [7] on appropriate strings. However, we note that one can implement the pipeline by replacing MUMmer with other aligners.

Breaker: breaking inaccurate genome
After the functional decomposition of BIGMAC in the previous section, we are ready to investigate its first component: Breaker. We note that the goal of Breaker is to fix mis-assemblies. In order to achieve that, we need to collect sensible signals related to mis-assemblies and subsequently aggregate the signals to make the contig breaking decisions.

Signal Detector
Signal Detector collects three important signals related to mis-assemblies.
Palindrome We are interested in palindromes because of their relationship to a form of chimeric reads, the adaptor-skipped reads, which are common in today's long read technology [8]. Since assemblers get stuck at these chimeric reads, the palindrome pattern in reads propagates to the corresponding contigs. Thus, the pattern of palindrome is a strong signal indicating mis-assemblies, particularly when the palindrome is long. A string tuple (a, b) is called a wrapping pair if the reverse complement of a is a prefix of b or the reverse complement of b is a suffix of a. x is called a palindrome if it is the concatenation of a wrapping pair (a, b), that is x = ab. The wrapping length of x is max x=ab,(a,b)is a wrapping pair min(a.length, b.length). For example, x = ACGGCCG is a palindrome of wrapping length 3; (a, b) = (ACGG, CCG) is a wrapping pair because the reverse complement of b is CGG, which is a suffix of a. Since the minimum length of a and b is min(4, 3) = 3 and the wrapping length of x cannot exceed 3, the wrapping length for x is 3.
Signal Detector collects information about palindromes by aligning each contig against itself. It then identifies palindromes with long wrapping length because that indicates mis-assemblies. The corresponding palindromes' information is then put into S palindrome . To improve sensitivity, Signal Detector allows approximate match when searching for palindromes. We note that approximate matches are determined by computing the edit distance of the corresponding strings. To determine approximate matches in BIGMAC, we use nucmer in MUMmer [7] with default parameters and with option -maxmatch. Two strings are considered as approximately matched when nucmer reports so.
Repeat Since long repeats confuse assemblers, their endpoints are possible candidates for mis-assemblies. Let s 1 = axb, s 2 = cxd, a repeat between s 1 , s 2 is specified by the endpoints of x in s 1 , s 2 . For example, s 1 = CAAAAT, s 2 = GAAAAG, the endpoints of the repeat AAAA are the position specified by ! in C! AAAA! T, G! AAAA! G. Signal Detector aligns contigs against themselves to find the repeats. It then marks down the positions of the endpoints and puts them in a set called S repeat . To improve sensitivity, Signal Detector allows approximate matches when searching for repeats. We note that approximate matches are determined by computing the edit distance of the corresponding strings. Moreover, it only considers repeats that are maximal and are of significant length. Coverage Significant coverage variation along contigs can correspond to false joins of sequences from different genomes with different abundances. Coverage profile is the coverage depth along the contigs. For example, the coverage profile along a string s = ACGT is (1, 2, 2, 1) if the reads are AC, CG, GT. Signal detector aligns original reads to the contigs to find the coverage profile, which is called S coverage .

Signal Aggregator
After Signal Detector collects signals regarding palindromes S palindrome , repeats S repeat and coverage profile S coverage , Signal Aggregator uses them to determine the breakpoints on the input contigs C. The algorithm is summarized in Algorithm 1.
ChimericContigFixing The goal of ChimericContigFixing is to fix the contigs formed from chimeric reads. We simply break the palindromes in S palindrome at locations corresponding to their wrapping lengths. After removing redundant contigs, ChimericContigFixing returns the broken palindromes with the unchanged contigs. S filter = LocatePotentialMisassemblies(S repeat , C ) Locate mis-assemblies caused by repeats 6: C = ConfirmBreakPoints(S filter , S coverage , C ) Confirm mis-assemblies using coverage 7: return C LocatePotentialMisassemblies The goal of the subroutine LocatePotentialMisassemblies is to locate potential mis-assemblies caused by repeats. We study the design of this subroutine in this section.

Motivating question and example
We can declare all the endpoints of approximate repeats in S repeat to be potential mis-assemblies. While this is a sensible baseline algorithm, it is not immediately clear whether it is sufficient or necessary. It is thus natural to consider the following question.
Given a set of contigs, how can we find the smallest set of locations on contigs to break so that the broken contigs are consistent with any reasonable ground truth? To illustrate our ideas, we consider an example with contigs The baseline algorithm of breaking contigs at the starting points of all the long(≥ 2L) repeats breaks the contigs 4 times (i.e. a|b|cde, f |bcg, h|cdi). However, interestingly, we will show that one only need to break the contigs 3 times to preserve consistency (i.e. x 1 = ab|cde, x 2 = fb|cg, x 3 = h|cdi) and that is optimal.
Modelling and problem formulation We will formalize the notions of feasible break points, feasible ground truth, consistency between sets of contigs, sufficiency of break points to achieve consistency and the optimiality criterion.
We use a graph theoretic framework. Specifically, we study a directed graph G = (V , E) with m sources S and m sinks T where ∀v ∈ S ∪T, indeg(v) = outdeg(v) and parallel edges between two vertices are allowed. This is used to model a fully contracted De Bruijn graph formed by successive K-mers of the contigs. Vertices V are substrings of the contigs and edges E correspond to potentially misassembled locations on contigs. In our example, the set of vertices is V = {a, b, c, d, e, f , g, h, i} and the set of edges is E = {ab, fb, bc 1 , bc 2 , hc, cd 1 , cd 2 , cg, de, di}. We use subscripts to differentiate parallel edges joining the same vertices. The graph corresponding to our running example is shown in Fig. 3.
Given such a graph G, we note that E is the set of all feasible break points because each edge in the graph corresponds to a potentially mis-assembled location on contigs. A feasible ground truth corresponds to a set of m edge-disjoint source-to-sink trails that partitions the edge set E. For simplicity, we represent a trail as a sequence of the vertices in G, where the edges linking the vertices are ignored. For example, {abcde, fbcdi, hcg} is a feasible ground truth while {abcg, fgde, hcdi} is another feasible ground truth. The set of all feasible ground truths is GT.
We recall that our high level goal is to choose a set of feasible break points R ⊆ E so that the broken contigs are consistent with any feasible ground truth. So, we need to define the notion of broken contigs and consistency between two sets of contigs under the current framework. Let gt ∈ GT, we denote gt\R be a set of trails after the removal of the edge set R. In particular, for the original contig set C ∈ GT, C\R is the set of broken contigs for the set of feasible break points R. For example, if R = {bc 1 , bc 2 , hc} and C = {abcde, fbcdi, hcg}, C\R = {ab, cde, fb, cdi, h, cg}. To capture consistency between two sets of contigs, we use the following definition. Given two sets of trails T 1 , T 2 , we say that T 1 is consistent with T 2 if ∀x ∈ T 1 , ∃y ∈ T 2 s.t. x ⊆ y and ∀x ∈ T 2 , ∃y ∈ T 1 s.t. x ⊆ y . We call R a sufficient breaking set with respect to (C, GT) if ∀gt ∈ GT, C\R is consistent with gt\R. In other words, R is a set of feasible break points that allows the broken contigs to be consistent with any feasible ground truth. Although this notion of sufficient breaking set is a natural model of the problem, it depends on the original contig set C, which is indeed not necessary. Specifically, we show that we have an equivalent definition of sufficient breaking set without referring back to the original contig set. Let us call R a sufficient breaking set with respect to G, Fig. 3 The graph corresponding to our example contig set x 1 = abcde, x 2 = fbcg, x 3 = hcdi is shown. We note the optimal set of break points by the red dotted line or simply a sufficient breaking set, if ∀gt 1 , gt 2 ∈ GT, gt 1 \R is consistent with gt 2 \R.

Proposition 1 R is a sufficient breaking set with respect to (C, GT) if and only if R a sufficient breaking set with respect to G.
Proof The backward direction is immediate because C ∈ GT. We will show the forward direction as follows. Let g 1 , g 2 ∈ GT and we want to show that g 1 \R is consistent with g 2 \R. Since R is a sufficient breaking set with respect to (C, GT), g 1 \R is consistent with C\R. Therefore, Now, we state our optimization criterion. The goal here is to minimize the cardinality of the sufficient breaking set, formally as Eq. 1.
We will show that if we solve Eq. 1 for our running example, the answer is 3. This thus shows that the baseline algorithm of breaking contigs at all starting points (in our example, there are 4 of them) of all long repeats is not optimal.
Proof First, by inspecting the 6 feasible ground truths in GT, we note R = {bc 1 , bc 2 , hc} is a sufficient breaking set with respect to G. Second, we note that the edge set need to disconnect sources and sinks, otherwise, we can find g 1 , g 2 ∈ GT such that g 1 \R, g 2 \R are inconsistent. This means |R| need to be ≥ mincut of the graph, which is 3.

Algorithm development and performance guarantee
Next we describe an algorithm that finds a sufficient breaking set with respect to G. Let us denote a boolean variable b e on each edge e ∈ E, OutEdges(v) are the sets of incoming edges and outgoing edges at v respectively. Prev(v), Succ(v) are the sets of predecessor vertices and successor vertices of v respectively. Our algorithm solves the following minimization problem (Eq. 2) as a proxy to (Eq. 1). where, In other words, it includes either all the incoming edges or all the outgoing edges for every vertices with at least 2 successors and at least 2 predecessors to R. We then seek R with minimum cardinality among the choices available. If G can be decomposed into connected components, we can optimize ( b) independently on each connected component. In our implementation, if the size of the connected component is not too large, we optimize the objective function by exhaustion. Beyond a certain threshold, we simply output a feasible solution without optimizing. Indeed, in our experiments on real data, most of the connected components are not that large and this step typically takes a few minutes. But we remark that for more complex applications, one can further optimize the algorithm. For example, one can first topologically sort the vertices and then use dynamic programming to solve Eq. 2 along the sorted vertices.
We note that the algorithm described gives an optimal solution for our running example. Moreover, we show performance guarantee of the algorithm as follows.

Proposition 3 Solving Eq. 2 gives a sufficient breaking set R if the graph G is fully contracted.
Proof Let R be the set of edges selected by the algorithm. For any two gt 1 , gt 2 ∈ GT, we want to show that gt 1 \R and gt 2 \R are consistent. By symmetry, it suffices to prove that if x ∈ gt 1 \R, then ∃y ∈ gt 2 \R s.t. x ⊆ y. If |x| = 2, it is immediate because every edge other than those in R are covered. If |x| ≥ 3, we will show that it is also true using proof by contradiction. If ∀y ∈ gt 2 \R, x ⊆ y, we can find a sub-trail x = (a 1 , a 2 , . . . , a k , a k+1 ) of x such that ∃y ∈ gt 2 \R s.t. x = (a 1 , . . . , a k ) ⊆ y but ∀y ∈ gt 2 \R, x ⊆ y. This implies ∃a * = a k+1 s.t. (x , a * ) ⊆ z for some z ∈ gt 2 \R. Since the edge (a k , a k+1 ) ⊆ x ∈ gt 1 \R, we know that (a k , a k+1 ) is not in R. Similarly, (a k , a * ) ∈ R because (a k , a * ) ⊆ y ∈ gt 2 \R. But since |Succ(a k )| ≥ 2, the fact that our algorithm does not include (a k , a * ), (a k , a k+1 ) in R implies that |Pred(a k )| = 1, namely Pred(a k ) = {a k−1 }. Note that we are considering a fully contracted graph. So, the fact that a k−1 exists implies that |Succ(a k−1 )| ≥ 2. But our algorithm has not removed edge (a k−1 , a k ). This gives |Pred(a k−1 )| = 1. Inductively, we get |Pred(a i )| = 1∀2 ≤ i ≤ k. But we know that (a k , a k+1 ) ⊆ w for some w ∈ gt 2 \R. Since the edges along (a 1 , . . . , a k+1 ) are not in R, this gives, x = (a 1 , . . . , a k+1 ) ⊆ w ∈ gt 2 \R. This contradicts the assumption about x .

Proposition 4
If the graph G is fully contracted DAG without parallel edges, then solving Eq. 2 returns a sufficient breaking set that is of optimal cardinality, OPT.
Proof It suffices to show that for any sufficient break- Because it is a DAG, it means we can find gt 1 ∈ GT such that ∃x, y, x , y such that (x, v, y) ∈ gt 1 and (x , v, y ) ∈ gt 1 . Now consider gt 2 to be a clone of gt 1 except that it has (x, v, y ), (x , v, y) instead of (x, v, y ), (x , v, y). We note that gt 2 ∈ GT. And because there are no parallel edges, (x, v, y) is not a subset of any of the trails in gt 2 . So, we find two distinct gt 1 , gt 2 ∈ GT such that gt 1 , gt 2 are not consistent. This contradicts the fact that R is a sufficient breaking set.
It is noteworthy that if we are given any set of contigs from any gt ∈ GT, we still obtain the same set of broken contigs after the operation of removal of redundant trails, RemoveRedundant (i.e. we eliminate the contigs in a set A to form a minimal subset B ⊆ A in which ∀x = y ∈ B, x ⊆ y). This can be formalized as follows.
Proof Let s i = RemoveRedundant(gt i \R) for i ∈ {1, 2}. By symmetry, it suffices to prove that s 1 ⊆ s 2 ∀x ∈ s 1 ⊆ gt 1 \R, ∃y ∈ gt 2 \R, such that x ⊆ y. Note that we can also find some x * ∈ s 2 such that y ⊆ x * . This gives x ⊆ y ⊆ x * . However, since we have no redundant trails in s 1 , we get x = x * . Thus x = x * ∈ s 2 .
To apply BIGMAC to real data, we have to implement the described algorithm with some further engineering. This includes methods to tolerate noise, to handle double stranded nature of DNA, and to construct De Bruijn graph from the repeats. Interestd readers can refer to the Additional file 1 : Appendix for these implementation details.

ConfirmBreakPoints
The goal of ConfirmBreakPoints is to utilize the coverage profile S coverage to confirm breaking decisions at potentially mis-assembled locations specified in S filter . For the sake of simplicity, we now consider a string s of length L, and a set of positions Y = {y i } 1≤i≤k of s which are potential mis-assemblies. Furthermore, let us assume that all mis-assemblies are caused by mixing genomes of different abundances. Using Y , we can partition s into segments {s i } 0≤i≤k of lengths respectively as { i } 0≤i≤k . We let x i be the number of reads that start in segment s i , which can be estimated from S coverage . The question is how to classify the points in Y as true misassemblies or as fake mis-assemblies.
One can use heuristics, like comparing coverage depth difference before and after each y i . However, this is not ideal. For example, if we have coverage depth before and after y 1 differing by 1X, we would expect it to be a much stronger signal for true mis-assembly if the lengths 0 , 1 are of order of 100 K rather than of 100 and this cannot be seen by considering coverage depth difference alone. For the case of just two segments of equal length and if we assume the x i 's are independent Poisson random variables, there is a popular statistical test, C-Test [9], that can make the decision. Formally, if x 1 ∼ Poisson(m 1 ) and x 2 ∼ Poisson(m 2 ), then C-Test is a test to decide between the hypothesis of H 0 : m 1 = m 2 against H 1 : m 1 = m 2 . C-Test first considers x 1 + x 2 to compute the decision boundary and it then decides whether to reject H 0 based on x 1 /x 2 and the previously derived decision boundary. The intuition is that x 1 + x 2 corresponds to the amount of data, which determines the confidence of a statistical test. Thus, if x 1 + x 2 is large, a small perturbation of x 1 /x 2 from 1 can still be very significant and can correspond to a confident rejection of H 0 .
However, we still need to think carefully about how to apply C-Test to our problem. One method is to directly apply k independent C-Test on each of the junctions y i . However, that method cannot take advantage of the information from neighboring segments to boost the statistical power at a junction. Therefore, we develop the algorithm IterativeCTest. IterativeCTest performs traditional C-Test but in multiple iterations. Initially, it directly applies k independent C-Test on each of the junctions y i . Some of the segments are merged and we aggregate the counts from the merged segments to continue to the next C-Test at the remaining unmerged junctions in Y . This process is repeated until the algorithm converges. Finally, we use the breaking decisions from IterativeCTest to break the incoming contigs and return the fixed contigs.

Merger: merging assembled contigs
After fixing mis-assemblies, we are ready to study another pillar of BIGMAC: Merger. Merger establishes connectivity among contigs and subsequently makes decisions to extend contigs.

Graph operator
The goal of Graph Operator is to identify candidates for contig extension. It forms and transforms a string graph using the overlap information among original reads and contigs. It subsequently analyzes the graph to find candidates for contig extension. The overall algorithm is summarized in Algorithm 2. G.FindExtensionCandidates() Identify candidates for contig extension 8: return G Mapping We obtain mapping among contigs and reads. This provides the fundamental building block to construct the connectivity relationship among contigs and reads.

FormGraph
The goal of FormGraph is to establish connectivity among contigs. We use contig-read string graph as our primary data structure. Contig-read string graph is a string graph [10] with both the contigs and the reads associated with their endpoints as nodes. The size of the graph is thus manageable because most reads are not included in the graph. Then, we construct an overlay graph (which we call the contig graph) such that the nodes are the contigs with weights on nodes being the coverage depth of contigs. In the contig graph, there is an edge between two nodes if there is a sequence of reads between the associated contigs. With these data structures, we can switch between macroscopic and microscopic representation of the contig connectivity easily.
GraphSurgery GraphSurgery simplifies the contig graph. This includes removal of transitive edge and edge contraction. For nodes u, v, w, if we have edges u → v, u → w and w → v, then we call u → v a transitive edge. We perform transitive reduction [10] on the contig graph to remove transitive edges. Removing these transitive edges prevents us from finding false repeats in the next stage. To improve robustness, there is a pre-processing step before transitive reduction. If the contig w is too short and there are no reads that form head-to-tail overlap between w, u or w, v, then we can have a missing edge for transitive reduction to operate properly. Thus, we add the missing edge (either from u to w or w to v) back when we find contigs and related reads that suggest the missing edge might be there.
An edge u → v is contractable when the outgoing degree of u and the incoming degree of v are both 1. We contract edges to fill gaps. Our experience with Finish-erSC is that data are dropped in the assembler and so reconsidering them as a post-processing step can potentially fill some gaps. However, there is a caveat. In establishing connectivity in contig-read string graph, we only use reads close to each contig's endpoints (as a way to save computation resources), we may miss some legitimate connections. Therefore, very long repeats prevent the detection of linkage of contigs, thereby allow contigs to be erroneously merged. To address that, if there exists contig w that is connected (by some reads) to a region close to the right end of u/left end of v, then we avoid contraction of u → v.
FindExtensionCandidates FindExtensionCandidates identifies candidates for contig extension by identifying local patterns in the contig graph. One form of extension candidates is a pair of contigs that are connected without competing partners. This corresponds to the contigs joined by a contractable edge. Another form of extension is a set of contigs that are connected with competing partners. This corresponds to the contigs linked together in the presence of repeats. In the contig graph, the repeat interior can either be represented as a separate node or not. If the repeat interior is represented as a separate node, the local subgraph is a star graph with the repeat node at the center. Otherwise, the local subgraph is a bipartite graph consisting of competing contigs. After identifying the contigs associated with a specific repeat, we can then merge contigs in the next stage.

Contig extender
After the operations by Graph Operator, we have extracted the potential contig extension candidates from the contig graph. It remains to decide whether and how to merge the corresponding contigs. In a high level, Contig Extender aims at solving the Contig Merging Decision Problem.

Contig merging decision problem Given a set of incoming contigs I and a set of outgoing contigs O whose coverage depth and read connectivity information is known. Decide how to form an appropriate bipartite matching between I and O.
While we do not intend to rigorously define the statement of Contig Merging Decision Problem now, it is clear that appropriate matching corresponds to one that achieves high accuracy. Contig Extender carefully analyzes the read connectivity and contig coverage to make decisions on merging contigs. In the coming discussion, we first study an effective heuristic that captures the essence of the problem. After that, we will study how to rigorously define the Contig Merging Decision Problem in a mathematical form and suggest an EM-algorithm in solving that.
Heuristic in solving the contig merging decision problem When the cardinality of the set of incoming contigs I and the set of outgoing contigs O are both 1, a natural decision is to merge them. Thus, the focus here is to study the case when |I| > 1 or |O| > 1. We select the reads that uniquely span one contig a in the incoming set and one contig b in the outgoing set. If there are multiple such reads, then we decide that a, b should be joined together provided that there does not exist any paths of reads that lead a to other contigs in the outgoing set and similarly for b. Moreover, we perform similar tests using contig coverage. If the coverage depth between two contigs is very close, they will be declared to be a potential merge pair. Then, we test whether there are any close runner-ups. If not, they should be merged. To combine the decisions made using spanning reads and coverage depth, we have a subroutine that discards all conflicting merges. We find that this heuristic for decision making works quite well in our datasets. However, in the coming discussion, we will study how to make the extension decisions in a more principled and unified manner.
General framework in solving the contig merging decision problem The challenge for the Contig Merging Decision Problem is the tradeoff for many physical quantities (e.g. abundance, edit distance of reads, noise level, number of spanning reads, etc). We address this by defining a confidence score based on parameter estimation. For simplicity of discussion, we assume that k is the cardinality of both the set of incoming contigs and the set of outgoing contigs. The goal is to find the best perfect matching with respect to a confidence score.
Let M be a perfect matching of contigs among incoming and outgoing groups I and O. Under M, there are k merged contigs. Let the observation be the set of related reads X = {R i | 1 ≤ i ≤ n}. We define the parameters θ = {(λ j , I j ) 1≤j≤k }, where λ j is the normalized abundance of contig j and I j is genomic content of the contig j. Note that 1≤j≤k λ j = 1. So, the parameter estimation problem can be cast as s M = max θ P θ (X | M), where s M is the confidence score of the matching M. Finally, the best perfect matching can be found by comparing the corresponding confidence score.
Note that we have hidden variables Z = (Z i ) 1≤i≤n which indicate the contigs that reads X are extracted from (i.e. Z i ∈ {1, 2, . . . , k}). If we assume the length of the contig j to be j and q to be the indel noise level (i.e. probability of 1 − 2q to be remained unaltered at each location), then we can use an EM-algorithm to obtain an estimate of θ. Specifically, the expected value of the log likelihood where M j (R, I (t) ) = δ j=argmin j d (R,I (t) j ) is the assignment of reads to the most similar contig (with tie breaking using λ (t) ), d (A, B) is the best local alignment score, j ) 1≤j≤k are the genomic contents of the contigs at iteration t and λ (t) = (λ j ) 1≤j≤k are the estimated abundances at iteration t. By maximizing the log likelihood function with respect to θ (t+1) , we have the update formulas as Note that the update of I j * requires multiple sequence alignment. In general, the problem is NP-hard [11]. However, for noisy reads extracted from several contigs, the problem is not as difficult. For example, in the case of pure substitution noise, an efficient optimal solution is a column-wise majority vote. Despite the elegance and feasibility of this approach, it is still computationally more intense than the heuristic. Therefore, in our implementation of BIGMAC, the heuristic is the default method used in Contig Extender. But we also provide an experimental version of the EM-algorithm which can be used when users specify -option emalgo=True in their commands.

End-to-end experiments Synthetic data
We verify that BIGMAC correctly improves the incoming contigs using the following synthetic data. We generate two synthetic species of different abundances ( 2 7 , 5 7 ). They are composed of random nucleotide sequences of length 5 M each and share a common segment of length 12 K. We uniformly sample indel noise corrupted reads of length 6 K from the genomes, with coverage depth of 20X and 50X respectively. We also artificially construct misassembled contigs by switching the genome segments at the 12 K repeat.
The contigs and the reads are passed through BIGMAC. BIGMAC breaks the contigs at the mis-assembled point and joins them back correctly. One can see the sample run on [12]. This is also the example that we discuss in the top-down design of BIGMAC.

Real data
We test the performance of BIGMAC in improving metagenomic assembly on PacBio only data. We use different datasets of different characteristics. Dataset 1 consists of a mock community of 5 species [13], with genomes of high similarity. Dataset 2 consists of a mock community of 23 species [14], with genomes of diverse abundances. We also remark that we have tested BIGMAC on a third PacBio only dataset (Dataset 3): a real experiment involving Desulfuromonas biwabikus, D. soudanensis and some other unknown genomes. We note that we know the complete ground truth for the metagenomes in both Dataset 1 and 2 but only know part of the ground truth for Dataset 3. We take the output of HGAP and use the raw reads to improve them using BIGMAC. We observe that in these datasets, BIGMAC reduces the number of mis-assemblies while maintaining/increasing N50 and N75. The results of BIGMAC is summarized in Table 1, where the quantity of mis-assemblies is obtained from the QUAST reports. By default, QUAST's statistics are based on contigs of size >= 500 bp. Interested readers can refer to the Additional file 1: Appendix for more details of the assessment. The data is provided in our online distribution and users can reproduce the results by issuing a single command python reproduce.py.

Synthetic data
We use the synthetic data in "Synthetic data" section to evaluate and benchmark BIGMAC, FinisherSC [4], SSPACE_LongRead [15] and PBJelly [16]. BIGMAC is the only tool among the tested tools that fix the misassembled contigs and merge them back correctly. Other tested tools output the same mis-assembled contigs.

Real data
We perform end-to-end testing to compare performance of different tools. The comparison is shown in Table 2. BIGMAC shows the largest N75/# Mis-assemblies on all three datasets and largest N50/# Mis-assemblies on two Table 1 Performance evaluation of BIGMAC on several mock communities out of three datasets. Indeed, in the only dataset that BIG-MAC does not have the largest N50/#Mis-assemblies, the number of contigs(i.e. L50) beyond N50 is 7. Due to the small number of contigs, this particular measurement on that dataset may not be significant. We also remark that BIGMAC is the only tool that improves contiguity (N50 and N75) and the number of mis-assemblies.

Simulation and testing on independent components
We perform micro-benchmarking on individual components of BIGMAC. The settings and results are summarized as follows.

Analysis of LocatePotentialMisassemblies
We test our break point finding algorithm used in LocatePotentialMisassemblies of Breaker on the synthetic data of x 1 = abcde, x 2 = fbcg, x 3 = hcdi discussed in the previous section. The algorithm succeeds in identifying the minimum number of break points as 3. Also, it succeeds in identifying the minimum number of break points as 2 in the presence of double stranded DNA, in the test case of x 1 = abcd, x 2 = ec b f , where b , c are the reverse complement of b, c respectively.

Analysis of ConfirmBreakPoints
We test IterativeCTest used in ConfirmBreakPoints of Breaker on synthetic data. We simulate mis-assemblies and variation on abundances by generating a sequence of Poisson random variables and compare the performance of the algorithms on the simulated data as follows. We generate a sequence of 100 numbers by 100 independent Poisson random variables. The Poisson random variables have parameters of either 20 or 50. To select the parameters, we simulate 100 steps of a two-states Markov chain with transition probability of 0.1. We then evaluate the performance of C-Test and IterativeCTest on finding the true transition points, which correspond to the junctions of mis-assemblies. Taking average from 100 rounds of simulation, the recall of both C-Test and IterativeCTest are 0.93, meaning that they both can identify almost all transition points. C-Test has precision of 0.75 while the precision of IterativeCTest is of 0.87, meaning that Itera-tiveCTest can avoid more fake mis-assemblies.

Analysis of merger
We compare Merger with other tools on synthetic data as follows. We use a synthetic contig set {x 1 , x 2 , r, y 1 , y 2 } where the ground truth genomes are (x 1 , r, y 1 ), (x 2 , r, y 2 ). The coverage depth of (x 1 , y 1 ) and (x 2 , y 2 ) are 20X and 50X respectively. We pass the reads together with the contig set to FinisherSC, PBJelly, SSPACE_LongRead to perform scaffolding. We note that BIGMAC is the only tool the can scaffold the contigs correctly into 2 contigs by using the abundance information among the tested tools. Other tools either do not change the input or simply merge r with some of {x 1 , x 2 , y 1 , y 2 }, resulting in 4 contigs.

Runtime of BIGMAC
The runtime of BIGMAC is summarized in Table 3. We use 20 threads to run the tool on a server computer. The server computer is equipped with 64 AMD Opteron(tm) Processor 6380(8 cores) with frequency 2500 MHz and 362 GB RAM. We note that the majority of the time is spent on alignment of contigs and reads by MUMmer.

Discussion
There are two main implications from the experiments performed. First, we show that post-processing assemblies is a feasible alternative in improving assembly quality to building another assembler from scratch. This is demonstrated by BIGMAC showing simultanous improvement in terms of number of mis-assembly and contiguity. We note that this is a non-trivial feature because all other tested tools fail to achieve it. Second, BIGMAC is competitive when compared to the existing post-processing tools. In particular, it shows better N75/# Mis-assemblies than all other tested tools in all tested datasets. Moreover, BIGMAC is also the only tool that can handle abundance information, which makes it an attractive candidate for improving metagenomic assembly.

Future work
We remark that the creation of BIGMAC sheds light on many interesting future direction. For example, it would be interesting to apply similar ideas to hybrid data, which has a lot of potential in the context of metagenomics. Moreover, it would also be useful to try BIGMAC and other post-processing methodology on more real data to better characterize the approach.

Conclusion
We study an agile approach in developing de novo metagenomic assemblers: post-processing metagenomic assemblies with original input data. BIGMAC demonstrates the effectiveness of the post-processing approach in improving the quality of metagenomic assemblies using long reads. BIGMAC thus serves as an example that developing post-processors is a good alternative to building end-toend assemblers, which typically involves more engineering efforts.

Additional file
Additional file 1: Supplementary materials include implementation details, data analysis, theoretical analysis and commands used to run different tools. (PDF 492 kb)