ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

Abstract Summary The ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. Supplementary information Supplementary data are available at Bioinformatics online.

decreasing. This setting is more computationally intensive, especially for short contigs, and we observe that the "majority increasing/decreasing" approach works similarly well. Contigs with an associated contig minimizer run of length one or those that do not satisfy the specified orientation threshold are not assigned an orientation, and will not be incorporated into a path.
We also use the minimizer positions of the reference assembly in the minimizer graph to determine gap sizes (number of 'N's) between scaffolded target contigs. The contig minimizer runs are joined together by edges in the graph created based on the reference assembly. For a given pair of contigs merged by ntJoin (cA, cB), there is an associated edge between terminal minimizers of these contigs in the minimizer graph (mA, mB) (Supplementary Figure S3). Since this edge is connecting two different contigs in the target assembly, these minimizers must be adjacent in a sequence in the reference. Therefore, the estimated gap size is calculated as: Calculating a, b in the formula above compensates for the distance from the terminal minimizer in the target contig to the end of the contig, as the terminal minimizers are not always at the very end of the contigs. If multiple reference assemblies are being used, the gap distance is the mean of the estimations from each supporting reference assembly.

Misassembly detection and correction
As shown in Supplementary Figure S1, if there is a misassembly in the input target contigs (shown in red), the minimizers of a contig may not fully map to the same region of the reference, causing a branching point in the minimizer graph (node with degree > 2). In that case, as long as the reference assembly has a higher user-specified weight than the target assembly, the graph filtering step will remove the edges supported by only the target assembly. This causes different regions of the misassembled contigs to be incorporated into different paths. Thus, ntJoin will cut the input contig at that putative misassembly point, and the regions of the broken contig will then be incorporated into the correct position, as suggested by the minimizer graph. See Supplementary Figure S2 for more details.
We subsequently used ntJoin to scaffold a short read H. sapiens assembly using a draft long read H. sapiens assembly as the reference. For this test, we used the H. sapiens (NA12878) ABySS assembly scaffolded with MPET data as the target assembly and the Shasta assembly with and without polishing as the reference assembly (v1.0.2; t=4, target_weight=1, reference_weights='2', n=1) (Table S1).
To compare the results of ntJoin with existing reference-guided scaffolding tools, we ran each ntJoin scaffolding experiment described above using Ragout (Kolmogorov, et al., 2018) and Ragoo (Alonge, et al., 2019). Ragout was run using default parameters, and the input 'hal' alignment generated with Progressive Cactus (Armstrong, et al., 2019). To generate the alignments in a reasonable time frame, 64 threads were used for the human Progressive Cactus alignments, while four threads were used for the C. elegans alignments. Ragoo uses minimap2 (v2.17-r941) (Li, 2018) for the alignment stage, and was run using the chimeric contig detection option (v1.1; -t 4 -b -C). We found that when using a draft assembly as the reference, Ragoo omitted some sequences in the output files. To account for this, we tallied the sequences missing from the output scaffolds file, and rescued those sequences prior to subsequent analysis.
To calculate the true positive rate (TPR), positive predictive value (PPV) and F-score metrics as described in Hunt, et al. (2014) and Di Genova, et al. (2018), for each experiment assessed, we created a fasta file where each entry was a join made by the scaffolder under scrutiny. When making these fasta files, we respected any cuts made by the scaffolder as well as the chosen gap lengths. Then, we assessed the correctness of the joins using QUAST (v5.0.2, -t 24 --fast --scaffold-gap-max-size 100000), using default minimum contig length and alignment length settings for the C. elegans runs and "--min-alignment 115" and "--min-contig 450" for the human runs. Then, the detailed misassembly report was parsed to identify misassemblies that were due to the joins made by the scaffolder, as opposed to misassemblies that were present prior to scaffolding. If we define P as the number of potential joins, TP as the number of correct joins and FP as the number of incorrect joins: Here, the number of placed contig regions refers to the number of contig regions scaffolded accounting for any cuts that the scaffolder makes during its execution.

Supplementary Figures
Supplementary Figure S1. The ntJoin algorithm. In this example, the target assembly indicated in blue is being scaffolded using the reference assembly, indicated in purple. (a) First, ordered minimizer sketches along with their base pair coordinates (indicated as numbers below vertices) are generated for each input assembly. (b) Next, minimizers that are repetitive (seen multiple times within the assembly) or unique to an assembly (not found in all input assemblies) are filtered from the sketches. For illustrative purposes, the grey vertical dotted lines indicate identical minimizers between the input target and reference assemblies. (c) The filtered, ordered minimizer sketches are then used to create an undirected minimizer graph, where the nodes are minimizers and edges are created between minimizers that are adjacent in at least one of the assembly sketches. The edge weights are the sum of the weights of the assemblies supporting the edge. For example, in the schematic, if the incident minimizers are only adjacent in the reference, the edge has weight=2, but if the minimizers are adjacent in both the reference and the assembly, the edge has weight=3. (d) The graph is first filtered globally, removing edges with a weight less than the specified threshold (n, default 1). Next, the edges incident to nodes with a degree > 2 are filtered with an increasing edge weight threshold until the degree is <= 2. (e) This graph filtering results in linear paths of minimizers. Because we track the origin contig and position of each minimizer, these minimizer paths can also be seen as a series of runs of minimizers from the same contig ("contig minimizer runs"). The order of the contig minimizer runs defines the contig ordering in the output paths. Whether a run of minimizer positions is largely increasing (positive) or decreasing (negative) is used to infer the relative orientation of each contig. If there is only one minimizer for a contig in the path, it cannot be oriented. (f) The contigs are scaffolded based on the ordered and oriented contig paths.

Supplementary Figure S2. Example of misassembly detection and correction in ntJoin.
In this example, the colours (brown, blue) indicate different chromosomes. There is a misassembly in the target assembly, as evident from the chimeric colouring on the first assembly scaffold. (a, b) First, ntJoin generates ordered minimizer sketches for both sequences, and constructs an undirected graph where the nodes are minimizers, and edges are created between adjacent minimizers. The minimizers from the misassembled segment of the target assembly are indicated by a green outline. (c) The misassembly creates a branch in the undirected graph. The incident edges of that branch node are filtered with an increasing edge weight threshold. For the branch node, the incident edge found only in the target assembly is filtered out (weight = 1), while the incident edge found only in the reference is retained (weight = 2). (d) After the edge filtering, two linear paths remain, which are then translated back to sequence, resulting in the correction of the misassembly. Figure S3. Inferring gap distances in ntJoin. In this example, contigs cA and cB in the target assembly with terminal minimizers mA, mB are joined based on these minimizers being adjacent in the reference sequence. a and b are the distances (in bp) between the terminal minimizers of cA and cB and the ends of the contigs. The gap distance is calculated as:

Supplementary
Supplementary Figure S4. Benchmarking tools. Correctness, contiguity and benchmarking results of scaffolding an ABySS C. elegans short read assembly using the C. elegans Bristol N2 reference genome with ntJoin, Ragout and Ragoo.   Figure S10. Assembly correctness. Jupiter plot (Chu, 2018) showing consistency between various human genome assemblies and the reference human genome (GRCh38).

Supplementary
Supplementary Figure S11. Effect of the "cut" and "no cut" parameters. Correctness, contiguity and benchmarking results of scaffolding various (a) C. elegans and (b) Human (NA12878) assemblies with ntJoin, Ragout and Ragoo in modes which can cut input contigs (corresponding to all other experiments shown in this paper) and where contigs are not cut. For the C. elegans test, an ABySS assembly was scaffolded with the C. elegans (Bristol N2) reference. For the human tests, a NA12878 ABySS assembly (scaffolded with MPET data) was scaffolded with the human reference genome (GRCh38), a NA12878 Shasta assembly polished with ntEdit was scaffolded with the human reference genome and the NA12878 ABySS assembly (scaffolded with MPET data) was scaffolded with the Shasta assembly polished with ntEdit.

Supplementary Tables
Supplementary Table S1. Reference and draft assemblies used in our experiments.  Table S8. Benchmarking results of scaffolding a human (NA12878) ABySS short read assembly (scaffolded with MPET data) with ntJoin using the human reference genome GRCh38. The assembly shown in Figure 1 is bolded.  Table S16. Assembly statistics of baseline genome assemblies of Crocodylus porosus (saltwater crocodile), Gavialis gangeticus (gharial crocodile) and Alligator mississippiensis (American alligator). The NG50 length metric is based on an estimated genome size of 2.7 Gbp (St John, et al., 2012), and the "tetrapoda" lineage was used for the BUSCO analysis (Simão, et al., 2015).