Filling gaps of genome scaffolds via probabilistic searching optical maps against assembly graph

Background Optical maps record locations of specific enzyme recognition sites within long genome fragments. This long-distance information enables aligning genome assembly contigs onto optical maps and ordering contigs into scaffolds. The generated scaffolds, however, often contain a large amount of gaps. To fill these gaps, a feasible way is to search genome assembly graph for the best-matching contig paths that connect boundary contigs of gaps. The combination of searching and evaluation procedures might be “searching followed by evaluation”, which is infeasible for long gaps, or “searching by evaluation”, which heavily relies on heuristics and thus usually yields unreliable contig paths. Results We here report an accurate and efficient approach to filling gaps of genome scaffolds with aids of optical maps. Using simulated data from 12 species and real data from 3 species, we demonstrate the successful application of our approach in gap filling with improved accuracy and completeness of genome scaffolds. Conclusion Our approach applies a sequential Bayesian updating technique to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, our approach achieves higher accuracy than the existing “searching by evaluation” strategy that relies on heuristics. Furthermore, unlike the “searching followed by evaluation” strategy enumerating all possible paths, our approach prunes the unlikely sub-paths and extends the highly-probable ones only, thus significantly increasing searching efficiency. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04448-2.

genome repeats longer than sequencing reads always create ambiguities in path finding, making assembly approaches yield only separate paths (called contigs) rather than the complete genomes [2]. The longer reads by third generation sequencing [3,4], and long-distance linking information by pair-end, mate-pair, or mapping technologies, will definitely help genome assembly methods to resolve the ambiguities incurred by repeats [5]. The study [6] provides an elaborated review on the methodological progresses and perspectives in the integration of short-range and long-range information for improving assembly contiguity.
Among the technologies that provide long-distance information across repeats, optical mapping has its unique advantage in measuring long genome fragments. For example, the BioNano Saphyr platform can measure genome fragment up to 2 megabases [7]. Unlike genome sequencing technologies, optical maps record locations of specific enzyme recognition sites, say GCT CTT C and GAA GAG C for enzyme BspQI, along a genome. By identifying these sites from contigs, we can easily align contigs onto optical maps, and then order them into scaffolds [8]. However, the short contigs that contain insufficient enzyme recognition sites usually cannot be reliably aligned onto optical maps, thus creating a variety of gaps in scaffolds and making them far from complete genomes. Filling these gaps with nucleotide sequence will considerably improve the completeness of genome assembly.
A great variety of approaches have been proposed for filling gaps directly using sequencing reads, including SOAPdenovo [9], GapFiller [10], GMCloser [11], PBJelly [12] and LR_Gapcloser [13]. These approaches, however, are infeasible for filling gaps of the scaffolds obtained via optical maps since these gaps are often much longer than sequencing reads. To fill these large gaps, Nagarajan et al. proposed to use contig paths in assembly graph instead of the short sequencing reads [14]. Here, assembly graphs refer to the product of assembling sequencing reads using graph theory, which contains contigs as nodes and connections among them as edges.
Recent progresses to improve assembly contiguity also include Bionano solve pipeline, BiSCoT [15], and Novo&Stitch [16]. Briefly speaking, Bionano solve pipeline uses a module called "Hybrid Scaffold", which sets the identified gaps with N-base rather than filling them using genome sequence. BiSCoT aims to resolve the N's gap between contigs inserted by Bionano scaffolding through merging two contings that share a genomic region. Novo&Stitch proposes a novel method that uses optical maps for accurate assembly reconciliation.
To fill gaps, we can choose a contig path that connect two boundary contigs of a gap, and then uses the corresponding nucleotide sequences. Thus, the successful gap-filling relies on two steps: (1) searching contig paths in assembly graph, and (2) evaluating the consistency between contig paths and optical map of the gaps of interest [17][18][19]. The two steps, i.e., searching and evaluating contigs paths, can be combined in various ways. For example, OMACC [17] employs the "searching followed by evaluation" strategy. Specifically, for the two boundary contigs of a gap, OMACC first searches assembly graph for all possible contig paths to connect them. Next, OMACC evaluates each possible contig path in terms of the difference between path length and gap size and selects the best path to fill the gap. By rescaling optical maps and estimating the number of repeat copies within gaps, OMACC achieved better accuracy than the previous studies [14,17].
In contrast to OMACC, AGORA employs the "searching by evaluation" strategy [18]. That is, AGORA uses a modified depth-first search (DFS) to identify the most likely contig path. At each search step, AGORA selects an edge to extend the current subpath according to several heuristics, say the decreasing order of edges, the consistency between this edge's in silico map to the experimental optical maps. AGORA uses the first found contig path to fill a gap. These heuristics could greatly improve searching efficiency; however, they might also lead to potential errors in genome reconstruction.
In summary, the "searching followed by evaluation" strategy has high accuracy but low efficiency, whereas the "searching by evaluation" strategy has high efficiency but low accuracy. Thus, the tradeoff between accuracy and efficiency remains a challenging task.
In this study, we propose an accurate and efficient approach to gap filling. Unlike the existing "searching by evaluation" methods heavily relying on heuristics, our approach uses a stochastic model to calculate the similarity between optical maps and contig paths. Using the calculated similarity to guide path-finding, our approach achieves higher accuracy than the existing approaches using heuristics. In addition, unlike the "searching followed by evaluation" methods, our approach maintains only a small set of highly probable sub-paths and prunes the unlikely ones, thus significantly improving efficiency.
We evaluated nanoGapFiller on simulated optical maps of 12 species and real optical maps of 3 species. On the simulated data sets, nanoGapFiller fills the gaps with high accuracy in minutes. Moreover, nanoGapFiller always fills more gaps than OMACC. On real data sets, OMACC cannot fill any gap, while nanoGapFiller successfully fills all of the identified gaps. We also showed that our pruning strategy could significantly reduce running time without sacrificing accuracy. Thus, nanoGapFiller should benefit various downstream genomic studies by improving the completeness of genome reconstruction with aid of optical maps.

Experiment setting and evaluation criteria
We evaluated accuracy and efficiency of nanoGapFiller on simulated optical maps of 12 species and real optical maps of 3 species. The real optical maps were acquired using BioNano Iris platform: For E. coli, P. putida and S. coelicolor, the number of optical maps are 8644, 15000 and 17422 respectively, and the coverage are 336, 435 and 354 respectively. The simulated optical maps were generated using an in-house simulator to extract enzyme recognition sites from reference genomes. We also applied another simulator OMsim [20] that adopts different error model from our in-house simulator.
The gaps of scaffolds were identified as follows: Using the reference genome of a species, we first generated simulated next-generation sequence (NGS) reads using read simulator ART [21], and then assembled these reads into assembly graph using genome assembler SPAdes [22]. Each simulated datasets has read length of 150, coverage of 50. Next, we aligned contigs onto optical maps, and further ordered the contigs into scaffolds according to the alignments. Finally, the unaligned parts of scaffolds were identified as gaps. To make thorough evaluation, we adopted two types of alignment methods: (1) SOMA2 used by OMACC [23], and (2) refAligner used by BioNano Solve package. Compared with SOMA2, refAligner generally reports fewer alignments with higher precision, and thus generates longer and more accurate gaps.
We assessed the quality of gap filling through calculating two levels of similarity between gap filling results and the corresponding regions in reference sequences: (1) Contig path similarity (CPS): the number of contigs shared by the filled gaps and the real contig paths in reference genomes. (2) Nucleotide sequence similarity (NSS): we further calculated the base-level similarity NSS = 2 × L c /(L r + L f ) , where L f and L r denote the length of gap filling results and corresponding reference sequence, respectively, and L c denotes the longest common string between them.
In the study, we compared nanoGapFiller with the state-of-the-art software OMACC. We did not perform comparison with AGORA since it is now out of maintenance. Table 1 shows the accuracy of gap filling results on the E. coli genome. As shown in this table, nanoGapFiller successfully filled all of these 23 gaps with NSS over 99%. In contrast, OMACC could only fill 11 out of the 23 gaps and failed to fill the long gaps with over 15 contigs. Even for these 11 gaps, OMACC's quality is not always high. For example, for the gap 252216r-252238, its reference sequence consists of 7 contigs of 226 nt; however, OMACC filled this gap with 31 contigs of 1546 nt, which has considerably low similarity with the reference sequence (NSS: 25.51% ). On the other 11 species, the gap filling results again suggest the superiority of nanoGapFiller in terms of accuracy and coverage (Additional file 1: Tables 1-11 and Additional file 1: Fig. 1). As shown in Additional file 1: Tables 14, 15, and 16, nanoGapFiller also shows better performance than Novo&Stitch. As a concrete example, we showed in Fig. 1 the filling process of the gap 781738-781976r of S. coelicolor. There are two contig paths connecting the beginning site and ending site of the gap: one path contains the contig 777124 while the other path contains its reverse complement 777124r. OMACC explores the distance between the beginning site and ending site only, and thus cannot identify which path matches better with the corresponding optical map. In contrast, nanoGapFiller utilizes the enzyme recognition sites in the intermediate contigs 777124 and 781726r. Specifically, both 777124 and its reverse complement 777124r contain two sites; however, the locations of these sites differ greatly in the two contigs. nanoGapFiller exploited this difference and thus correctly identified the contig path that fills the gap.

Evaluating accuracy of gap filling
We further investigated the accuracy of nanoGapFiller on real optical maps of three species (Table 2, Additional file 1: Tables 12, 13). As shown in Table 2, only 9 gaps were identified when using SOMA2 as alignment method on E. coli species, which is less than those identified on the simulated optical maps. OMACC successfully filled 5 out of 9 gaps but failed at the other 4 gaps with over 15 contigs. In contrast, nanoGapFiller filled all of these 9 gaps with considerably high accuracy (NSS over 96%). Venn graphs suggest the superiority of nanoGapFiller in terms of coverage on these 3 species (Fig. 2). Table 1 Filling the gaps identified using simulated optical maps of E. coli genome When using refAligner to align contigs onto optical maps, only 4, 2, and 1 gaps were identified for E. coli, S. coelicolor, and P. putida species, respectively ( Table 3). The longest gap has 875Knt. OMACC failed at all of these 7 gaps. In contrast, with only one exception (252312r-252486r), nanoGapFiller successfully filled all gaps with nucleotide sequence highly similar to the reference genome (NSS over 99%). We also evaluate nanoGapFiller using OMBlast [24] as alignment tool. As shown in Additional file 1: Table 18, a total of 23 gaps are identified. Despite that these gaps are different from the gaps when using SOMA2 as alignment tool (Additional file 1: Table 17), nanoGapFiller can still successfully fill these gaps with significant gap filling performance (NSS over 97%).  In addition to evaluating our approach on simulated optical maps generated by inhouse simulator, we also repeated the evaluation process on the optical maps generated using OMsim that adopts a different error model. As shown in Additional file 1: Table 19, a total of 8 gaps were identified and for 7 out of the 8 gaps, nanoGapFiller achieves accurate gap filling with NSS exceeding 97%. These results clearly demonstrate that even using simulators with different error models, nanoGapFiller can still reliably accomplish gap filling.
Hi-C scaffolding [25,26] is a promising approach that bridge and order contigs through exploiting the contact frequencies between pairs of loci [27]. Here, we compare nanoGapFiller with 3D-DNA [28], a software for Hi-C scaffolding, using the Hi-C data downloaded from NCBI GEO (GSM2870416, GSM2870417) [29]. As shown Additional file 1: Table 20, 3D-DNA achieves largest contig, total length and N50 of 4375178 bp, 4637496 bp and 4375178 bp, respectively, which is higher than that of nanoGapFiller (894614 bp, 4597570 bp and 785645 bp, respectively). However, 3D-DNA simply fills the gaps with N-bases rather than genome sequence. Thus, we further calculate NA50 where the contigs are replaced with the blocks that can be aligned to the reference. nanoGap-Filler achieves an NA50 of 785645 bp, which is much higher than 3D-DNA (438708 bp).

Evaluating efficiency of gap filling
In this section, we analyzed the running time of nanoGapFiller. Theoretically, the probabilistic search procedure takes O(m|E|) times, where m denotes the number of sites in gaps, and |E| denotes the number of edges in the site graph. As shown in Table 4, for 11 out of 12 species, nanoGapFiller takes only minutes on an ordinary personal computer. For A. vari, the gaps contain 4,917,178 nt in total, and the site graph contains 135,667 edges, thus leading to an expensive time cost (71,067.07 s).
One of the key points of our approach is pruning the unlikely sub-paths when the matching probability is below MinimalMatchingProbability. As shown in Table 5, nanoGapFiller uses 2123 s when no pruning is applied; in contrast, it takes only 88 s when setting MinimalMatchingProbability as 10 −5 . On the other side, the gap filling results nearly never change at different settings of the pruning threshold Minimal-MatchingProbability (Table 6). Together, these observations clearly suggest that our approach perfectly balances the accuracy and efficiency in gap filling.

Improvement of completeness of genome scaffolds
Finally we examined the improvement of completeness of genome scaffolds with gaps filled. As shown in Table 7, before filling gaps, the contigs are relatively short for A. vari species (N50: 64,556 nt). After filling the gaps using OMACC, the scaffold N50 increased to 78,980 nt. In contrast, after filling gaps using nanoGapFiller, the scaffold  Real optical maps refAligner 5953 1620 1227 941 Table 6 The quality of filled gaps reported by nanoGapFiller at different settings of pruning threshold MiniMalMatchingProbability Here the gaps are identified using real optical maps and alignment method refAligner. The quality is measured using baselevel similarity (NSS) between the filled gaps and the corresponding reference genome sequence  N50 increased to 7,589,422 nt, which is remarkably longer than that was reported using OMACC. We could observe similar results on other 11 species and real datasets (Tables 8 and 9).
To acquire more detailed evaluations, we have further applied Quast [30] to calculate multiple metrics of the assembly results (Additional file 1: Tables 14, 15, and 16).

Discussion
In this study, we present an efficient and effective approach for fill gaps of scaffolds with aid of optical maps. Using probabilistic search, our approach perfectly balances the accuracy and efficiency of gap filling. The performance of our approach has been clearly demonstrated by the results on a variety of species using both simulated and real optical maps. For large genome, the current version of nanoGapFiller suffers from the limitation that it generates a large size site graph which poses high memory requirement. How to improve our approach to reduce memory requirement remains one of the future studies.

Conclusion
In conclusion, nanoGapFiller can effectively improve the contiguity of genome assembly. We expect that our approach, with potential extensions, can greatly facilitate improving completeness of genome assembly.

Notations
In genome sequencing and assembly, a contig refers to a contiguous nucleotide sequence resulting from assembly of sequencing reads, whereas a scaffold refers to a series of contigs separated by gaps of estimated length.
Unlike genome sequence reads, an optical map records locations of specific enzyme recognition sites along a molecule of DNA. Specifically, for a molecule consisting of n recognition sites s 1 , s 2 , · · · , s n , optical maps count the number of nucleotide bases between s i and s i+1 for 1 ≤ i ≤ n − 1 , which is denoted as d(s i , s i+1 ) . For example, the molecule GCT CTT CAC GCT CTT CAC TGC TCT TC has three appearances of the enzyme recognition site GCT CTT C, and the corresponding optical map records the distance between these sites, i.e., d(s 1 , s 2 ) = 9 , d(s 2 , s 3 ) = 10 . In the study, we write a site sequence as s b · · · s e , where the symbol ' · · · ' represents the intermediate sites, and s b and s e denotes the beginning and ending site of the sequence, respectively.
Most genome assembly approaches utilize graph theory to guide assembly and finally generate an assembly graph, which contains contigs as nodes and connections among them as edges. To accelerate searching optical maps against assembly graph, we transform assembly graph into site graph as follows: from the component contigs of the assembly graph, we first identify all appearances of the enzyme recognition sites. Next, we use these sites as nodes, and connect the neighboring sites with edges. Here, we say two sites are neighbors if one site can be directly reached from another one by following a contig path in the assembly graph. Each edge in a site graph is associated with a distance to represent the number of nucleotide bases between the two corresponding sites (Fig. 3).

Workflow of nanoGapFiller
nanoGapFiller takes experimental optical maps and genome assembly graph as input and generates scaffolds with gaps filled as output. As shown in Fig. 4, the workflow of nanoGapFiller mainly consists of the following three steps: (1) Scaffolding and locating gaps: Initially, nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment. Note that some regions of optical maps often fail to align with any contig, thus forming gaps in scaffolds. These gaps, repre- Fig. 3 An example of the alignment between optical map and contig path. a An alignment corresponding to the generating of x 1 · · · x 5 from s 1 · · · s 5 , where < x 1 , s 1 > , < x 2 , s 2 > , < x 4 , s 3 > and < x 5 , s 5 > are matching sites, while s 4 is a missing site and x 3 is a false-positive site. b The formal description of the alignment, where the symbol '-' represents an insertion or Deletion Fig. 4 Overall pipeline of nanoGapFiller.
Step 1. Initially, nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment. Note that some regions of optical maps often fail to align with any contig, thus forming gaps in scaffolds. Here, we identified a gap with site sequence x 3 x 4 x 5 x 6 .
Step 2. To fill this gap, nanoGapFiller searches in assembly graph the contig path (shown in red) that matches best with the site sequence x 3 x 4 x 5 x 6 .
Step 3. nanoGapFiller fills the gap with the nucleotide sequence of the best-matching contig path c 1 c 3 c 6 c 10 sented as Ns rather than normal nucleotide bases A/C/T/G, might be thousands of bases long.
For each gap, we record three features, namely, beginning site, ending site, and the site sequence excerpted from the corresponding unaligned region of an optical map. Take the gap shown in Fig. 4 as an example, its beginning site and ending site are x 3 and x 6 , respectively, and its site sequence is x 3 x 4 x 5 x 6 . (2) Finding the contig path matching best with gaps: To fill a gap of scaffolds, nanoGap-Filler searches assembly graph for the contig path that matches best with the site sequence of the gap. For this aim, nanoGapFiller uses a stochastic model to measure the similarity between a site sequence and any possible contig path, and then uses the probabilistic search technique to efficiently identify the contig path with the highest similarity. The details of the stochastic model and the probabilistic search technique will be described in later subsections. (3) Filling gaps of scaffolds: Finally, nanoGapFiller fills the gaps of scaffolds using the nucleic base sequence of the best-matching contig paths. For example, the gap shown in Fig. 4 is filled using the best-matching contig path c 1 c 3 c 6 c 10 . After filling the gaps of scaffolds, the genome completeness will be greatly improved.

Measuring the similarity between an optical map and a contig path
Consider an optical map with site sequence x 1 · · · x m and a contig path with site sequence s 1 · · · s n . nanoGapFiller calculates the probability that the contig path generates the optical map (denoted as S(x 1 · · · x m , s 1 · · · s n ) ), and then uses this probability as similarity between them. The generating process of x 1 · · · x m from s 1 · · · s n is as follows: In an ideal optical mapping experiment, an enzyme recognition site s i in the contig path will be observed and recorded as a certain site x j of the optical map, which is called matching between sites s i and x j . However, it is often the case that some recognition sites are missing (called deletion) whereas some extra sites are recorded in optical maps purely due to false-positive signals (called insertion).
To formally describe the generating process of an optical map from a contig path, we define the alignment between their site sequences. For each alignment A of the sites sequences x 1 · · · x m and s 1 · · · s n , we use S A (x 1 · · · x m , s 1 · · · s n , A) to denote the possibility that the generating process corresponding to this alignment occurs.
Among all possible alignments between x 1 · · · x m and s 1 · · · s n , we identify the one with the highest score, and then use this score as the similarity between the two site sequences, i.e., where A denotes the set of all possible alignments of the two site sequences.
We calculate S A (x 1 · · · x m , s 1 · · · s n , A) as follows: we divide the two sequences at the matching sites of A, and thus acquire several matching fragment pairs. For example, the division at the matching sites < x 2 , s 2 > and < s 2 , x 4 > yields three matching fragment pairs (see Fig. 3). For each matching fragment pair p, we calculate three scoring items, including: S(x 1 · · · x m , s 1 · · · s n ) = max A∈A S A (x 1 · · · x m , s 1 · · · s n , A), (1) Length difference item LD(p): In the ideal case, two matching fragments should have identical length. However, in an optical mapping experiment, the molecules are always stretched or compressed, leading to length difference of the matched fragments. To measure the length difference, we adopted the Laplace distribution as performed by Rmaps [31][32][33], i.e., where d denotes the length difference of the two matching fragments in p, and µ and b denotes the mean and scale parameter of the distribution, respectively. (2) Missing sites item M(p): We used the Geometry distribution [31][32][33] to model the number of missing sites m, i.e., where q denotes the probability that an enzyme recognition site is detected by optical mapping. where represents the expected number of false positive sites. In this study, the parameters were set according to the manuallyverified alignments of optical maps and contig paths of E. coli as q = 0.772, = 1526000, µ = 293nt, b = 500nt.

Identifying the best-matching contig path of a gap
Before describing our method to identify the contig path that best matches a given gap, we first present the formulation of this problem: Let x 1 · · · x m be the site sequence of the gap of interest. Through locating gaps, we have identified from assembly graph two sites that match the beginning site x 1 and the ending site x m , respectively. We denote these two identified sites as s b and s e . Thus, the objective is to find the contig path with site sequence s b · · · s e such that the score S(x 1 · · · x m , s b · · · s e ) is maximized.
The basic idea of our method is probabilistic search together with search space pruning, which can be described as follows: Starting from the beginning site x 1 , we iterate finding the best-matching sites for each site x i (1 ≤ i ≤ m) through executing the following three steps: Pruning the unlikely matching pairs: To reduce the search space, we remove the unlikely matching sites, i.e., deleting the site s from M[x i ] if Pr[x i = s] is less than a pre-defined threshold MinimalMatchingProbability. We will show experimental results that when setting appropriate threshold, the search space could be significantly reduced with little influence on finding the correct paths. Searching site graph for the site sequence that best matches a gap. In this example, the gap has site sequence x 1 x 2 x 3 x 4 with distance 8, 5, 3, respectively. Through locating gaps in Step 1, we have known that the beginning site x 1 matches s 1 , and the ending site x 4 matches s 7 . Thus, our objective is to find the path from s 1 to s 7 that best matches the gap x 1 x 2 x 3 x 4 . a Initially, we set Pr[x 1 = s 1 ] = 1 as we have known x 1 matches s 1 . Next we propagated this probability to downstream site pairs and calculated the following matching beliefs for site