BiSCoT: improving large eukaryotic genome assemblies with optical maps

Motivation: Long-read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct the structure of entire chromosomes or chromosome arms. However, the existing tools are often too conservative, and the organization of contigs into scaffolds is not always optimal. Results: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes the files generated during a Bionano scaffolding in order to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads, and it significantly improved the contiguity and quality of the assemblies. BiSCoT generates a FASTA file of the assembly as well as an AGP file that describes the new organization of the input assembly. Availability: BiSCoT and the improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and on PyPI at https://pypi.org/project/biscot/.


INTRODUCTION
Figure 1. The Bionano scaffolding tool does not merge contigs even if they share labels. Instead, it inserts a 13-N gap between the contigs, thus artificially duplicating the shared region. a. BiSCoT merges contigs that share enzymatic labelling sites. b. If contigs do not share labels but share a genomic region, BiSCoT attempts to merge them by aligning the borders of the contigs. c. The Bionano scaffolding tool does not handle cases where contigs can be inserted into others. BiSCoT attempts to merge the inserted map with the one containing it if they share labels.

variation studies. They originate from overlaps that are not fused in the input assembly and usually correspond to allelic duplications. In addition, contigs can sometimes be inserted into other contigs; these cases are not handled by the Bionano scaffolding tool, which discards the inserted contigs (Figure 1, case 3).

We developed BiSCoT, a Python script that examines the data generated during a previous Bionano scaffolding and merges contigs separated by a 13-N gap if needed. BiSCoT also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is smaller than 1,000 nucleotides. BiSCoT is therefore not a traditional scaffolder, since it can only be used to improve an existing scaffolding based on an optical map.

Mandatory files loading
During the scaffolding, the Bionano scaffolder generates a visual representation of the hybrid scaffolds that is called an 'anchor'. It also generates one '.key' file, which describes the mapping between map identifiers and contig names; several CMAP files, which contain the positions of enzymatic labelling sites on the contig maps and on the anchor; and an XMAP file, which describes the alignment between a contig map and an anchor.

BiSCoT first loads the contigs into memory based on the key file. Then, the anchor CMAP file and contig CMAP files are loaded into memory. Finally, the XMAP file is parsed and loaded.

contig $C_k$ is examined at the same time as contig $C_n$, with $C_k$ aligned before $C_n$ on the anchor. Aligned anchor labels are extracted from these alignments and a list of shared labels $L_{n,k}$ is built. For the following cases, we suppose $C_k$ and $C_n$ to be aligned on the forward strand (Figure 1).
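
The construction of the shared-label list can be illustrated with a minimal sketch. It assumes, hypothetically, that each alignment has already been reduced to a list of (anchor label id, contig label id) pairs and that anchor label ids increase along the anchor; the function name and data layout are illustrative and are not taken from the BiSCoT source code.

```python
from typing import List, Tuple

# Hypothetical layout: one (anchor_label_id, contig_label_id) pair per aligned label.
Alignment = List[Tuple[int, int]]


def shared_anchor_labels(aln_k: Alignment, aln_n: Alignment) -> List[int]:
    """Return the anchor label ids aligned by both contigs, in anchor order.

    Assumes anchor label ids increase along the anchor, so sorting the ids
    gives their order on the anchor.
    """
    anchors_k = {anchor for anchor, _ in aln_k}
    anchors_n = {anchor for anchor, _ in aln_n}
    return sorted(anchors_k & anchors_n)


# C_k is aligned before C_n on the anchor; both align anchor labels 12 and 13.
aln_k = [(10, 1), (11, 2), (12, 3), (13, 4)]
aln_n = [(12, 1), (13, 2), (14, 3)]
shared = shared_anchor_labels(aln_k, aln_n)
print(shared)  # [12, 13]
```

An empty list corresponds to the situation where the two contig maps share no anchor label, and a non-empty list to the situation where they share at least one.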

Case 1: contig maps share at least one anchor label
The last label $l$ from $L_{n,k}$ is extracted and the position $P_l$ of $l$ on both contigs $C_k$ and $C_n$ is recovered from the CMAP files. In the resulting scaffold, the sequence of $C_k$ will be included up to the $P_l$ position and the sequence of $C_n$ will be included from the $P_l$ position. In this case, the gap is removed, both contigs $C_k$ and $C_n$ are fused and BiSCoT generates a single contig instead of two contigs initially separated by a gap in the input assembly.
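
As an illustration of the Case 1 fusion, the following sketch joins two sequences at a shared label. It assumes the label position has already been converted to a 0-based offset on each contig (the offset generally differs between the two contigs); the function and parameter names are hypothetical and do not come from BiSCoT itself.

```python
def merge_at_shared_label(seq_k: str, pos_k: int, seq_n: str, pos_n: int) -> str:
    """Fuse two overlapping contigs at a label they share.

    pos_k and pos_n are the 0-based offsets of the shared label on C_k and
    C_n. C_k contributes everything before the label, C_n everything from
    the label onward, so the overlapping region is kept only once.
    """
    if not 0 <= pos_k <= len(seq_k):
        raise ValueError("pos_k is outside seq_k")
    if not 0 <= pos_n <= len(seq_n):
        raise ValueError("pos_n is outside seq_n")
    return seq_k[:pos_k] + seq_n[pos_n:]


# Toy example: the contigs overlap on "CCGG"; the shared label sits at "GG".
seq_k = "AAAACCGG"
seq_n = "CCGGTTTT"
merged = merge_at_shared_label(seq_k, 6, seq_n, 2)
print(merged)  # AAAACCGGTTTT
```

Joining the same two toy sequences with a 13-N gap instead would keep the "CCGG" region twice, which is the artificial duplication that the fusion removes.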

Case 2: contig maps do not share anchor labels
Let $Size_k$ be the size of the contig $C_k$, $Sm_k$ and $Em_k$ the start and end of an alignment on a contig map, and $Sa_k$ and $Ea_k$ the corresponding coordinates on the anchor. The number $n$ of bases between the last aligned label of $C_k$ and the first aligned label of $C_n$ is then:

$$n = Sa_n - Ea_k$$

We then have to subtract the part $d_k$ of $C_k$ after the last aligned label of $C_k$ and the part $d_n$ of $C_n$ before the first aligned label of $C_n$:

$$d_k = Size_k - Em_k, \qquad d_n = Sm_n$$

Finally, we can compute the gap size $g$ with:

$$g = n - d_k - d_n$$

If $g \leq 1000$, a BLAT (Kent, 2002) alignment of the last 30 kb of $C_k$ is launched against the first 30 kb of $C_n$. If an alignment is found and if its score is higher than 5,000, $C_k$ and $C_n$ are merged at the starting position of the alignment and, as in Case 1, BiSCoT generates a single contig instead of two contigs initially separated by a gap in the input assembly. Otherwise, a number $g$ of Ns is inserted between $C_k$ and $C_n$.

In order to estimate the accuracy of gap sizes, we compared the gap sizes we introduced in the input assembly to the ones that were estimated using optical maps (Supplementary Figure 1). We found that the estimated gap sizes were very close to the true values, with a mean scaled absolute error of 0.8%.
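
The Case 2 gap-size arithmetic and the merge decision can be sketched as follows. This is a reading of the definitions above for forward-strand alignments, taking the anchor distance as $Sa_n - Ea_k$, the unaligned tail of $C_k$ as $Size_k - Em_k$ and the unaligned head of $C_n$ as $Sm_n$. The function names are hypothetical, and the BLAT score is passed in as a plain number because the sketch does not run BLAT.

```python
from typing import Optional


def estimate_gap(size_k: int, em_k: int, ea_k: int, sm_n: int, sa_n: int) -> int:
    """Estimate the gap between C_k and C_n from optical-map coordinates.

    Assumes both contigs are aligned on the forward strand.
    """
    n = sa_n - ea_k      # anchor distance between the flanking aligned labels
    d_k = size_k - em_k  # part of C_k after its last aligned label
    d_n = sm_n           # part of C_n before its first aligned label
    return n - d_k - d_n


def join_action(gap: int, blat_score: Optional[float],
                max_gap: int = 1000, min_score: float = 5000) -> str:
    """Choose between merging the contigs and inserting Ns between them.

    blat_score is None when no alignment was found (or none was attempted).
    """
    if gap <= max_gap and blat_score is not None and blat_score > min_score:
        return "merge"
    return "insert_gap"


# 4,000 bases separate the labels on the anchor; 2,000 belong to the end of
# C_k and 1,500 to the start of C_n, leaving an estimated gap of 500 bases.
g = estimate_gap(size_k=100_000, em_k=98_000, ea_k=500_000, sm_n=1_500, sa_n=504_000)
print(g)                     # 500
print(join_action(g, 6000))  # merge
print(join_action(g, None))  # insert_gap
```

A negative estimate means the contig ends extend past each other on the anchor, which is the situation where an alignment between the contig borders is most likely to be found.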

Validation on real data
Figure 2. a. Distribution of the sizes of overlapping regions in the raw assemblies. Detection was done using either Bionano labels (Case 1) or a BLAT alignment (Case 2). b. Contig N50 of the raw assemblies and of the assemblies before or after BiSCoT treatment.
scaffolds. The same kind of results was observed in the four plant genomes, with a slight decrease in scaffold NX metrics and in the number of Ns but an increase in contig NX metrics (Figure 2 and Supplementary Tables 2, 3, 4 and 5).
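
To make the opposite trends in scaffold and contig metrics concrete, the sketch below computes N50 on a toy scaffold before and after a merge. It uses the common definition of N50 (the length of the shortest sequence among the longest ones that together cover at least half of the total length) and, as an assumption, obtains contigs by splitting scaffolds on runs of Ns; the sequences are invented for illustration and are not data from the study.

```python
import re
from typing import Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def contig_lengths(scaffolds: Iterable[str]) -> List[int]:
    """Split each scaffold on runs of Ns and return the contig lengths."""
    lengths: List[int] = []
    for seq in scaffolds:
        lengths.extend(len(part) for part in re.split(r"[Nn]+", seq) if part)
    return lengths


# Toy scaffold: two 50-base contigs separated by a 13-N gap, then the same
# region after fusing the contigs and removing a 5-base duplicated overlap.
before = ["A" * 50 + "N" * 13 + "A" * 50]
after = ["A" * 95]

print(n50(len(s) for s in before), n50(contig_lengths(before)))  # 113 50
print(n50(len(s) for s in after), n50(contig_lengths(after)))    # 95 95
```

In this toy case the scaffold N50 drops slightly because the gap and the duplicated bases disappear, while the contig N50 rises because two contigs become one.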

Thanks to the advent of long-read and optical-map technologies, it is now possible to obtain high-quality chromosome-scale assemblies. However, the official Bionano scaffolding tool does not always perform optimally when joining two contigs. Indeed, it does not merge two sequences when they share a genomic region, creating artificial gaps in the assembly. We developed BiSCoT, a tool that corrects these problematic regions in a prior Bionano scaffolding, and showed that it significantly increased the contiguity metrics of the resulting assembly while preserving its quality.