Genome analysis Improving Bionano scaffolding with BiSCoT

Motivation: Long reads sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct entire chromosome or chromosome arms structure. However, the existing tools are often too conservative when two contigs are joined. Results: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a software that uses informations produced by a pre-existing assembly based on optical maps as input and improves the contiguity and the quality of the generated assembly. Availability: BiSCoT is freely available at https:/


Introduction
Assembling large and repetitive genomes, such as plant genomes, is a challenging field in bioinformatics. The apparition of short reads technologies several years ago improved considerably the number of genomes publicly available. However, a high proportion of them are still fragmented and few represent the chromosome organization of the genome. Recently, long reads sequencing techniques, like Oxford Nanopore Technologies and Pacific Biosciences, were introduced to improve the contiguity of assemblies, by sequencing DNA molecules that can range from a few kilobases to more than a megabase in size. Nevertheless and even if the assemblies were greatly improved, the chromosome-level organization of the sequenced genome couldn't be deciphered in a majority of cases. In 2017, Bionano Genomics launched its Saphyr system which was able to generate optical maps of a genome, by using the distribution of endonuclease nicking sites. These maps were used to orient and order contigs into scaffolds but the real improvement came in 2018, when Bionano Genomics introduced their Direct Label and Stain (DLS) technology that was able to produce genome maps with an N50 several folds higher than previously (Deschamps et al, Belser et al, 2018).
However, scaffolds generated with the tool provided by Bionano Genomics do not reach optimal contiguity. Indeed, when two contigs C1 and C2 are found to share labels, one could expect that the tool would merge the two sequences at the shared site. Instead, the software chooses a conservative approach and outputs the sequence of C1 followed by a 13-Ns gap and then the C2 sequence, thus duplicating the region that is shared by the two contigs (Supplementary Figure 1) and in numerous cases, these duplicated region could reach several kilobases. As an example, on the human genome we used to evaluate BiSCoT (see Results), we could detect 168 of those regions, affecting 16 genes and corresponding to around 5Mb of duplicated sequences, the longest being 94kb in size. These duplicated regions have to be corrected as they can be problematic for downstream analyses, like copy number variation studies. In addition, contig maps can sometimes be inserted into other maps, these cases are not handled by the Bionano scaffolding tool that discards the inserted maps (Supplementary Figure 2).
We developed BiSCoT, a python script that examines data generated during a previous Bionano scaffolding and merges contigs separated by a 13-Ns gap if needed. BiSCoT also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is inferior to 100 nucleotides.

Methods
Briefly, BiSCoT loads all necessary files into memory, then checks if two adjacent contigs in an alignment share labels. If so, it merges those two contigs at the shared site. If no labels are shared, a gap size g is estimated. If g ≤ 100, then a BLAT (Kent, 2002) alignment is performed between contig borders and if a match is found, contigs are merged at the corresponding alignment position. Finally, BiSCoT generates an AGP file describing changes in the contig organization and a FASTA file corresponding to scaffolds.

Mandatory files loading
During the scaffolding, the Bionano scaffolder generates one key file, several CMAP files and a XMAP file that will be used by BiSCoT. It also generates a visual representation of the hybrid scaffolds that is called an 'anchor'.
BiSCoT first loads the contigs into memory based on the key file, which describes the mapping between map identifiers and contig names.
Then, BiSCoT loads the anchor CMAP file and one CMAP contig file per enzyme into memory. These files contain the position of enzymatic labelling sites on contig maps and on the anchor generated during the Bionano scaffolding.
Finally, BiSCoT loads the XMAP file that describes the alignment between a contig map and an anchor.

Scaffolding
Alignments contained into the XMAP file are first sorted by their starting position on the anchor. Then, alignments on one anchor are parsed by pairs of adjacent contigs, i.e alignment of contig Ck is examined at the same time as contig Cn. Aligned anchor labels are extracted from these alignments and a list of shared labels Ln,k is built. For the following cases, we suppose Ck and Cn to be aligned on the forward strand and Ck to be aligned before Cn on the anchor (Supplementary Figure 1 and 2).

Case 1: contig maps share anchor labels
First, the right-most shared label identifier l is extracted. The position of l (Pl) on both contigs Ck (Pl(Ck)) and Cn (Pl(Cn)) is searched, based on information provided in the CMAP files. In the resulting scaffold, the sequence of Ck will be included up to the Pl(Ck) position and the sequence of Cn will be included from the Pl(Cn) position.

Case 2: contig maps don't share anchor labels
Let Sizek be the size of the contig Ck, Smk and Emk the start and end of an alignment on a contig map and Sak and Eak the corresponding coordinates on the anchor. The number n of bases between the right-most aligned label of Ck and the left-most aligned label of Cn is then: n = San -Eak We then have to subtract the part dk of Ck after the right-most aligned label and the part dn of Cn before the left-most aligned label:

dk = Sizek -Emk dn = Smn
Finally, we can compute the gap size g with: g = ndk -dn If g < 100, a BLAT alignment of the last 30kb of Ck is launched against the first 30kb of Cn. Only the best alignment is kept and if its score is higher than 5,000, Ck and Cn are merged at the starting position of the alignment. Otherwise, a number g of Ns is inserted between Ck and Cn.

Results
In order to evaluate BiSCoT, we downloaded the contigs fasta file of the NA12878 human genome that was sequenced and assembled by the Whole Human Genome Sequencing Project (Jain et al, 2018) and Bionano DLE and BspQI maps from the Bionano Genomics website (https://bionanogenomics.com/library/datasets/). The QUAST (Gurevich et al, 2013) and BUSCO (Simão et al, 2015) tools were used respectively to evaluate the number of misassemblies to the GRCh38.p12 human reference genome and the number of conserved genes among eukaryotes.
In all cases, we first ran a classic Bionano scaffolding and used the files generated during this step to run BiSCoT (Table 1, Supplementary  Table 1, 2 and 3).
Globally, using BiSCoT lowered slightly the scaffolds NX metrics but this is an expected side effect as the merging of sequences induces a lower cumulative size.
However, the contig NX and NGAX metrics increased drastically: the contigs NGA50 of NA12878 increased by around 10%, going from 5.8Mb to 6.3Mb. The scaffolds NGAX metrics also increased: the scaffolds NGA50 increased from 10.8Mb in Bionano scaffolds to 11.5Mb in BiSCoT scaffolds. Moreover, the number of Ns decreased marginally and the number of complete eukaryotic BUSCO genes increased from 255 to 256. More importantly, when aligning the assemblies against the reference genome, we could detect a decrease in the number of misassemblies going from 1,592 in Bionano scaffolds to 1,491 in BiSCoT scaffolds. 32 (10.5%) 29 (9.6%) 29 (9.6%) 29 (9.6%)

Article short title
The same kind of results were observed in the three plant genomes with a slight decrease in scaffolds NX metrics and number of Ns but an increase in contigs NX metrics and complete BUSCO genes (Supplementary Tables 1, 2 and 3).