Resolving the complex Bordetella pertussis genome using barcoded nanopore sequencing

The genome of Bordetella pertussis is complex, with high G+C content and many repeats, each longer than 1000 bp. Long-read sequencing offers the opportunity to produce single-contig B. pertussis assemblies using sequencing reads which are longer than the repetitive sections, with the potential to reveal genomic features which were previously unobservable in multi-contig assemblies produced by short-read sequencing alone. We used an R9.4 MinION flow cell and barcoding to sequence five B. pertussis strains in a single sequencing run. We then trialled combinations of the many nanopore user community-built long-read analysis tools to establish the current optimal assembly pipeline for B. pertussis genome sequences. This pipeline produced closed genome sequences for four strains, allowing visualization of inter-strain genomic rearrangement. Read mapping to the Tohama I reference genome suggests that the remaining strain contains an ultra-long duplicated region (almost 200 kbp), which was not resolved by our pipeline; further investigation also revealed that a second strain that was seemingly resolved by our pipeline may contain an even longer duplication, albeit in a small subset of cells. We have therefore demonstrated the ability to resolve the structure of several B. pertussis strains per single barcoded nanopore flow cell, but the genomes with highest complexity (e.g. very large duplicated regions) remain only partially resolved using the standard library preparation and will require an alternative library preparation method. For full strain characterization, we recommend hybrid assembly of long and short reads together; for comparison of genome arrangement, assembly using long reads alone is sufficient.


INTRODUCTION
Bordetella pertussis is the pathogenic bacterium which causes most cases of whooping cough (pertussis). Pertussis was a major medical burden prior to the international introduction of vaccination in the 1940s and 1950s. Widespread vaccine uptake greatly reduced incidence of the disease in developed countries. Original whole-cell vaccines were replaced by new acellular vaccines throughout the 1990s and early 2000s. The acellular vaccines contain one to five of the B. pertussis protein antigens pertactin (Prn), pertussis toxin (Pt), filamentous haemagglutinin (FHA), and the fimbrial proteins Fim2 and Fim3. Despite continued high levels of pertussis vaccination coverage, since the early 1990s the number of cases of whooping cough has increased in many countries [1,2].
Suggested causes for this resurgence include improved diagnostic tests and awareness, waning immunity as a result of the switch to acellular vaccination, and genetic divergence of circulating B. pertussis from the vaccine strains due to vaccination-induced selection pressure [3][4][5]. A global survey of strains from the pre-vaccine, whole-cell vaccine and acellular vaccine eras showed that the genome of B. pertussis, widely regarded as a monomorphic and slowly evolving organism, has been evolving since the introduction of the whole-cell vaccine [6]. Analysis of strains from several recent epidemics showed the rate of evolution of the genes encoding vaccine antigens has increased since the switch to the acellular vaccine [7][8][9][10].
The B. pertussis genome contains up to 300 copies of a 1053 bp insertion sequence (IS), IS481. A smaller number of copies of IS1002 (1040 bp) and IS1663 (1014 bp) contribute further complexity to the genome. These regions of repetition mean that assembly of closed, single-contig B. pertussis genomes using short-read sequencing, which produces reads shorter than the IS repeats, has been particularly difficult: most genome sequences available on NCBI comprise several hundred contigs, or at least one contig per IS copy. Over the last decade, many studies have shown that reads longer than the longest repeat are required to resolve regions of high complexity [11][12][13][14][15][16][17][18]. Assembly of closed genomes may reveal genomic features which were previously unobservable in multi-contig assemblies; this is particularly true for genomes which contain a high number of IS copies, as insertion sequences are known to impact genomic structure via rearrangement, deletion and, more rarely, duplication [19,20].
In 2016, Bowden et al. [21] were the first to use long reads, together with Illumina short reads, to conduct a survey of B. pertussis strains which had circulated during two whooping cough epidemics, in the USA, in 2010 and 2012. Assembling closed genomes for these epidemic isolates revealed extensive genomic arrangement differences between isolates which appeared to be otherwise closely related. Bowden et al. concluded that further comprehensive whole genome studies are required to fully understand the international resurgence of whooping cough. More recently, Weigand et al. showed that the B. pertussis genome continues to undergo structural rearrangement on a frequent basis, usually mediated by IS481 [22]. As well as causing structural rearrangement, IS elements have also repeatedly been shown to be responsible for the ongoing reduction of the B. pertussis genome via gene deletion [23][24][25][26].
Bowden et al. and Weigand et al. both used Pacific Biosciences (PacBio) long read sequencing, which has high start-up costs, and lacks the portability needed for on-the-ground epidemic surveillance. In contrast, Oxford Nanopore Technology (ONT)'s MinION nanopore sequencer has relatively low start-up costs. Recent improvements to flow cell yield and the introduction of barcoded library preparation make per-sample MinION costs comparable to those of PacBio or Illumina [15,27,28]. In addition, the pocket-sized MinION sequencer is portable, enabling in-the-field sequencing [29][30][31].
Here we test the ability of barcoded nanopore sequencing, together with a variety of available data analysis tools, to resolve the genomes of five B. pertussis strains from a UK epidemic, which were previously unclosed and comprised many contigs assembled using short reads sequenced with Illumina's MiSeq [7]. We then briefly investigate the resulting genomes to identify any previously unobserved features, with a particular focus on the genome of one strain which remained unresolved by our hybrid assembly strategy.

IMPACT STATEMENT
Over the past two decades, whole genome sequencing has allowed us to understand microbial pathogenicity and evolution to an unprecedented degree. However, repetitive regions, like those found throughout the Bordetella pertussis genome, have confounded our ability to resolve complex genomes using short-read sequencing technologies alone. We have used nanopore sequencing, which can generate reads longer than these problematic repetitive regions, to resolve multiple B. pertussis genomes with a single flow cell. The resolved genomes can be used to visualize previously predicted genome rearrangements and, in addition, the inability of our long reads to resolve some of our genomes has allowed us to infer the presence of previously unidentified ultra-long duplications in two of our five strains. Thus, our findings point towards unanticipated genome-level genetic variation in strains which appear otherwise monomorphic at the nucleotide level. This work expands the recently emergent theme that even the most complex genomes can be resolved with sufficiently long sequencing reads. Our optimization process, moreover, shows that the analysis tools currently favoured by the sequencing community do not necessarily produce the most accurate assemblies for all organisms; pipeline optimization may therefore be beneficial in studies of unusually complex genomes.
All data analysis was carried out using the Medical Research Council's Cloud Infrastructure for Microbial Bioinformatics (CLIMB) [32].

Strain isolation and Illumina sequencing
Five strains originally isolated during the UK 2012 whooping cough epidemic were obtained from the National Reference Laboratory, Respiratory and Vaccine Preventable Bacteria Reference Unit, at Public Health England. Shortread sequencing data were generated previously, using genomic DNA (gDNA) extracted using a DNeasy Blood and Tissue kit (Qiagen), multiplex library preparation and Illumina sequencing [7]. Full details, including accession numbers, are included in Table S1.

DNA extraction
Strains were stored at À80 C in PBS/20 % glycerol at the University of Bath. They were grown on charcoal agar plates (Oxoid) for 72 h at 37 C. All cells were harvested from each plate and resuspended in 3 ml PBS. The optical density of each cell suspension was measured at 600 nm, and volumes of suspension equating to an OD of 1.0 (~2Â10 9 B. pertussis cells) in 180 µl were pelleted in a microcentrifuge for 2 min at 12 000 g. gDNA was extracted from each pellet using a GenElute bacterial genomic DNA kit (Sigma Aldrich) according to the manufacturer's instructions, including the optional RNAase A step and a two-step elution into 200 µl elution buffer (10 mM Tris-HCl, 0.5 mM EDTA, pH 9.0). QuBit fluorometry (dsDNA HS kit; Invitrogen) was used to measure gDNA concentration, and Nanodrop spectrometry (ThermoFisher Scientific) was used to assess gDNA purity.

Nanopore library preparation and MinION sequencing
In total, 1.5 µg of gDNA per strain was concentrated using a 2.5Â SPRI clean-up (AMPure XP beads; Beckman Coulter), eluting into 50 µl of nuclease-free (NF) water (Ambion). Then, 48 µl of this was sheared to 20 kb using g-tubes (Covaris), according to the manufacturer's instructions.
Sequencing libraries were prepared for all samples using ONT's 1D ligation sequencing kit (SQK-LSK108) with native barcoding (EXP-NBD103), according to the manufacturer's instructions. Ten barcodes were used, two per strain (see Table S2 for full details). After library preparation, different volumes of samples were combined to produce an equi-mass pool for eight samples; two samples had much lower concentration after library preparation so were pooled in full. A total mass of 712.5 ng was pooled in 208.9 µl NF water, which was concentrated to 50 µl by 2.5Â SPRI clean-up. Full details of mass pooled per sample are given in Table S2. This pooled library (712.5 ng in 50 µl) was used for sequencing adapter ligation.
The final sequencing library was loaded onto an R9.4 flow cell and sequenced for 48 h using a MinION MK1b device with MinKNOW sequencing software (protocol NC_48h_Sequencing_Run_FLOMIN106_SQK-LSK108).

Additional basecalling and demultiplexing
The fast5 files were basecalled using ONT's Albacore (v2.1.3) program, with barcode binning. As suggested by Wick et al. [15], Porechop was then used to demultiplex the Albacore reads, keeping only those for which Albacore and Porechop agreed on the bin. The Albacore+Porechop fastq files were deposited in the Sequence Read Archive (SRA) with accession codes SAMN09500966 to SAMN0900970. Full details of all read sets (including reads output by Min-KNOW's concurrent basecalling algorithm) are given in Table S1.
Assembly of short-read-only drafts Assuming the available Illumina data to have typically low error, short-read-only genome sequences were assembled for each strain using ABySS (v2.0.3) [33]. Prior to assembly, reads were prepared using Trimmomatic (v0.34) [34], which trimmed the first 10 bases of each read, and discarded any reads whose four-base sliding-window q-score fell below 32. These assemblies had low contiguity, but theoretically high accuracy.
Comparison of raw reads A shell script, 'summary_stats', was used to give the total number of reads, mass sequenced and minimum, maximum, mean and median read lengths for each set of raw reads. Summary_stats uses seq_length.py [35] and all_stats. All are available from https://github.com/nataliering/ Resolving-the-complex-Bordetella-pertussis-genome-usingbarcoded-nanopore-sequencing. Raw percentage identity was estimated by comparing each read set to the B. pertussis reference genome (Tohama I, NC_002929.2). As the UK 2012 strains were not expected to be identical to the Tohama I sequence, read error was also estimated by comparison with the respective Illumina-only assemblies. The comparison was conducted using BWA MEM [36] and samtools stats [37], which produces a long output file including 'error rate' [% identity was calculated from this: 100 À (error rate*100)]. Raw_error (https://github. com/nataliering/Resolving-the-complex-Bordetella-pertussisgenome-using-barcoded-nanopore-sequencing/blob/master/ raw_error) produces a stats file using this method, given a read set and reference genome. Using the same BWA MEM output, raw read coverage of the Tohama I reference genome was checked using samtools depth and visualization with a rolling window in R.
Canu has a standalone option to conduct pre-assembly read correction. This was used to correct the 359Â coverage UK36 read set to 40Â coverage of more accurate reads. Each assembly tool was then trialled with and without preassembly read correction. As Canu's read correction step is relatively time-consuming as regards CPU, an alternative was also trialled. Filtlong (https://github.com/rrwick/Filtlong) does not correct reads, but produces read sets comprising the longest and most accurate reads, up to a given level of coverage; 40Â and 100Â coverage were trialled here.
Finally, Racon (v.1.2.0) [42] was tested to determine whether the draft assemblies could be improved by postassembly polishing. After each Racon polish, the accuracy of the assembly produced was estimated. If an improvement was observed, another round of polishing was conducted, up to a total of five rounds. Once two successive rounds of polishing showed no further improvement, no further Racon polishes were conducted. For Unicycler, no manual Racon polishes were conducted, because Racon polishing is part of the Unicycler assembly process. After Racon polishing, each assembly was further polished with a single round of Nanopolish (v0.9.0) [14].
Testing exhaustive combinations of each of these steps produced 28 draft assemblies for each of the four assembly tools tested (ABruijn/Flye, Canu, Miniasm and Unicycler), a total of 112 draft UK36 assemblies (see Table S3 for all combinations).
Assembly tool testinghybrid As Illumina reads were already available for the strains sequenced here, a variety of hybrid de novo assembly strategies were also tested. Using Pilon (v1.22) [43], the best nanopore-only assembly for each of the assembly tools was polished with the Illumina reads, up to a total of five rounds. In addition, a hybrid assembly was produced using Unicycler's hybrid mode, which both combines read sets for assembly, and conducts several rounds of Racon and Pilon polishes automatically. Finally, the hybrid assembly mode of SPAdes (v3.12.0) [44] was tested. These hybrid tests produced another 22 draft assemblies (Table S4).

Assessing assembly accuracy
Summary_stats was used to determine the number of contigs, and contig length for each draft assembly. The percentage identity of each draft compared to the Illumina-only draft was estimated using a method developed by Wick et al. [45]. Their chop_up_assembly.py and read_length_identity.py scripts were used to generate percentage identity values for 10 kbp sections along the entirety of each assembly, and a custom shell script, assembly_identity (https:// github.com/nataliering/Resolving-the-complex-Bordetellapertussis-genome-using-barcoded-nanopore-sequencing/ blob/master/assembly_identity), was used to calculate the mean percentage identity of the whole.

Comparing genome arrangement
After the best nanopore-only and hybrid assembly pipelines were identified for UK36, the pipelines were used to produce draft assemblies for the remaining four strains. The hybrid assembly for each strain was annotated with Prokka (v1.12) [49] using the proteins from Tohama I as a reference. The genomes were also submitted to GenBank (accession numbers CP031289, CP031112, CP031113, QRAX00000000 and CP031114).
The arrangement of each nanopore-only assembly was compared to that of each hybrid using progressiveMauve (v20150226 build 10) [50]. Finally, the nanopore-only assemblies for each strain were compared to each other, also using progressiveMauve. Prior to these alignments, each draft was manually rearranged so that the first gene after the B. pertussis origin of replication, gidA, was at the beginning of the sequence. gidA_blast (https://github.com/nataliering/Resolving-the-complex-Bordetella-pertussisgenome-using-barcoded-nanopore-sequencing/blob/master/gidA_blast) locates the gidA sequence in the draft to enable manual rearrangement. Later, this same process was used to identify IS element copies in the assembled genomes. If a tool assembled the complementary strand instead of the template (as identified by the results of gidA_blast), a reverse complement of the draft sequence was generated using reverse_complement (https://github.com/ nataliering/Resolving-the-complex-Bordetella-pertussisgenome-using-barcoded-nanopore-sequencing/blob/master/reverse_complement).

Sequencing yield
During the 48 h MinION sequencing run, 1 803 648 reads were generated, equating to 9.73 Gbp of sequence. In total, 28.78 % of these reads (574 053 reads, 2.8 Gbp) were not assigned to the correct barcode bin during demultiplexing, leaving 6.93 Gbp (1 229 595 reads) of useable sequencing data (Fig. 1). Normalized yield per barcode (taking into account nanograms of gDNA included in the pooled sequencing library) was particularly high for one barcode (NB11, 15.28 Mbp ng À1 ) but otherwise relatively consistent, ranging from 7.38 to 10.28 Mbp ng À1 with a mean yield of 9.06 Mbp ng À1 (SE=0.37). Mean read length for the full read set was 5689 bp. Read lengths ranged from 4 to 201 977 bp.
The Albacore-demultiplexed reads were re-demultiplexed using Porechop, which keeps only the reads for which both tools agree on the barcode identified. This additional step resulted in a small but significant improvement in identity compared to Illumina-only assembly: 82.43 to 82.52 % (n=5, paired t-test P<0.001). Consequently, the reads used for data pipeline testing were those that had been basecalled and demultiplexed by Albacore, followed by Porechop redemultiplexing.
For full results, including which barcode was assigned to each sample, see Table S1.
Assembly tool testingnanopore-only Table 1 shows the quality measurements for the best nanopore-only assembly per tool trialled. All tools tested were able to resolve the nanopore long reads for UK36 into a complete, closed contig, using default assembly options with no manual intervention. In total, 112 different tool combinations were trialled. Alignment of drafts from different tools using progressiveMauve revealed that each tool also assembled the genome into the same arrangement. However, the length of the draft assemblies showed some variation: 3.984 to 4.134 Mbp, with a mean length of 4.108 Mbp. Comparing like-for-like assemblies before and after polishing shows that Nanopolish improves identity by 0.216 % on average (n=16, paired t-test P<0.001). Polishing with Racon produced inconsistent results: identity decreased after Racon polishing of ABruijn and Flye drafts, increased by 2.01 % after the optimal number of polishing rounds for pre-corrected non-ABruijn/Flye drafts (n=3), and increased by 15.15 % after optimal rounds for non-ABruijn/Flye drafts with no pre-correction (n=4). The mean number of Racon polishes required to reach optimal percentage identity (after which it began to decrease) was 4.75 (n=7).
The assembly with greatest percentage identity compared to the Illumina-only draft (99.59 %) combined pre-assembly read correction with Canu, assembly with ABruijn and post-assembly polishing with Nanopolish. The assemblies were also assessed using BUSCO, which searches draft assemblies for copies of benchmarking universal singlecopy orthologues (BUSCOs). BUSCOs are sets of core genes which are likely to appear universally in related organisms. A set of 40 such core genes from the Escherichia coli genome are used as the gram-negative bacterial BUSCOs; if a genome has been assembled accurately, the tool BUSCO is more likely to be able to identify these 40 genes within its sequence. Of the drafts assessed here, the ABruijn assembly contained the highest number of identifiable BUSCOs (37 full and two partial, of the full set of 40; see Table 1 for full results).
Assembly tool testinghybrid A number of hybrid assembly strategies were trialled, including polishing a long-read assembly with short reads, scaffolding short-read contigs with long reads, and using both short and long reads together during assembly ( Table 2 shows the best draft produced by each tool). Scaffolding short-read contigs with long reads using SPAdes produced one of the highest accuracy assemblies (99.68 %), but did not fully resolve the genome, as six contigs remained. No further polishing was attempted with this SPAdes assembly, as polishing would not close the remaining gaps between the contigs.
The best hybrid assemblies per tool were significantly more accurate than the best nanopore-only assemblies per tool, with a mean identity improvement of 0.11 % (hybrid n=6, nanopore-only n=5, paired t-test P<0.001). In addition, all hybrids contained all 40 identifiable BUSCOs, and all except the SPAdes hybrid were single closed contigs and showed the same arrangement when aligned using progressiveMauve.
The best single-contig hybrid assembly, with 99.68 % identity, was produced using Unicycler's hybrid option, which uses SPAdes, Minimap, Miniasm, Racon and Pilon. Table S5 shows the results from all nanopore-only and hybrid tests.

Assembly and annotation of all strains
Using the nanopore-only and hybrid pipelines defined through the tests described here (Fig. 2), draft genomes were assembled for all five UK strains sequenced during our barcoded run. The assemblies were assessed for percentage identity compared to each strain's Illumina-only assembly, G+C content, genome length and number of key IS element features; they were also annotated using Prokka. The full results of this analysis are shown in Tables 3, S6 and S7.
The hybrid assembly for one strain, UK76, had slightly lower percentage identity (99.54 %) than the other strains, each compared to their respective Illumina-only ABySS assembly. Discounting UK76, the assemblies had a mean identity of 99.69 % (n=4). The G+C content of the strains varied little: the content for all strains was 67.70 % when rounded to two decimal places. The number of genes predicted by Prokka was also relatively consistent, varying from 3757 to 3804.
The UK36 proteins predicted by Prokka were assessed by Ideel, which searched the Trembl database [51] for similar proteins. The length of the Prokka-predicted proteins was divided by those of the identified similar Trembl proteins; a perfect match would equal 1.0. This method, therefore, indicates whether indels in a draft sequence cause frameshifts which subsequently lead to truncated (or over-long) protein prediction. After manual curation to remove results which represented genes known to be fully present in other Bordetella species but truncated in B. pertussis, over 98 % of Prokka-predicted genes had a Prokka:Trembl length ratio of greater than 0.9. This suggests that the residual error in the hybrid assemblies does not cause substantial annotation problems, so the hybrid assemblies for all five strains were submitted to GenBank (accession numbers CP031289, CP031112, CP031113, QRAX00000000 and CP031114).

Comparison of genomic structure of all strains
All strains were assembled into single contigs using the nanopore-only pipeline. These assemblies were aligned using progressiveMauve (Fig. 3), displaying genomic rearrangement between strains; three, UK36, UK38 and UK39, shared exactly the same arrangement, whilst UK48 and UK76 were rearranged.
Of the hybrid assemblies, two strains, UK48 and UK76, had longer genomes than the others (4.112 and 4.113 Mbp, respectively, compared to 4.108 Mbp), which corresponds with them also having more copies of the most abundant IS element, IS481. All strains but one were assembled into single contigs. The remaining strain, UK48, was assembled into five contigs (N50=3.934 Mbp). Of these, three were shorter than 500 bp, and were subsequently discarded. The remaining two contigs were 3 934 355 and 178 023 bp. Mapping the raw UK48 reads to the Tohama I reference sequence revealed a section of almost 200 kbp, located between 1.35 and 1.53 Mbp, which had double the read depth of the rest of the reference; the doubled read depth suggests that this section of the genome is duplicated in UK48. No other strain had a similarly duplicated section, although the coverage of UK76 was enriched by around 25 % at the same locus (Fig. 4), potentially indicating a heterogeneous UK76 population, of which a subset (i.e. 25 %) of cells carries a duplication. These abnormalities are also present in the Illumina reads, which were obtained approximately 5 years before our nanopore reads (Fig. 4), but had not been identified previously using the short reads alone.

DISCUSSION
Are residual unresolved ultra-long repeats present in some strains? Our primary aim in this study was to determine whether long reads produced by nanopore sequencing using ONT's MinION can be used to produce closed B. pertussis genome sequences, which will enable visualization of Ring et al., Microbial Genomics 2018;4 large-scale inter-strain genomic differences, and may further reveal previously hidden genomic features. Our nanopore-only assembly pipeline produced closed-contig assemblies for all five strains sequenced here, allowing visualization and validation of previously predicted ISmediated genomic rearrangements. In addition, the inability of our long reads to produce a closed hybrid assembly for UK48 has revealed a separate, unpredicted, genomic feature in our UK strains.
The region of enriched coverage between 1.35 and 1.53 Mbp in the Tohama I reference genome observed in the UK48 reads (Fig. 3) is likely to indicate a large (almost 200 kbp) duplication of that region which is present in UK48 but not in the reference. A less obvious duplication may also be present in the genome of UK76: a 300 kbp region from 1.38 to 1.68 Mbp shows 125 % coverage. The presence of the same abnormalities in other read sets for both strains suggests that they have not been caused by contamination (Fig. 3). Similar duplicated regions have been observed previously in a very small number of French and Finnish strains (fewer than five) through microarray-based studies in 2006 and 2007 [25,26]. More recently, Weigand et al. [52,53] noted complex duplications in two US strains and two Indian vaccine-reference strains; these genomes were long-read sequenced with PacBio, but resolution of the duplications was only possible with optical mapping. The locus found to be duplicated in these previous studies was the same as that we predict is duplicated in UK48 and UK76; however, at 180 and 300 kbp, our predicted duplications are longer than any of those observed previously. The identification of two additional strains carrying a  Fig. 3. Alignment of our five sequenced strains, showing genomic rearrangement. Our five UK B. pertussis strains (UK36, UK38, UK39, UK48 and UK76) were assembled using our nanopore-only pipeline, resulting in single, closed-contig, assemblies. The closed assemblies were aligned with progressiveMauve, which showed that even strains which are closely temporally related can display different genomic arrangements. duplication of the same region suggests that IS-mediated duplication is occurring more frequently in B. pertussis than previously believed. Furthermore, the apparent heterogeneity of our UK76 culture suggests that only a portion of UK76 cells may carry the duplication, a phenomenon previously unobserved in any duplication-carrying B. pertussis isolate. Finally, the locus of the duplication itself, which contains many motility-related genes, may have interesting implications for an organism traditionally described as non-motile. Fig. 4. Alignment of nanopore reads to the Tohama I reference sequence compared to alignment of Illumina reads to the Tohama I reference sequence. Raw reads from each sequencer were aligned to the reference using BWA MEM, followed by coverage calculation with samtools depth. The coverage of three strains (UK36, UK38 and UK39) was consistent across the whole reference genome, whereas UK48 and UK76 coverage was enriched in certain locations. In UK48, a large section from 1.35 to 1.53 Mbp into the reference appears to have exactly twice as much coverage as the rest of the genome. In UK76, a section from 1.38 to 1.68 Mbp is enriched by 25%. The coverage abnormalities seen in UK48 and UK76 are present in both sets of reads, suggesting they are not the result of a quirk in sequencing method, or contamination. Neither our nanopore-only pipeline using Flye nor our hybrid pipeline using Unicycler was able to resolve the duplication correctly, however. The nanopore-only pipeline produced closed contigs for all five strains, seemingly missing the duplication completely, whilst the hybrid pipeline produced a multi-contig assembly for UK48 and the same closed contig as the nanopore-only pipeline for UK76. Our UK48 reads ranged from 73 to 108 575 bp with a mean length of 6243 bp, whilst the UK76 reads ranged from 4 to 70 486 bp with a mean length of 5480 bp; if the key to resolving long repeats is to use reads longer than the longest repeat, we will need high coverage of ultra-long reads in the order of hundreds of thousands of bases to resolve these putative duplications via sequencing [16,17]. Nanopore sequencing is currently the only sequencing method theoretically capable of producing such long reads; methods to obtain ultra-long reads are under development by the nanopore community, with reports of reads in the order of millions of bases [54].
Accuracy of long-read sequencing is improving but error estimation is challenging In addition to our primary aim, we also compared a variety of de novo assembly strategies to determine the current optimal pipeline for producing the most accurate genome sequences for B. pertussis.
Without a recent, closely related reference sequence, error estimation in B. pertussis assemblies is inexact. Comparison with the Tohama I reference sequence will identify basecalling errors which are false positives, having arisen due to natural variation between different strains (that is, true SNPs will be identified as errors). Moreover, the validity of Tohama I as a representative of all B. pertussis strains is questionable [55]. Having estimated our hybrid assemblies to be, on average, 99.69 % accurate, we can conclude that roughly 13 000 bases in each 4.1 Mbp draft genome are incorrect. Whilst these incorrectly called bases will not influence comparisons of genome arrangement, residual base errors in draft genome sequences assembled using long reads remain a concern, with the potential to falsely identify SNPs or prevent accurate protein prediction [56]. Incorrect sequencing of homopolymers is a known weakness of many sequencing methods, including nanopore sequencing [17], and our assemblies are no exception. Indeed, a base-level manual comparison of one of our hybrid assemblies with a more accurate Illumina-only draft using progressiveMauve revealed that every difference occurred in a homopolymeric tract, with the hybrid sequence having inserted or deleted bases. Two options for correct SNP identification, therefore, are manual correction of known homopolymeric indels [56], and simply ignoring SNPs which appear to occur in homopolymeric regions. The manual correction option would be time-consuming, whilst the second option could result in false negatives. Nevertheless, until improved pore chemistry or basecalling tools are available which do not produce homopolymeric indels, the use of either option means that SNP identification is still possible, even in assemblies which are less than 100 % accurate.
Correct prediction of proteins appears to be of less concern than SNP identification in our hybrid assemblies: all 40 potential bacterial BUSCOs were present in full for all of our strains, and both Quast and Prokka were able to identify the majority of the Tohama I reference proteins in the same assemblies. In addition, assessment of our UK36 hybrid using Watson's Ideel pipeline [48] suggested that, although we know some errors remain, they do not substantially inhibit the correct prediction of full-length proteins during annotation. It is here, however, that we can see clearly the benefit of the hybrid assemblies over the nanopore-only assemblies: although the mean accuracy of the nanoporeonly assemblies (99.48 %) was only 0.2 % lower than that of the hybrids, none of the nanopore-only assemblies contained full copies of all 40 BUSCOs.
Does the de Bruijn graph method assemble highly repetitive prokaryotic genomes more accurately than other commonly used methods?
The opinion of the sequencing community has long been that de Bruijn graph assembly is not as effective for errorprone long reads as other de novo assembly methods [57,58]. The tool which consistently produced the most accurate nanopore-only B. pertussis assemblies was therefore unexpected: the percentage identity and indel rates of our ABruijn assemblies were better by far than those of the Canu, Miniasm or Unicycler assemblies. The recent version change of ABruijn to Flye seems to have negatively affected these metrics in some of our strains; however, whilst the ABruijn assemblies were better than the Flye assemblies, the Flye assemblies were still better than those produced by other tools. Another recent study, which assembled highly complex and repetitive Pseudomonas koreensis genomes using ultra-long nanopore reads, also found Flye to produce the most accurate assemblies [16]. This suggests that the de Bruijn method might be optimal for prokaryotic genomes which contain a high number of repeats.
Two possible pipelines for B. pertussis genome sequence resolution We have shown here that resolution of five B. pertussis genomes per MinION flow cell is possible, whether using long reads alone or in combination with short reads. Sequencing five strains using one flow cell produced a mean yield of over 300Â B. pertussis genome coverage per strain, which probably exceeds that required to achieve comparable results. A draft produced from using roughly half of our reads (175Â coverage) for UK36, pre-corrected and assembled with Flye, had an identity of 99.467 %, whilst the same assembly produced by the full (360Â coverage) read set had an identity of 99.474 %. This suggests that twice as many strains could be de novo assembled per flow cell without a notable drop in accuracy. Thus, resolution of ten B. pertussis genomes per MinION flow cell should be possible.
If short reads are also available, we have shown that hybrid assembly, using pre-correction with Canu followed by Unicycler, remains the most accurate method. Indeed, for now, for full strain characterization (including comparison of genome arrangement, SNP identification and allele-typing), hybrid assemblies are required. For comparison of genome structure and arrangement only (e.g. Fig. 3), however, our nanopore-only pipeline, which uses Canu pre-correction, Flye assembly and post-assembly polishing with Nanopolish, can produce single contig assemblies of adequate accuracy for all but the most unusual B. pertussis genomes.
Continued improvement of long-read data processing tools Although the pipelines we have defined here produce the most accurate B. pertussis genome sequences currently possible, the tools available for the analysis of nanopore sequencing data are continually improving. A recent update to Racon added the ability to polish assemblies with Illumina reads; a brief comparison of this with Pilon, however, showed little improvement to our data, so we did not add short-read Racon polishing to our suite of tests. Alternative basecallers such as Chiron [59] or the currently indevelopment Guppy, which use entirely new basecalling algorithms, may also offer further accuracy improvements and could be trialled with existing and future B. pertussis data sets, particularly if Illumina short reads are not available for hybrid assembly.
We tested the most commonly used de novo assembly tools suitable for long reads and, at the time of writing, are not aware of any newly released tools. However, minor (or sometimes major, in the case of ABruijn to Flye) updates are common. New polishing tools are also being developed: ONT's own Medaka, for example, is claimed to rival Nanopolish in terms of speed and assembly improvement capabilities [60]. In addition, MaSuRCA [61] was not trialled here due to the low Illumina coverage (the manual suggests 50Â+ for hybrid assemblies, whereas we had only 37.5Â coverage for UK36). Ultimately, for the foreseeable future, no data pipeline including nanopore reads should be set in stone; we will continue to trial new tools and to update our pipeline where appropriate, and would suggest that similar pipeline optimization may be required for each organism to be sequenced.

Funding information
This work was funded by the University of Bath and Oxford Nanopore Technologies.