Complete genome assembly of clinical multidrug resistant Bacteroides fragilis isolates enables comprehensive identification of antimicrobial resistance genes and plasmids

ABSTRACT
Bacteroides fragilis constitutes a significant part of the normal human gut microbiota and can also act as an opportunistic pathogen. Antimicrobial resistance and the prevalence of antimicrobial resistance genes are increasing, and prediction of antimicrobial susceptibility based on sequence information could support targeted antimicrobial therapy in a clinical setting. Complete identification of insertion sequence (IS) elements carrying promoter sequences upstream of resistance genes is necessary for prediction of antimicrobial resistance. However, de novo assemblies from short reads alone are often fractured due to repeat regions and the presence of multiple copies of identical IS elements. Identification of plasmids in clinical isolates can aid in the surveillance of the dissemination of antimicrobial resistance, and comprehensive sequence databases support microbiome and metagenomic studies. Here we test several short-read, hybrid and long-read assembly pipelines by assembling the type strain B. fragilis CCUG4856T (=ATCC25285=NCTC9343) with Illumina short reads and long reads generated by Oxford Nanopore Technologies (ONT) MinION sequencing. Hybrid assembly with Unicycler, using quality filtered Illumina reads and Filtlong filtered and Canu corrected ONT reads, produced the assembly of highest quality. This approach was then applied to six clinical multidrug resistant B. fragilis isolates and, with minimal manual finishing of chromosomal assemblies of three isolates, complete, circular assemblies of all isolates were produced. Eleven circular, putative plasmids were identified in the six assemblies, of which only three corresponded to a known cultured Bacteroides plasmid. Complete IS elements could be identified upstream of antimicrobial resistance genes; however, there was not complete correlation between the absence of IS elements and antimicrobial susceptibility. As our knowledge on factors that increase expression of resistance genes in the absence of IS elements is limited, further research is needed prior to implementing antimicrobial resistance prediction for B. fragilis from whole genome sequencing.

REPOSITORIES
Sequence files (MinION reads de-multiplexed with Deepbinner and basecalled with Albacore in fast5 format and Illumina MiSeq reads in fastq format) and final genome assemblies have been deposited to NCBI/ENA/DDBJ under Bioproject accessions PRJNA525024, PRJNA244942, PRJNA244943, PRJNA244944, PRJNA253771, PRJNA254401, and PRJNA254455.

IMPACT STATEMENT
Bacterial whole genome sequencing is increasingly used in public health, clinical, and research laboratories for typing, identification of virulence factors, phylogenomics, outbreak investigation and identification of antimicrobial resistance genes. In some settings, diagnostic microbiome amplicon sequencing or metagenomic sequencing directly from clinical samples is already implemented and informs treatment decisions. The prospect of predicting antimicrobial susceptibility based on resistome identification holds promise for shortening the time from sample to report and informing treatment decisions. Databases with comprehensive reference sequences of high quality are a necessity for these purposes. Bacteroides fragilis is an important part of the human commensal gut microbiota and is also the most commonly isolated anaerobic bacterium from non-faecal clinical samples, but few complete genome assemblies are available through public databases. The fragmented assemblies from short read de novo assembly often preclude the identification of insertion sequences upstream of antimicrobial resistance genes, which is necessary for prediction of antimicrobial resistance from whole genome sequencing. Here we test multiple assembly pipelines with short read Illumina data and long read data from Oxford Nanopore Technologies MinION sequencing to select an optimal pipeline for complete genome assembly of B. fragilis. However, B. fragilis has a highly plastic genome with multiple invertible repeat regions, and complete genome assembly of six clinical multidrug resistant isolates still required minor manual finishing for half the isolates. Complete identification of known insertion sequences and resistance genes was possible from the complete genomes. In addition, the current catalogue of Bacteroides plasmid sequences is augmented by eight new plasmid sequences that do not have corresponding, complete entries in the NCBI database. This work almost doubles the number of publicly available complete, finished chromosomal and plasmid B. fragilis sequences, paving the way for further studies on antimicrobial resistance prediction and increased quality of microbiome and metagenomic studies.

DATA SUMMARY
Sequence read files (Oxford Nanopore (ONT) fast5 files and Illumina fastq files) as well as the final genome assemblies have been deposited to NCBI/ENA/DDBJ under Bioproject accessions PRJNA525024, PRJNA244942, PRJNA244943, PRJNA244944, PRJNA253771, PRJNA254401, and PRJNA254455. Demultiplexed ONT reads in fastq format, trimmed of adapters and barcode sequences, are available at doi.org/10.5281/zenodo.2677927. Genome assemblies from the assembly pipeline validation are available at doi.org/10.5281/zenodo.2648546. Genome assemblies corresponding to each stage of the assembly process are available at doi.org/10.5281/zenodo.2661704. Full commands and scripts used are available from GitHub (https://github.com/thsyd/bfassembly) as well as in a static version at doi.org/10.5281/zenodo.2683511.



INTRODUCTION
Bacteroides fragilis is a Gram-negative anaerobic bacterium that is commensal to the human gut but can act as an opportunistic pathogen; it is the most commonly isolated anaerobic bacterium from non-faecal clinical samples (1). Antimicrobial resistance rates are increasing for B. fragilis, especially for carbapenems and metronidazole, two widely used antimicrobials for treatment of severe infections and anaerobic bacteria (2,3). Antimicrobial susceptibility testing of anaerobes using agar dilution or gradient strip methods can be costly and labour intensive and, despite efforts to validate disk diffusion as a less expensive option, turn-around time will still be at least 18 hours and validation for individual species will be required (4).
Antimicrobial resistance prediction from bacterial whole genome sequences, from cultured isolates as well as metagenomes, could be implemented in clinical microbiology in the near future, with the potential for improved sample-to-report turnover time and possibly eliminating the need for phenotypical testing for individual species (5-8 [...] genes associated with resistance to metronidazole (nim genes) and clindamycin (erm genes) (1).
In 2014 we observed that identification of IS elements upstream of known antimicrobial resistance genes in B. fragilis was hampered in short read de novo assemblies even though the genes could be identified (14). This occurred because contigs were often terminated close to the start of the resistance genes, presumably due to the proliferation of multiple copies of the same IS elements throughout the B. fragilis genomes. Genome assemblies from short read sequencing technologies alone most often result in fragmented assemblies because of repetitive regions and genome elements with multiple occurrences in the chromosomes and plasmids (15,16 [...] aureus. Improving the representation of complete assemblies of B. fragilis in the public genome databases will support the development of antimicrobial resistance prediction from WGS as well as microbiome and metagenomic analysis projects.
The aims of this study were to select an optimal assembly software pipeline for complete, circular assembly of Bacteroides fragilis and to demonstrate the utility of complete assembly for both plasmid identification and comprehensive detection of genes and IS elements associated with antimicrobial resistance. We assembled the B. fragilis CCUG4856T (= ATCC25285 = NCTC9343) reference strain utilising long reads generated with the MinION sequencer from Oxford Nanopore Technologies (ONT) and high-quality Illumina short reads, and selected the best assembly pipeline by comparing assemblies to the Sanger sequenced reference NCTC9343 (RefSeq accession GCF_000025985.1). The best assembly pipeline was then applied to six clinical multidrug resistant B. fragilis isolates from our 2014 study (14).

Culture conditions and DNA extraction
Bacteroides fragilis CCUG4856T and the six strains described in our previous study were included (14,21). Strains were stored at -80 °C in beef extract broth with 10% glycerol (SSI Diagnostica) and cultured on solid chocolate agar with added vitamin K and cysteine (SSI Diagnostica) for 48 hrs in an anaerobic atmosphere at 35 °C. Ten µl of culture was transferred to 14 ml saccharose serum broth (SSI Diagnostica) and incubated for 18 hrs under the same conditions. DNA was then extracted using the Genomic-Tip G/500 kit (Qiagen) following the manufacturer's protocol for Gram-negative bacteria and eluted into 5 mM Tris pH 7.5, 0.5 mM EDTA buffer. Quality control was performed by measuring fragment length on a TapeStation 2500 (Genomic DNA ScreenTape, Agilent), purity on the NanoDrop (ThermoFisher Scientific) and concentration on the Qubit (dsDNA BR kit; Invitrogen). The eluted DNA was then stored at -20 °C.

Illumina library preparation, sequencing and quality control
The strains had previously been sequenced and assembled using Illumina short reads for our previous study (14), but to minimise biological disparities we opted to re-sequence with Illumina using the same DNA extraction prepared for long read sequencing.

Assembly validation
To select and validate the optimum assembly pipeline, Bacteroides fragilis CCUG4856T was assembled using a variety of well-known assemblers and polishing tools (Table 1). Each assembler was run with the Filtlong filtered reads as input or the filtered reads corrected with Canu 1.8 (with standard settings, corMinCoverage=0, or corOutCoverage=999). Canu was also tested with the unfiltered reads as input. Hybrid assemblers used the filtered long reads and the filtered, trimmed and down-sampled Illumina reads. Unicycler includes polishing with Racon and Pilon. For assemblers other than Unicycler, Racon polishing with ONT reads was run for one or two rounds and Pilon was run until no changes were made or for a maximum of six rounds. Racon polishing with Illumina reads was run for one round.
The original Sanger sequenced Bacteroides fragilis NCTC9343 (=CCUG4856T) (21), downloaded from NCBI RefSeq (accession GCF_000025985.1), was used as reference sequence for the assembly comparisons, and Quast v5.0.2 was used for assembly summary statistics, indel count, and k-mer-based completion (32). BUSCO v3.0.2b with the bacteroidetes_odb9 dataset, CheckM v1.0.12, and Prokka v1.13.3 were used to assess gene content (33-35). Average nucleotide identity was calculated using https://github.com/chjp/ANI/blob/master/ANI.pl, and ALE v0.9, which uses a likelihood based approach to assess the quality of different assemblies, was also used to score the assemblies (36,37). Ranking of assemblies was based on number of contigs, number of circular contigs, closeness of total length to the reference genome, number of local misassemblies, number of mismatches per 100 kb, number of indels per 100 kb, average nucleotide identity (ANI), CheckM and BUSCO scores, and the total ALE score (a higher score is better). Please see https://github.com/thsyd/bfassembly for full bioinformatics methods.
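As an illustration of how a multi-metric ranking of this kind could be computed, the sketch below sums per-metric ranks and orders assemblies by the total. This is a minimal sketch under stated assumptions, not the authors' procedure: the text does not say how the individual metrics were combined, and the metric names, the subset of metrics and the example values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical metric names (a subset of those listed in the text).
# True means a higher value is better, False means a lower value is better.
HIGHER_IS_BETTER = {
    "contigs": False,
    "mismatches_per_100kb": False,
    "indels_per_100kb": False,
    "ani": True,
    "ale_score": True,
}


def rank_assemblies(metrics: Dict[str, Dict[str, float]]) -> List[str]:
    """Order assemblies by the sum of their per-metric ranks (lowest sum first).

    Ties within a metric are broken by input order, because Python's sort
    is stable; a real analysis may want shared ranks instead.
    """
    rank_sum = {name: 0 for name in metrics}
    for metric, higher_better in HIGHER_IS_BETTER.items():
        ordered = sorted(
            metrics, key=lambda n: metrics[n][metric], reverse=higher_better
        )
        for rank, name in enumerate(ordered, start=1):
            rank_sum[name] += rank
    return sorted(metrics, key=lambda n: rank_sum[n])


# Invented values, for illustration only.
example = {
    "hybrid": {"contigs": 2, "mismatches_per_100kb": 0.5,
               "indels_per_100kb": 0.3, "ani": 99.99, "ale_score": -1.0e6},
    "long_only": {"contigs": 2, "mismatches_per_100kb": 5.0,
                  "indels_per_100kb": 20.0, "ani": 99.90, "ale_score": -2.0e6},
    "short_only": {"contigs": 90, "mismatches_per_100kb": 1.0,
                   "indels_per_100kb": 0.5, "ani": 99.98, "ale_score": -3.0e6},
}
print(rank_assemblies(example))
```

A rank sum weights every metric equally; weighting contiguity or misassemblies more heavily would be an equally defensible design choice.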

Genome assembly of MDR B. fragilis isolates
The assembly strategy deemed to produce the highest quality genome for CCUG4856T was chosen for initial assembly of the six MDR B. fragilis isolates. Manual finishing of incomplete assemblies was performed using Bandage for visualisation of assembly graphs and BLASTn searches (38). Minimap2 and BWA MEM were used to map reads to the assemblies for coverage graphs (39,40). Long read assembly with Flye was compared to the Unicycler assembly and used to guide and validate the manual finishing results. Circlator's fixstart task was used to fix the start position of the manually finished genomes to be at the dnaA gene (41).
The assembled genomes were submitted to NCBI GenBank and annotated with PGAP (42 [...]
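Fixing the start position of a circular genome amounts to rotating the sequence so that a chosen coordinate becomes the first base. The sketch below shows only that rotation, assuming the 0-based start coordinate of dnaA is already known; it is not Circlator's implementation, and it does not handle a gene located on the reverse strand.

```python
def rotate_circular(seq: str, start: int) -> str:
    """Rotate a circular sequence so that 0-based index `start` becomes index 0."""
    if not seq:
        return seq
    start %= len(seq)  # allow coordinates that wrap past the end
    return seq[start:] + seq[:start]


# Toy sequence; in practice `start` would be the first base of the dnaA gene.
print(rotate_circular("ACGTTG", 2))
```

Because the molecule is circular, the rotation changes only the coordinate system, not the sequence content, which is why it can be applied after manual finishing.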

Identification of plasmids and mobile genetic elements
The PLSDB web server (https://ccb-microbe.cs.uni-saarland.de/plsdb/) (data v. 2019_03_05) contains bacterial plasmid sequences retrieved from the NCBI and was used for screening and identifying putative plasmid sequences (46). Only hits to accessions from cultured organisms were included. Putative plasmids not identified using PLSDB were evaluated by the read depth relative to the chromosome (higher relative read depth indicates plasmid sequence) and Pfam families covering known plasmid replication domains from Table 1 [...]

Selecting the optimal assembly pipeline
A total of 141 assemblies of B. fragilis CCUG4856T were generated using the various assemblers and polishing steps (Supplementary Table S2). Compared to the reference genome, Unicycler assemblies were of the highest quality (Table 2). Unicycler, with any of the read input options, produced two circular contigs of the expected lengths, and the differences between the various Unicycler assemblies were minimal (Table 3). Assemblies with Canu corrected reads showed slightly higher genome fractions and average nucleotide identities to the reference and fewer mismatches and indels, when compared to Unicycler alone. Unicycler assemblies corrected with Racon using Illumina reads worsened slightly overall, with 0.04-0.19 more indels and 0.14-0.25 more mismatches per 100 kbp. Based on this initial evaluation, the assembly pipeline using Canu corrected reads with default options was chosen (Assembly "OF.CS" in Table 3). This would reduce the number of long reads, compared to Canu correction with corMinCoverage=0 or corOutCoverage=999, and thereby lead to a faster run-time for Unicycler.
The hybrid Unicycler assembly of CCUG4856T with standard Canu corrected ONT reads consists of two circular contigs of 5,205,133 and 36,560 bp in length. The plasmid is the same length as plasmid pBF9343 from the reference assembly GCF_000025985.1 and the chromosome is seven bases shorter. Alignments of the Sanger sequenced assembly GCF_000025985.1 with the hybrid Unicycler assembly show an 88,045 bp inversion in the hybrid assembly compared to the Sanger assembly (Figure 1). This inversion is present in all the best assemblies, including assemblies derived from solely ONT sequences or Illumina sequences (Supplementary Figure S1), as well as in two additional assemblies of NCTC9343/ATCC25285 from PacBio and Illumina sequences downloaded from NCBI RefSeq (Supplementary Figure S2).

Complete assembly of six multidrug resistant isolates
Unicycler, using filtered and trimmed Illumina reads and the Filtlong filtered and Canu corrected ONT reads from the first sequencing runs, generated complete, continuous, circular assemblies for two of the six isolates (BFO18 and BFO67) (Figure 2). For the assemblies that were not complete with sequencing data from the first MinION runs, increasing the amount of ONT data resulted in fewer contigs overall, except for BFO67, where the additional data from the second sequencing run led to a fragmented assembly and manual finishing was necessary. Assembly of isolate S01 without Canu correction of the ONT reads from the first sequencing run resulted in a closed chromosome, whereas Canu correction of the reads resulted in fragmentation of the chromosome. This was ameliorated by including more ONT data. By manual finishing using read mapping and additional assembly with Flye, the remaining three assemblies were circularised. Chromosomes varied in length from 5,141,257 to 5,504,076 bp. Alignment of ONT and Illumina reads to the chromosome assemblies showed even coverage for both sequencing technologies (Supplementary Figure S3). For BFO85 a >100% relative read depth increase was observed at approximately 25kb-38kb. This could represent a 12 kb repeat region that was not resolved in the assembly. Seven (47%) of the 15 PGAP annotated CDSs in the 13kb region were annotated as hypothetical proteins. None of the annotated CDSs represented mobilisable proteins.

Eleven putative plasmid sequences were identified
A total of 11 putative circular plasmids were identified in the six B. fragilis isolates (Table 4). Zero to three putative plasmids were identified per isolate, with lengths varying from 2,782 to 85,671 bp. The PLSDB database contains NCBI RefSeq plasmid sequences marked as complete. Three of the 11 putative plasmid sequences were found to match (ID > 98%) a sequence in PLSDB (Table 4). These three all matched the cryptic plasmid pBFP35 (49 [...] Figure S4) and the circular structures of the two sequences lacking a predicted replication domain were confirmed manually by visually inspecting BLASTn mapping of ONT sequences longer than 10 kbp to the assembled plasmid sequences with CLC Genomics Workbench 10 (Qiagen). Eleven and 22 ONT reads spanned the complete lengths of pBFO17_1 and pBFS01_1 respectively and contained no other elements. pBFO17_1 and pBFS01_1 are close to 100% similar, except for approximately 7,500 bp of transposase and prophage sequences in pBFO17_1 (Figure 3). No alignment to chromosomal sequences of any of the included B. fragilis isolates was observed using progressiveMauve (not shown) (55).
The GC content of pBFO17_1 and pBFS01_1 are 36. 78% [...] above support the assembly data suggesting these two sequences are in fact plasmids.
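Two of the lines of evidence used above, GC content and read depth relative to the chromosome, are simple to compute once a contig sequence and mean depths are available. The following is a minimal sketch; the function names are hypothetical and the depths are assumed to be mean per-base depths taken from a read mapping, which is not necessarily how the authors derived them.

```python
def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def relative_depth(contig_depth: float, chromosome_depth: float) -> float:
    """Mean read depth of a contig divided by that of the chromosome."""
    if chromosome_depth <= 0:
        raise ValueError("chromosome depth must be positive")
    return contig_depth / chromosome_depth


# Toy inputs, for illustration only.
print(gc_percent("GGCCAATT"))
print(relative_depth(200.0, 100.0))
```

A relative depth well above 1, or a GC content that departs from that of the chromosome, is suggestive rather than conclusive, which matches the cautious wording used for pBFO17_1 and pBFS01_1.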

Detection of antimicrobial resistance genes and insertion sequence elements
We used ABRicate to screen assemblies for AMR genes (ResFinder, NCBI and CARD databases supplemented with sequences for bexA and bexB) and IS elements (IS-finder database); several AMR genes, possible homologs to known AMR genes and IS elements adjacent to the AMR genes were detected (Table 5). Of note, isolate BFO17 contains two homologs of the metronidazole resistance gene nimJ (with a 100% consensus) and two isolates, S01 and BFO85, harbour two homologs of the tetracycline resistance gene tetQ. Homologs to bexA and bexB were identified with 73.53-99.12 %ID and were all confirmed with BLASTx searches against the NCBI nr database, as was done in our previous study (14). Partial hits for ugd were observed for several isolates, but with low %ID and %COV, and possibly represent identification of conserved domains rather than ugd homologs. Increased expression of the cfiA metallo-beta-lactamase gene, nim-family 5-nitroimidazole genes and erm genes is partly regulated through IS elements containing promoter sequences. Full length IS elements could be identified upstream of 11 (79%) of 14 cfiA, nim and erm genes and upstream of two of three CfxA4 genes and the OXA-347 gene identified in BFO42. The described Bacteroides fragilis promoters TAnnTTTG (-7) and TG or TTG or TGTG (-33) (58) were searched for manually but could not be identified upstream of the two cfiA genes in isolates BFO67 and BFO85 or the ermB gene in BFO85, for which no IS elements could be detected upstream (not shown).
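The promoter search was done manually, but the -7 motif lends itself to a pattern search. The sketch below is one hypothetical way to automate it, not the authors' method: it finds TAnnTTTG in an upstream sequence and then checks for a TG dinucleotide (common to TG, TTG and TGTG) in a window further upstream. The window bounds `min_gap` and `max_gap` are assumptions, since the text gives only the approximate -7 and -33 positions.

```python
import re
from typing import List, Tuple

# -7 motif from the text: TAnnTTTG (n = any base). The lookahead reports
# overlapping occurrences as well.
MINUS7 = re.compile(r"(?=TA[ACGT]{2}TTTG)")


def promoter_candidates(
    upstream: str, min_gap: int = 15, max_gap: int = 25
) -> List[Tuple[int, bool]]:
    """Return (position of -7 motif, whether TG occurs in the assumed -33 window).

    Positions are 0-based indices into `upstream`. The window covers the
    bases from `max_gap` to `min_gap` before the start of the -7 motif;
    both bounds are illustrative assumptions.
    """
    seq = upstream.upper()
    results = []
    for match in MINUS7.finditer(seq):
        pos = match.start()
        window = seq[max(0, pos - max_gap):max(0, pos - min_gap)]
        results.append((pos, "TG" in window))
    return results


# Synthetic sequence: TTG placed 20 bp before a TAGCTTTG motif.
toy = "A" * 10 + "TTG" + "A" * 17 + "TAGCTTTG" + "ATG"
print(promoter_candidates(toy))
```

Because TG is so short, the -33 check alone is very permissive; any hit would still need the kind of manual inspection described above.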

Correlation between identified genes and IS elements and phenotypical resistance
As in our previous study, the cfiA gene was identified in the five meropenem resistant isolates (Table 5). All the cfiA genes were found on the chromosomal sequences. Complete IS elements were identified upstream of the cfiA genes in BFO17, BFO18 and S01, but not in BFO67 or BFO85. MICs for meropenem and imipenem were lower for these two isolates. nim genes (-A, -D, -E and -J) could be found in the four metronidazole resistant isolates, all with complete IS elements upstream. Three of the nim genes were found on putative plasmids of the respective isolates. The four clindamycin-resistant isolates all carried erm genes with upstream IS elements. A transposase was inserted in the ermF gene in isolate BFO18, splitting it in two, and the same isolate demonstrated a lower clindamycin MIC (6 mg/L) than the other three clindamycin resistant isolates.
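Whether an IS element lies "upstream" of a resistance gene depends on the gene's strand, which is easy to get wrong when reading coordinate tables. The sketch below makes the test explicit for two features on the same contig. The `Feature` type, the coordinates and the 500 bp distance cut-off are all hypothetical; the text does not state a distance threshold, and wrap-around on circular contigs is ignored.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def is_upstream(is_element: Feature, gene: Feature, max_distance: int = 500) -> bool:
    """True if the IS element lies within `max_distance` bp upstream of the gene.

    For a "+" strand gene, upstream means lower coordinates; for a "-"
    strand gene, it means higher coordinates. Overlapping features return False.
    """
    if gene.strand == "+":
        gap = gene.start - is_element.end - 1
    else:
        gap = is_element.start - gene.end - 1
    return 0 <= gap <= max_distance


# Invented coordinates, for illustration only.
gene = Feature("cfiA", 1000, 1800, "+")
print(is_upstream(Feature("IS_example", 200, 950, "+"), gene))
```

The orientation of the IS element itself is not checked here, although it matters biologically when the element supplies an outward-facing promoter.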

DISCUSSION
Hybrid genome assembly produces high quality B. fragilis genomes
The primary aim of this study was to select and validate an assembly method to reliably complete chromosome and plasmid assembly of B. fragilis genomes. From 141 assembly variations, a hybrid approach using Filtlong filtered and Canu corrected ONT reads with quality filtered Illumina reads as input to Unicycler produced a complete, closed assembly of B. fragilis CCUG4856T with high similarity to the original Sanger sequenced reference assembly. An 88 kb inversion was observed when comparing the two assemblies. Cerdeño-Tárraga and colleagues observed difficulties in resolving certain regions of the Sanger sequenced assembly of NCTC9343 due to invertible regions with flanking inverted repeat sequences (21). The observed inversion in the hybrid Unicycler assembly could be due to a) a superior assembly where the longer ONT reads have overcome the shortcomings of the shorter Sanger sequences, b) an incorrect assembly by Unicycler, c) a biological difference that has occurred over time between the strain stored at NCTC and CCUG, or d) a biological difference that occurred during the culturing of the strain, with dominance of a clone with the inversion, prior to DNA extraction as part of this study. The observations that the inversion is also present in all the best assemblies from this study and in assemblies from two other research institutions support the conclusion that the current hybrid Unicycler assembly represents the true orientation of the 88 kb sequence.

Complete genome assembly of three of the six multidrug resistant isolates required manual finishing
The assemblies of BFO18, S01, and BFO42 were completed by Unicycler without manual intervention, but the chromosomes of BFO17, BFO67, and BFO85 could only be closed by performing manual steps. The manual finishing steps are time consuming, difficult to replicate and easily biased. In order to be implemented in routine clinical laboratories, large scale, automated, complete assembly of prokaryote genomes requires robust methods with minimal human interaction. Genome assembly using another long-read assembler, Flye, supported the results of the manual finishing for two of three isolates. Flye is better at resolving repeats than miniasm, the long read assembler included in the Unicycler pipeline (59). One option could be to include the long-read assembly from Flye, in place of that of miniasm, to guide bridge building for the higher quality Illumina-only contigs produced in the first steps of Unicycler. To resolve repeats it is often necessary to have long reads that span the repeat. In prokaryotes, repeats over 10 kb are not unusual and they are often spanned by the ONT reads generated, even by novice researchers. However, repeat regions of up to 120 kb and duplications of 200 kb have been described in some prokaryotes (17,18,60). ONT sequencing runs will routinely result in many reads that span the majority of repeats, but obtaining ONT reads that span specific 120-200 kb repeats in a genome of interest still requires skill and a certain amount of luck. [...]
We chose to benchmark a selection of widely used genome assemblers for short read, long-read and hybrid bacterial genome assembly as well as polishing tools for long read assemblies, but many other options have been published. Most assemblers and polishing tools were run using default parameters, and it is possible that further optimisation of settings for the individual software packages might have improved assemblies further than was demonstrated here. As sequencing technologies and assembly software continue to improve, continued validation of pipelines is advisable. Software such as poreTally provides user friendly options for benchmarking genome assembly pipelines prior to implementation (62).

Bacteroides plasmids are not well represented in public databases
A secondary aim of this study was to identify plasmids in the hybrid assemblies. Automated tools have been developed and validated for identification of plasmids from genome assemblies or read data, but they are dependent on collated databases of known plasmid sequences. As such, tools such as PlasmidFinder or mlplasmids can be applied for plasmid identification for Enterobacteriaceae or Enterococcus faecium, but B. fragilis is not supported at the time of writing (63,64). Therefore, we evaluated putative plasmid sequences by sequence identity and length comparison using the PLSDB webpage, identifying plasmid replication domains, and using circularisation and relative coverage as indicators that a sequence represents a plasmid in a given isolate.
Only four of the twelve plasmid sequences from the seven isolates could be identified using the PLSDB and three of these were the same plasmid, pBFP35. Two other putative plasmids, pBFO18_1 and pBFS01_2, were likely plasmids pBF388c and pIP421, based on the partial sequences from these plasmids and plasmid length. This still leaves half of the circularised, putative plasmids unidentified. The two longer putative plasmids, pBFO17_1 and pBFS01_1, displayed a high degree of similarity, a GC% outside the normal range for B. fragilis, and a relative read depth of double that of the chromosome. Most annotated CDSs were associated with mobilisable elements, but no known plasmid replication domains could be identified. From the sequencing data alone, we cannot conclude that they represent true plasmids; however, the findings above and manual inspection of long read mapping support that inference.
There are only 14 complete plasmid sequences from cultured Bacteroides isolates in the PLSDB v2019_03_05, which is based on the NCBI RefSeq database. Many other Bacteroides plasmids have been partially described, and some are represented by partial sequences or marked as contig level in the NCBI nucleotide database (65-68). Metagenomic sequencing and genome assembly projects are expanding the public sequence databases and, by screening the NCBI nucleotide database, sequences with a high degree of similarity to the putative plasmid sequences from one patient isolate (BFO18) could be found. These originated from a rat caecum metagenomic plasmid sequencing project from Copenhagen, a few hours' drive from Odense University Hospital. To understand and perform surveillance of the dissemination of plasmids, there is a need for increased submissions of high quality, annotated and phenotypically validated sequences of bacterial isolates including plasmids. This study adds significantly to the number of complete plasmid sequences associated with Bacteroides.

Complete assembly allows comprehensive identification of resistance determinants in B. fragilis
We also intended to comprehensively identify resistance genes and IS elements in the hybrid genome assemblies. Using ABRicate with several resistance gene databases and IS-element nucleotide sequences, the findings of our previous study were confirmed and enhanced. Assemblies from Illumina sequencing alone would only allow partial IS element identification (14). Now, with the complete assemblies, comprehensive identification of known IS elements upstream of the relevant resistance genes could be completed. In our first study we used ResFinder with the database available at that time. Now, by including several databases and lowering the %ID threshold, the number of genes identified increased. Additionally, as a result of the complete genome assembly of BFO17, we could now identify two copies of nimJ, while only one copy was identified in the short read draft assembly of the same isolate in the previous study. Husain and colleagues identified the presence of three copies of nimJ in strain HMW615 when describing the nimJ gene (69). We confirmed this finding by running ABRicate on the HMW615 assembly as done with the isolates of this study (not shown). Interestingly, RAST annotates a third nim gene (nucleotide positions 1,359,590..1,360,093) in the Unicycler hybrid assembly of BFO17, and the PGAP annotation includes an additional annotation of a pyridoxamine 5'-phosphate oxidase family gene (nucleotide positions 940,032..940,505), the family that includes the nim genes. It is possible that one or more novel homologs of the nim genes are present in BFO17.
IS elements could be identified upstream of most relevant resistance genes. However, in three cases no IS element was present upstream of a resistance gene, even though the isolates displayed phenotypical resistance associated with increased expression of the specific gene. Known B. fragilis promoter sequences could not be identified upstream of the genes "missing" upstream IS elements; however, B. fragilis promoters are still not completely described, so it is possible that there are other unknown variants.
By selecting an optimal genome assembly strategy for B. fragilis, supplemented with minimal manual finishing efforts, and applying this to six multidrug resistant isolates, the number of complete B. fragilis genomes and plasmids in the public databases has now almost doubled. The future aim of performing antimicrobial resistance prediction based solely on WGS information for B. fragilis demands near-complete genomes for identification of IS elements upstream of resistance genes. However, we must caution that the absence of an IS element upstream of cfiA does not always correlate with susceptibility to carbapenems. Future studies are needed to address this, and utilising complete genome assembly for genome wide association studies is one approach that could be pursued. Technologies that provide a single solution for real-time, high-quality sequencing of long reads will be essential for implementing near real-time diagnostics of infectious diseases and characterisation of pathogens.

AUTHOR STATEMENTS
Authors and contributors

Conflicts of interest
The authors declare that there are no conflicts of interest.