Complete microbial genomes for public health in Australia and the Southwest Pacific

Complete genomes of microbial pathogens are essential for the phylogenomic analyses that increasingly underpin core public health laboratory activities. Here, we announce a BioProject (PRJNA556438) dedicated to sharing complete genomes chosen to represent a range of pathogenic bacteria with regional importance to Australia and the Southwest Pacific; enriching the catalogue of globally available complete genomes for public health while providing valuable strains to regional public health microbiology laboratories. In this first step, we present 26 complete high-quality bacterial genomes. Additionally, we describe here a framework for reconstructing complete microbial genomes and highlight some of the challenges and considerations for accurate and reproducible genome reconstruction.


INTRODUCTION
Whole-genome sequence (WGS) data are now a critical tool in public health microbiology [1][2][3][4]. The data can be used to replicate many of the now commonly used microbiological sub-typing methodologies, as well as support epidemiological investigations, such as surveillance and outbreak investigation [5][6][7]. The appeal of WGS data comes from the promise of a single workflow to process all microbial pathogens, and the provision of easily portable data that promote deeper integration of surveillance and investigation efforts across jurisdictions. This promise is leading to a concerted effort to move microbial public health to a primarily genome-based workflow in numerous countries [8][9][10], including Australia [11].
Essential to the success of this transition to a genomics workflow is the need to develop catalogues of high-quality complete reference genomes of microbial pathogens [12].

Complete bacterial genomes can provide valuable insights, for instance, into the genomic context of virulence and antimicrobial resistance genes [13], and their possible mechanisms of action. More importantly, complete genomes are essential for generating accurate phylogenomic analyses, a core requirement of public health surveillance and outbreak response. In this setting, they provide valuable context to identify variable genomic regions across samples in a given study in a computationally efficient manner [14][15][16].
However, pathogenic bacteria are not generally composed of uniform panmictic populations. Instead, they represent numerous diverse clades, with many being endemic to particular regions or jurisdictions [17][18][19][20][21][22][23]. We define the latter as clones that are repeatedly observed in a given region with evidence of ongoing local transmission, but are not commonly observed in other parts of the world; a more practical definition is given by clones observed in local outbreaks for which no suitable reference genome is available in the public domain. This inherent population structure poses a challenge to a successful transition to genomics in public health microbiology laboratories and can significantly reduce the resolution of phylogenomic analyses by influencing the identification of genetic variants [24,25]. Thus, catalogues of complete genomes will only be effective in supporting a transition to genomics in public health microbiology if they are rich in endemic strains.
Here we present the establishment of a genome catalogue for microbial pathogens of regional importance to Australia and the Southwest Pacific, and describe the first 26 complete genomes to be added. We will continue to build on this resource as further strains are sequenced and assembled.

Whole-genome sequencing
All strains were grown in appropriate media for the organism following standard laboratory protocols. Whole-genomic DNA was extracted using various methods, selected based on the species to ensure high-quality DNA for short- and long-read sequencing (outlined in Fig. 1). For the Chemagic Viral DNA/RNA kit (PerkinElmer), the GenElute Bacterial Genomic DNA kit (Sigma-Aldrich) and the Nanobind CBB Big DNA kit (Circulomics), extractions were performed as per the manufacturer's recommendations. For mycobacterial species, the protocol outlined in McNerney et al. [26] was used with the following modifications. Growth from five Brown and Buckle slopes was used for gDNA extraction, dissolved in 500 µl molecular grade (Ultrapure, Life Technologies) water and heated for 60 min at 80 °C. Following incubation with lysozyme (Sigma-Aldrich), samples were mixed by manual inversion and all incubation steps were performed at 60 °C. Samples were eluted in EB Buffer (Qiagen, 10 mM Tris/Cl, pH 8.5 buffer) by overnight incubation at 4 °C followed by incubation at 60 °C for 15 min, and then centrifugation.
Short-read genomic DNA was sequenced on either the Illumina NextSeq500/550 (2×150 bp paired-end) or MiSeq (2×300 bp paired-end) platforms. Long-read genomic data were produced on either the PacBio RS-II (P6-C4 chemistry) or Oxford Nanopore GridION X5 (with FLO-MIN106D R9 flow cells) platforms. The DNA extraction, library preparation and sequencing workflow is illustrated in Fig. 1, with strain-specific methodology provided in the figure, Table S1, and the respective NCBI BioSample record (Table 1).

Significance as a BioResource to the community
Reference-based bioinformatic analyses are increasingly being used to enhance public health activities, with comparative genomics having been shown to appreciably assist in outbreak investigation and in understanding the genetic context underlying clinically relevant phenotypes. However, reference-based analyses are inherently constrained by the genetic similarity of the reference strain to the population being studied; consequently, a catalogue of high-quality reference strains is required to support the diverse analyses undertaken in the public health environment.
The genomes reported here represent the first 26 reference strains to be incorporated into a new public health resource: a collection of diverse bacterial pathogens of importance to Australia and the Southwest Pacific (including curated genomic and phenotypic data) that will continue to be added to in a multi-jurisdictional collaboration between public health laboratories in the region. To further support public health activities, we also provide a detailed framework for bacterial genome reconstruction, using a combination of short- and long-read sequence data generated from different platforms. Included is a discussion of the challenges encountered and the considerations made to ensure both accuracy and reproducibility in the construction of these reference genomes.

Fig. 1. Schematic of the methodology used for sequence data generation and genome assembly. *Modifications to the published CTAB method are described in the methods section. **Nanopore data were filtered to 100x for the expected species size, preferencing quality and length equally, using Filtlong v0.2.0. ***PacBio data were filtered using a minimum read quality (mQ) of 0.80. (a) Multiple versions used; refer to the methods section and supplementary tables. P/O = parameters/options (that differ from default); mRL = minimum read length; cOR = corOutCoverage; GS = genome size ((b) set as the Mb closest to the species average); cER = correctedErrorRate ((c) set as 0.144 for Nanopore or 0.045 for PacBio data); PLD = plasmid flag used; SLR = seed read length; oER = overlapper error rate. (d) Start replicon was dnaA for chromosome sequences and rep for plasmid sequences, based on prokka annotations.

Genome assembly
Before de novo assembly, all sequence data underwent quality control (QC) to ensure sufficient depth and basecall quality, and that the sequence data represented the expected organism based on a kraken2 [27] analysis (Fig. 1). Confirmation of Shigella sonnei identification was performed by phylogenetic analysis of the strains against samples described elsewhere [28]. In the case of PacBio data, provided that the above QC requirements were met, the consensus circular subreads fastq files were concatenated and used as the input for de novo assembly. In the case of Nanopore data, sequence data were basecalled onboard the GridION X5 using ONT's Guppy basecaller v3.2.4 with the high accuracy protocol. Demultiplexing and adaptor trimming were performed using Porechop v0.2.4 (https://github.com/rrwick/Porechop). Default parameters were used with two exceptions: to be kept, a read required (i) both barcodes to be identified with (ii) a minimum identity of 85 %. Long-read datasets were then filtered using Filtlong (filtering parameters are given in the Fig. 1 legend).
All genomes were assembled using four different approaches: three were applied to all datasets, and the fourth was specific to the sequencing platform (one for PacBio data only and one for ONT data only), as outlined in Fig. 1.
Following assembly, all draft genomes were compared for structural consistency and a single assembly was selected.
Features considered during comparison were: (i) interassembly variation in genome size and consistency with expected size for the species; (ii) number and location of ribosomal RNA genes; (iii) broad structural similarity as assessed visually from an alignment generated with Mauve v1.1.1 [34], looking for large differences (i.e. >5 kb) including inversions, duplications and deletions; (iv) representation of small replicons (e.g. plasmids and other mobile elements).
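Feature (i) lends itself to a simple automated check. The following Python sketch is illustrative only and is not part of the workflow used in this study: the function name, the 5 % tolerance and the example sizes in the usage note are assumptions chosen for demonstration.

```python
from statistics import median


def size_consistency(assembly_sizes, expected_size, tolerance=0.05):
    """Compare total assembly lengths (bp) with an expected species size
    and with the median of all assemblies of the same dataset.

    assembly_sizes: dict mapping an assembly label to its total length.
    tolerance: hypothetical relative deviation allowed (5 % by default).
    Returns a dict of booleans per assembly (True = within tolerance).
    """
    med = median(assembly_sizes.values())
    report = {}
    for label, size in assembly_sizes.items():
        report[label] = {
            # agreement with the size expected for the species
            "vs_expected": abs(size - expected_size) / expected_size <= tolerance,
            # agreement with the other assemblies of the same dataset
            "vs_median": abs(size - med) / med <= tolerance,
        }
    return report
```

For example, calling `size_consistency({"hybrid": 5_000_000, "flye": 5_010_000, "canu": 5_600_000}, expected_size=5_000_000)` would flag only the hypothetical `canu` assembly as inconsistent on both measures. Features (ii) to (iv) still require annotation and alignment, so a check of this kind could only triage assemblies before visual comparison.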
When selecting a final assembly, the hybrid assembly approach was prioritized. If required, selection between long read-only assembly approaches was based on which produced a structure that most closely represented the consensus of the assembly outputs. In general, this was HGAP3 (PacBio) or Flye (ONT), followed by Unicycler, then Canu; an order that is consistent with a benchmarking study that established performance standards for these assemblers in the context of long-read genome assembly [33].
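The selection order described above can be expressed as a short rule. This Python sketch is a simplification under stated assumptions: the approach labels are hypothetical, and the manual judgement of whether an output matches the consensus structure is reduced to a boolean supplied by the user.

```python
# Preference order taken from the text; the labels themselves are hypothetical.
PREFERENCE = ("hybrid", "hgap3_or_flye", "unicycler_long_read", "canu")


def select_final_assembly(matches_consensus):
    """Return the first approach, in preference order, whose output was
    judged to match the consensus structure of all assembly outputs.

    matches_consensus: dict mapping an approach label to True/False.
    Returns None when no output is acceptable, signalling that manual
    curation is required.
    """
    for approach in PREFERENCE:
        if matches_consensus.get(approach, False):
            return approach
    return None
```

In practice the order among the long read-only approaches was a general tendency rather than a fixed rule, so a pipeline would need to allow this ranking to be overridden.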
Following selection, the final assembly was assessed for orientation to an appropriate start replicon and adjusted if required. The assembly then underwent a final error correction with the short-read sequence data using Snippy v4.3 or v4.4, run in an iterative manner until no variants were detected. This was also performed on hybrid assemblies generated by Unicycler, even though the software performs its own short-read correction; the reasoning for this is explained below. Strain-specific assembly information is illustrated in Fig. 1, and provided in Table S1 and the respective NCBI BioSample record (Table 1).
Sequences representing plasmids were additionally checked for similarity to published sequences deposited in the NCBI database (https://www.ncbi.nlm.nih.gov/). Of the 39 plasmids recovered, only one was identified as novel: that belonging to the Salmonella enterica subsp. enterica serovar Birkenhead AUSMDU00010532. This sequence contained genes consistent with plasmid replication machinery.
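The iterative correction step follows a simple fixed-point pattern: call variants, apply them, and repeat until none remain. The Python sketch below shows only that control flow. It does not invoke Snippy; `call_variants` and `apply_variants` are hypothetical stand-ins for the mapping and correction steps, and the cap on the number of rounds is an added safeguard that is an assumption, not a documented part of the method.

```python
def polish_until_clean(assembly, call_variants, apply_variants, max_rounds=10):
    """Repeat short-read correction until no variants are detected.

    call_variants(assembly)  -> list of variants (empty when clean)
    apply_variants(assembly, variants) -> corrected assembly
    Both callables are placeholders for external tools.
    Returns (final_assembly, number_of_correction_rounds_applied).
    """
    for rounds_applied in range(max_rounds + 1):
        variants = call_variants(assembly)
        if not variants:
            # No differences between reads and assembly: stop here.
            return assembly, rounds_applied
        if rounds_applied < max_rounds:
            assembly = apply_variants(assembly, variants)
    # Variants were still being reported after max_rounds corrections.
    raise RuntimeError("polishing did not converge within max_rounds")
```

A cap of this kind guards against assemblies in which ambiguous regions cause corrections to oscillate rather than converge.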

RESULTS AND DISCUSSION
Here, we present 26 high-quality complete bacterial reference genomes. Strains were selected to represent a broad range of organisms of importance for public health in Australia and the Southwest Pacific. These included foodborne (e.g. L. monocytogenes, and S. enterica), waterborne (e.g. L. pneumophila), sexually transmitted (e.g. N. gonorrhoeae) and other pathogens of public health importance (e.g. K. pneumoniae, Mycobacterium sp., N. meningitidis). In some cases, we chose the strains because of their relevance to local surveillance and outbreak requirements as well as their virulence or antimicrobial resistance (AMR) phenotypes (e.g. colistin-resistant S. enterica, carbapenem-resistant K. pneumoniae and vancomycin-resistant Enterococcus faecium). Presented in Table 1 is a summary of the demographic and genomic characteristics (including in silico MLST and serotypes) of the 26 reference genomes. Phenotypic antimicrobial susceptibility data (when testing is appropriate for the given species) and the matched genotypic AMR profiles are presented in Table 2.
With the increasing use of genomics in public health investigation, high-quality reference genomes have become a key resource, but one that must be continually updated with shifts in circulating microbial lineages, the emergence of new outbreak clones and the ongoing spread of genetic material encoding clinically relevant phenotypes. The advances in long-read sequencing technology have made it increasingly cost-effective to generate the data needed to construct reference genomes. However, pipelines to automate the downstream assembly process are still missing. Such a pipeline would need to be capable of generating accurate and reproducible genomes and reliably handle genetically diverse datasets with minimal manual intervention. What follows is a discussion of our experiences with reconstructing the 26 genomes described in this paper and of the considerations that should be made in the generation of such a bioinformatic pipeline.

Assembly approaches; one size does not fit all
Of the 104 assemblies performed (4 per genome), only 54 were designated as 'complete' when considered in isolation; defined as an assembly output that included (i) a chromosomal contig that was circular and of an appropriate size for the species, and (ii) if present, circular plasmid contig(s) that matched published sequences (based on a nucleotide alignment and length comparison), or in the case of AUSMDU00010532 (Salmonella Birkenhead) carried genes encoding known plasmid replication machinery.
The remaining 50 assemblies were designated as either 'draft' (n=14) or 'failed' (n=36). In 'draft' assemblies, the contig(s) represented a full-length chromosome and plasmid(s), when present, but were not circularized (examination of the contig ends identified a sequence overlap, indicating that the entire replicon was reconstructed). Most 'failed' assemblies were fragmented, and there were only two assembly attempts in which the assembly approach produced no output (Table S2, available in the online version of this article).
Overall, the hybrid assembly approach produced the highest number of 'complete' assemblies (19/26). However, every approach produced an assembly designated as 'failed' on at least three datasets. This indicated that there is no one-size-fits-all approach to reconstructing the genomes of a diverse collection of strains, consistent with the performance of the various assemblers reported by Wick and Holt [33].
Consequently, a pipeline developed for long-read genome assembly in public health would need to incorporate multiple assembly tools and approaches to maximize performance.
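The three designations can be summarized as a decision rule. The Python sketch below is a simplified illustration, not the procedure applied in this study: the `Replicon` record and its two fields are hypothetical, and checks such as matching plasmids to published sequences are assumed to have been folded into `full_length`.

```python
from dataclasses import dataclass


@dataclass
class Replicon:
    """Hypothetical summary of one expected replicon in an assembly output."""
    full_length: bool  # a single contig spans the whole replicon
    circular: bool     # the contig was circularized


def classify_assembly(replicons):
    """Return 'complete', 'draft' or 'failed' for one assembly output."""
    # No output, or any replicon fragmented or missing: failed.
    if not replicons or not all(r.full_length for r in replicons):
        return "failed"
    # Every replicon is full length and circularized: complete.
    if all(r.circular for r in replicons):
        return "complete"
    # Full-length replicons, but at least one was not circularized: draft.
    return "draft"
```

As the following sections show, a 'complete' label assigned in isolation does not guarantee structural correctness, so a rule like this would only be a first filter.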

Structural consistency; assembling the same dataset twice does not always give the same result
With the examination of a few assembly metrics it is reasonably straightforward to detect large errors in reconstructed genomes, for example, fragmented or linear contigs, significant length inconsistencies for a given species, and the absence of genes of interest. However, there is a range of more subtle errors that may go undetected when an assembly output is considered in isolation. These include structural rearrangements (inversions, duplications and deletions), absence of plasmids and other small replicons, and the presence of superfluous contigs that do not represent true replicons. All of these could contribute to error in a reference-based analysis.
Therefore, all assemblies (with the exception of those that were highly fragmented or for which no output was produced) were compared for structural consistency (as outlined in the methods). This identified 28 assemblies with structural inconsistencies (Table S2), 22 of which were initially deemed 'complete' or 'draft' (due to linear contigs) when considered in isolation. Of these, 10 were inconsistent due to the absence of small replicon(s) and 16 due to the presence of inversions, duplications or deletions (the affected regions ranging in the 10s to 100s of kb). Four assemblies contained both inconsistencies.
These findings highlight a significant issue with reproducibility. Every assembly approach generated at least one output that was deemed to be structurally inconsistent. We will not comment on which assemblies were 'correct' or which approach produced the fewest 'incorrect' assemblies, as we have made our judgments based on which genome structure was most commonly observed in the assembly outputs, and have not performed the laboratory-based experiments required to determine which is biologically correct.
Regardless, our experience highlights that to achieve complete and reproducible genome reconstruction, multiple assembly approaches should be used, and their outputs compared. This creates challenges for pipeline generation, with the comparison of assembly outputs still requiring some level of manual curation. For example, of the 68 assemblies classified as 'complete' or 'draft' (ignoring structural inconsistencies), the outputs of 20 contained small superfluous contigs; artefacts of the assembly approaches that were not representative of true replicons (Table S2). A small number of these were circularized due to repetitive sequence and without comparison and curation would have been included in the final assembly.
Ultimately, which assembly is selected as the 'final' one is dependent on the species, available sequence data and the assembly approach(es) used. Our method was to select the assembly output that most closely represented the consensus of all outputs, was biologically consistent with the species and contained the characteristics expected for the strain (i.e. AMR determinants), favouring the hybrid assembly output when it met these requirements.

Table 2. Phenotypic antimicrobial susceptibility data and resistance gene profiles for the 26 reference genomes

Since the submission of our manuscript, our conclusions have been additionally supported by observations from others (R. Wick, personal communication), leading to the recent release of TryCycler (https://github.com/rrwick/Trycycler), a tool that may help automate comparisons of multiple genome assemblies and the identification of a consensus assembly. We look forward to contributing to the development of TryCycler by comparing our manually curated outputs with those produced by the tool.

Nanopore vs PacBio, and are short-read sequence data really needed?
There was no clear difference in the assembly outputs in terms of genome completeness or structural inconsistencies between the PacBio and ONT datasets. The main distinction between the two approaches is that the latter is more cost-effective for bacterial genomes and requires smaller amounts of input DNA. It has become standard practice at MDUPHL to undertake short-read sequencing on the same genomic DNA that is used for long-read sequencing. Our findings from these assemblies indicate that this is worthwhile, as the hybrid assembly approach, which utilizes both datasets in the assembly process, generated the highest number of 'complete' genomes (Table S2). Both short- and long-read sequence data can be generated in similar time frames in public health laboratories. However, multiplexing enhances cost savings and, outside of urgent public health activities, there is often a delay in the generation of long-read data while waiting until a sufficient number of samples are available to maximize the output of the platform.
Another common use for short-read sequence data in this context is in genome polishing; a process of correction by mapping the short reads to the long-read assembly and identifying the consensus. Multiple iterations of long-read polishing can result in 99.9% consensus between the reads and the assembly. However, in a bacterial genome this still equates to thousands of potential errors, the majority of which are insertions or deletions due to the known issue with basecalling homopolymers during long-read sequencing.
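The scale of the residual error follows directly from the arithmetic. The short Python sketch below works it through; the 5 Mb genome size is an illustrative assumption rather than a value for any strain in this study.

```python
def expected_errors(genome_size_bp, identity):
    """Approximate number of erroneous positions remaining in an assembly
    of the given length when read-to-assembly consensus equals `identity`
    (a fraction, e.g. 0.999 for 99.9 %)."""
    return round(genome_size_bp * (1.0 - identity))


# Illustrative genome size only: 5 Mb at 99.9 % consensus.
print(expected_errors(5_000_000, 0.999))
```

Under these assumptions roughly 5000 positions could still be incorrect, which is far more than the handful of variants that typically separate isolates in a transmission investigation.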
We recommend short-read polishing because it provides relatively cheap, high-depth coverage across the genome and helps overcome the higher error rate in long-read sequence data, further reducing the number of potential errors in the final genome, a feature that is very important for the mapping-based analyses conducted by public health laboratories to investigate transmission. For consistency, we opted to short-read polish all final assemblies, including the hybrid assemblies generated with Unicycler, which undertakes a short-read correction step but utilizes a different tool from that used in our laboratory for mapping-based analysis.

Orientation; to start at dnaA or not
While it is a preference with minimal to no impact on downstream analysis, we support the practice of orientating contigs to start at genes encoding the replication machinery. This is straightforward for chromosomes, only requiring the identification of dnaA. It is more challenging for plasmids, which carry diverse and often multiple replication-associated genes, and identifying which gene is responsible for replication is not always simple. Of the assembly tools used, only Unicycler included a step to orientate contigs. Of the 21 final assemblies constructed by either the hybrid or long read-only approaches using Unicycler, the chromosome orientation was correct in 16, with the other 5 requiring adjustment (Table S2).
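Re-orientating a circular contig is a rotation of the sequence, with a reverse complement first when the start gene lies on the reverse strand. The following Python sketch illustrates the operation on a plain string. It is not the code used in this study, the function names are hypothetical, and it assumes the gene coordinate has already been obtained from an annotation.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate_circular(seq, gene_start, strand="+"):
    """Rotate a circular sequence so that it begins at a start gene.

    gene_start: 0-based index, in the coordinates of `seq`, of the first
    base of the gene in its own reading direction.
    strand: '+' if the gene is on the given strand, '-' otherwise.
    """
    if not 0 <= gene_start < len(seq):
        raise ValueError("gene_start is outside the sequence")
    if strand == "+":
        return seq[gene_start:] + seq[:gene_start]
    # Gene on the reverse strand: flip the contig, then rotate.
    flipped = reverse_complement(seq)
    new_start = len(seq) - 1 - gene_start
    return flipped[new_start:] + flipped[:new_start]
```

For a toy sequence `"GGATGCCC"` with the gene beginning at index 2, `rotate_circular("GGATGCCC", 2)` gives `"ATGCCCGG"`. The harder problem for plasmids, choosing which replication-associated gene to use, is not addressed by this sketch.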

Next steps: pipeline generation and open access
With long-read sequencing on a trajectory to become part of routine public health practice, the next required step is the development of a pipeline to automate the process of assembly. As indicated, such a pipeline would need to incorporate multiple tools and approaches, compare outputs for inconsistencies, handle data from different platforms and from diverse species, and preferably orientate final contigs to a suitable start replicon. As mentioned, TryCycler appears to be a promising step in this direction. Outside of this, such a pipeline should (i) be amenable to rapidly changing sequencing technologies, and (ii) be designed in a way that enables the development of genus- and/or species-specific workflows (e.g. through specifying organism-specific parameters through a config file that could be easily shared). In this scenario, we would envision such configuration parameters being tuned on large datasets, such as those housed at the NCBI Pathogen Portal, and shared with the wider community. Although not detailed above, assembly challenges were noted for specific species owing to genome structural complexity, and repetitive and/or recombinant regions. (iii) Most importantly, such a pipeline would need to be proven to reliably and reproducibly generate data that are consistent with both short-read sequence and phenotypic data to be accredited for use in a public health laboratory. Here again, we would advocate taking advantage of the large numbers of assembled genomes already on the NCBI Pathogen Portal, as well as the ATCC's high-quality reference genome collection (https://www.atcc.org/en/Documents/Marketing_Literature/Genome_Portal_Technical_Document.aspx).
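One way such a shareable, organism-specific configuration might look is sketched below in Python using a JSON file. Everything here is hypothetical: the key names, the file layout and the organism label are assumptions for illustration, and only the default values (a 100x target depth, and corrected error rates of 0.144 for Nanopore and 0.045 for PacBio) are taken from the Fig. 1 legend.

```python
import json

# Defaults drawn from the Fig. 1 legend; the key names are hypothetical.
DEFAULTS = {
    "target_depth": 100,
    "corrected_error_rate": {"nanopore": 0.144, "pacbio": 0.045},
}


def load_params(config_text, organism):
    """Merge organism-specific overrides from a JSON document onto the
    defaults. Organisms absent from the config receive the defaults.
    The merge is shallow: a nested value is replaced as a whole."""
    overrides = json.loads(config_text).get(organism, {})
    return {**DEFAULTS, **overrides}


# Placeholder config; the organism label and genome size are not real data.
example = '{"Example organism": {"genome_size_mb": 5}}'
params = load_params(example, "Example organism")
```

Keeping overrides in a plain-text file separate from the pipeline code would let laboratories exchange tuned parameters without modifying the pipeline itself.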
Another important step is open access for newly generated reference genomes. While there are a number of challenges to uploading public health data (including legal, ethical and computational considerations), where possible reference genomes and linked phenotypic and demographic data should be openly available to maximize the use of these public health resources. We have chosen to focus first on building our reference dataset with predominantly endemic clones. However, a clone that is endemic in one region may be an imported clone in another, and open access to reference genome collections will enable other laboratories to access high-quality genomes for their own public health activities.
To further improve this resource, additional genomic characteristics could be included, such as characterization of key mobile genetic elements or genomic regions that may interfere with reference-based analyses (e.g. integrated phage); information that may help public health laboratories with reference selection and use.

CONCLUSIONS
We report the establishment of a new public health resource; a collection of high-quality reference genomes and matched phenotypic data to support regional public health activities.
We will continue to add to this resource as part of a multijurisdictional collaboration. Additionally, we share our process for genome reconstruction, highlighting some of the challenges and considerations for accurate, reproducible and eventually automated assembly of high-quality complete microbial genomes of medical and public health relevance.

Funding information
This study was funded by a grant from the Department of Health of the Australian Federal Government to the Communicable Diseases Genomics Network (CDGN).