Insights from the revised complete genome sequences of Acinetobacter baumannii strains AB307-0294 and ACICU belonging to global clones 1 and 2

The Acinetobacter baumannii global clone 1 isolate AB307-0294, recovered in the USA in 1994, and the global clone 2 (GC2) isolate ACICU, isolated in 2005 in Italy, were among the first A. baumannii isolates to be completely sequenced. AB307-0294 is susceptible to most antibiotics and has been used in many genetic studies, and ACICU belongs to a rare GC2 lineage. The complete genome sequences, originally determined using 454 pyrosequencing technology, which is known to generate sequencing errors, were re-determined using Illumina MiSeq and MinION (Oxford Nanopore Technologies) technologies and a hybrid assembly generated using Unicycler. Comparison of the resulting new high-quality genomes to the earlier 454-sequenced versions identified a large number of nucleotide differences affecting protein coding sequence (CDS) features, and allowed the sequences of the long and highly repetitive bap and blp1 genes to be properly resolved for the first time in ACICU. Comparisons of the annotations of the original and revised genomes revealed a large number of differences in the protein CDS features, underlining the impact of sequence errors on protein sequence predictions and core gene determination. On average, 400 predicted CDSs were longer or shorter in the revised genomes and about 200 CDS features were no longer present.


INTRODUCTION
Acinetobacter baumannii is a Gram-negative bacterium that has emerged as an important opportunistic pathogen, and is a research priority because of its high levels of resistance to antibiotics [1][2][3], desiccation and heavy metals [4,5]. On a global scale, members of two clinically important clones, known as global clone 1 (GC1) and global clone 2 (GC2), have been responsible for the majority of outbreaks caused by multiply antibiotic-resistant A. baumannii strains [1][2][3][6][7][8]. Whole-genome sequencing technologies have revolutionized the study of bacterial pathogens, allowing the entire gene repertoire of bacterial strains to be determined and hence enabling the relationships between outbreak strains to be studied at unprecedented resolution [9]. However, such analyses are only as reliable as the genome sequences on which they are based.
The first 10 complete genomes of A. baumannii strains were reported between 2006 and 2012 (Table 1), and are still used as a baseline in many studies of this micro-organism [10][11][12]. Except for three strains (AYE, TCDC-AB0715 and TYTH-1), all of the early A. baumannii complete genomes were sequenced using 454 pyrosequencing technology, with PCR used to join the assembled contigs. Pyrosequencing is known to generate frequent systematic sequencing errors, especially errors in the length of homopolymeric runs [13], and these errors lead to erroneous protein coding sequence (CDS) prediction, often associated with fragmentation of genuine ORFs.
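The effect of a homopolymer length error on CDS prediction can be illustrated with a minimal sketch. The toy ORF below is invented for illustration and is not taken from either genome; it simply shows how dropping one base from a run of 'A' residues shifts the reading frame and introduces a premature stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(dna):
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical ORF containing a run of seven 'A' residues.
true_orf = "ATGGCAAAAAAACTGACCGGTTAA"
# Simulated pyrosequencing error: one 'A' of the run is not called.
miscalled_orf = true_orf[:5] + true_orf[6:]

print(translate_until_stop(true_orf))       # MAKKLTG
print(translate_until_stop(miscalled_orf))  # MAKN
```

In this toy case, the single missing base truncates the predicted protein from seven residues to four, which is the kind of ORF fragmentation described above.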
An additional problem in A. baumannii genomes determined using short-read sequence data followed by PCR gap closure arises from the many short internal repeats present in the very large bap gene (~8-25 kbp), which is hard to assemble accurately. This gene encodes the biofilm-associated protein Bap [14][15][16][17]. The bap gene was originally cloned from AB307-0294 (GC1), and found to be 25 863 bp with a complex configuration of internal repeats [15]. However, the size of the bap gene from a GC2 isolate was estimated at approximately 16 kbp [16]. In another study, the length of Bap proteins predicted from A. baumannii genomes available in GenBank appeared to be highly variable, mainly due to different numbers of copies of the various repeated segments and the ORF was often fragmented [17]. The blp1 gene, which is 9-10 kbp, encodes a further very large protein that also has internal repeats and is associated with biofilm formation [17].
Newer sequencing technologies such as PacBio (Pacific Biosciences) and MinION (Oxford Nanopore Technologies; ONT) can generate much longer sequencing reads [9], allowing gaps to be spanned. MinION-only assemblies are also prone to errors [18], but can be combined with high-accuracy Illumina short-read data to produce very-high-quality finished genome assemblies [19]. Long-read sequence data have enabled a re-assessment of early completed A. baumannii genomes, including several of the first 10 to be sequenced (Table 1). For example, in 2016, ATCC 17978 was re-sequenced using PacBio. This revealed the presence of a 148 kb conjugative plasmid, pAB3, fragments of which were erroneously merged into the chromosome in the original 454-based assembly [20]. This plasmid sequence brought together the parts of GIsul2, fragmented pieces of which had been randomly distributed in the chromosome in the original sequence [21]. In 2017, we revised the 454-based genome sequence of the GC1 strain AB0057 using Illumina HiSeq technology, and found that hundreds of single base additions or deletions changed >200 protein CDS features [22]. An additional copy of the oxa23 carbapenem-resistance gene, located in Tn2006, was also found in the revised sequence of the chromosome (GenBank accession no. CP001182.2) [22,23].
A recent revision of the 454-based genome of the GC2 strain MDR-ZJ06 using PacBio sequencing led to the correction of hundreds of CDS features and allowed reassessment of the localization of important antimicrobial-resistance regions [24]. The position of transposon Tn2009, which carries the oxa23 gene, was revised; and a region originally reported as a plasmid, but that had been predicted to be a chromosomally located AbGRI3-type resistance island [25], was incorporated into the chromosome (CP001937.2) [24]. In the revised genome, the two arrays of gene cassettes carrying antibiotic-resistance genes in class 1 integrons are now in the correct resistance islands. These revisions exemplify the challenges encountered when relying solely on short-read data to assemble bacterial genomes, and highlight the extent and impact of pyrosequencing errors particularly on CDS predictions.
Two further A. baumannii strains for which only early 454-based genome sequences are available are the largely antibiotic-susceptible isolate AB307-0294, recovered from the blood of a patient hospitalized in Buffalo, NY, USA, in 1994 [26], and the extensively antibiotic-resistant isolate ACICU recovered in 2005 from the cerebrospinal fluid of a patient in San Giovanni Addolorata Hospital in Rome, Italy (GenBank accession no. CP000863) [27]. AB307-0294 was one of the first GC1 strains to be completely sequenced [26] and has been extensively used in genetic studies [28][29][30][31][32]. It belongs to CC1 (clonal complex 1) [sequence type 1 (ST1)] in the Institut Pasteur multilocus sequence typing (MLST) scheme and to ST231 in the Oxford MLST scheme, and carries the KL1 capsule genes and OCL1 at the outer core locus [33].

Impact Statement
The genomes of the first 10 Acinetobacter baumannii strains to be completely sequenced underpin a large amount of published genetic and genomic analysis. However, most of their genome sequences contain substantial numbers of errors as they were sequenced using 454 pyrosequencing, which is known to generate errors particularly in homopolymer regions; and employed manual PCR and capillary sequencing steps to bridge contig gaps and repetitive regions in order to finish the genomes. Assembly of the very large and internally repetitive genes for the biofilm-associated proteins Bap and Blp1 was a recurring problem. As these strains continue to be used for genetic studies and their genomes continue to be used as references in phylogenomics studies, including core gene determination, there is value in improving the quality of their genome sequences. To this end, we re-sequenced two such strains that belong to the two major globally distributed clones of A. baumannii, using a combination of highly accurate short-read and gap-spanning long-read technologies. Annotation of the revised genome sequences eliminated hundreds of incorrect coding sequence (CDS) feature annotations and corrected hundreds more. Given that these revisions affected hundreds of non-existent or incorrect CDS features currently cluttering GenBank protein databases, it can be envisaged that similar revision of other early bacterial genomes that were sequenced using error-prone technologies will affect thousands of CDSs currently listed in GenBank and other databases. These corrections will impact the quality of predicted protein sequence data stored in public databases. The revised genomes will also improve the accuracy of future genetic and comparative genomic analyses incorporating these clinically important strains.
Compared to other GC1 strains characterized to date, AB307-0294 is relatively susceptible to antibiotics [26], exhibiting resistance only to chloramphenicol (intrinsic) and nalidixic acid (acquired). It contains no plasmids.
ACICU was the first GC2 isolate to be sequenced [27]. It belongs to ST2 in the Institut Pasteur MLST scheme, ST437 in the Oxford MLST scheme, and carries the KL2 capsule genes and OCL1 at the outer core locus [33]. ACICU is carbapenem resistant and also resistant to multiple antibiotics, including third-generation cephalosporins, sulfonamides, tetracycline, amikacin, kanamycin, netilmicin and ciprofloxacin [27]. It contains two plasmids [27]. However, we previously showed that the larger plasmid, pACICU2, which was reported to include no resistance genes, is larger than reported and contains the amikacin-resistance gene aphA6 in transposon TnaphA6. The central segment of TnaphA6, including the aphA6 gene and one of the ISAba125 copies, and a 4.7 kb backbone segment were missing in the original 454-based whole-genome sequence [34].
Here, we report revised complete genome sequences for A. baumannii strains AB307-0294 (GC1) and ACICU (GC2), generated using MiSeq (Illumina) and MinION (ONT) sequence data. The new genome sequences correct hundreds of erroneous protein CDS features caused by single nucleotide differences (SNDs) and small insertions/deletions of mainly 1-3 bases in the earlier 454 genome sequences.

Whole-genome sequencing, assembly and annotation
Whole-cell DNA was isolated and purified using a protocol described elsewhere [1,35]. Libraries were prepared from whole-cell DNA isolated from AB307-0294 and ACICU, and were sequenced using Illumina MiSeq and ONT MinION. Paired-end reads of 150 bp and MinION reads of up to 20 kb were used to assemble each genome using Unicycler software (v0.4.0) [19] with default parameters.
Protein CDS, rRNA and tRNA genes were annotated using the automatic annotation program Prokka v1.13 [36]. Regions containing antibiotic-resistance genes and the polysaccharide biosynthesis loci, biofilm-associated proteins and genes used in the MLST schemes were annotated manually.
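The assembly and annotation steps described above can be sketched as a pair of command lines. This is a minimal illustration rather than the exact commands used in the study: all file and directory names are hypothetical, and only the basic Unicycler options for paired short reads (-1, -2), long reads (-l) and output directory (-o), and the Prokka --outdir and --prefix options, are shown, with everything else left at default settings.

```python
import shlex


def unicycler_command(short_1, short_2, long_reads, out_dir):
    """Build a Unicycler hybrid-assembly command (default parameters)."""
    return ["unicycler", "-1", short_1, "-2", short_2,
            "-l", long_reads, "-o", out_dir]


def prokka_command(assembly_fasta, out_dir, prefix):
    """Build a Prokka annotation command for a finished assembly."""
    return ["prokka", "--outdir", out_dir, "--prefix", prefix, assembly_fasta]


# Hypothetical input and output names, for illustration only.
assemble = unicycler_command(
    "ACICU_R1.fastq.gz", "ACICU_R2.fastq.gz",
    "ACICU_minion.fastq.gz", "ACICU_unicycler",
)
annotate = prokka_command(
    "ACICU_unicycler/assembly.fasta", "ACICU_prokka", "ACICU",
)

# Print the commands; they could be run with subprocess.run(cmd, check=True).
print(shlex.join(assemble))
print(shlex.join(annotate))
```

Building the commands as argument lists keeps the sketch runnable without either tool installed, and avoids shell quoting problems if the lists are later passed to subprocess.run.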
To compare previous CDS (≥25 aa CDS features) annotations with our new results, we wrote a script (github.com/rrwick/Compare-annotations) to quantify the differences. This script classifies CDSs in the annotations as either exact matches, inexact matches, only present in the first annotation or only present in the second annotation. We also used the Ideel pipeline of Dr Mick Watson (github.com/mw55309/ideel) to assess the completeness of CDSs annotated in each genome, by comparing the length of each CDS to that of its longest blast hit in the UniProt database (as described in http://www.opiniomics.org/a-simple-test-for-uncorrected-insertions-and-deletions-indels-in-bacterial-genomes/).
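The four-way classification described above can be sketched as follows. This is not the Compare-annotations script itself, whose matching logic is not described here. It is a simplified illustration that assumes each annotation is a dictionary of locus identifiers to protein sequences, treats identical proteins as exact matches, and uses a hypothetical similarity cutoff (min_similarity) from Python's difflib for inexact matches. The locus identifiers and sequences are invented.

```python
from difflib import SequenceMatcher


def classify_cds(first, second, min_similarity=0.5):
    """Classify CDSs from two annotations (dicts of locus_id -> protein)."""
    result = {"exact": [], "inexact": [], "only_first": [], "only_second": []}
    unmatched_second = dict(second)
    leftovers = {}

    # Pass 1: identical protein sequences are exact matches.
    for id1, prot1 in first.items():
        match = next(
            (id2 for id2, prot2 in unmatched_second.items() if prot2 == prot1),
            None,
        )
        if match is None:
            leftovers[id1] = prot1
        else:
            result["exact"].append((id1, match))
            del unmatched_second[match]

    # Pass 2: pair each remaining CDS with its most similar unmatched partner.
    for id1, prot1 in leftovers.items():
        best_id, best_score = None, 0.0
        for id2, prot2 in unmatched_second.items():
            score = SequenceMatcher(None, prot1, prot2, autojunk=False).ratio()
            if score > best_score:
                best_id, best_score = id2, score
        if best_id is not None and best_score >= min_similarity:
            result["inexact"].append((id1, best_id))
            del unmatched_second[best_id]
        else:
            result["only_first"].append(id1)

    result["only_second"] = list(unmatched_second)
    return result


# Toy annotations: old_2 is a truncated version of new_2.
original = {"old_1": "MKTAYIAKQR", "old_2": "MSLNDE", "old_3": "MPPGW"}
revised = {"new_1": "MKTAYIAKQR", "new_2": "MSLNDEQLVK", "new_4": "MHHHHHHA"}
classes = classify_cds(original, revised)
print(classes)
```

In the toy data, the truncated old_2 pairs with the longer new_2 as an inexact match, mirroring the CDSs that became longer or shorter after revision, while old_3 and new_4 have no counterpart in the other annotation.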

Revised genome of ACICU
ACICU, the first GC2 strain to be completely sequenced, contains AbaR2 in the chromosomal comM gene [27]. As this AbaR resistance island type is more usually found in this location in GC1 strains [37], whereas GC2 isolates usually carry an AbGRI1-type island there [38], ACICU may represent a rare GC2 lineage. Here, the ACICU genome was re-sequenced using a combination of Illumina (MiSeq, 58× depth) and ONT […] Table 1). Most of the additional length in the revised chromosome was found to be due to an 11.2 kbp longer bap gene (Table 2), which is just over 11 kbp and split into nine smaller ORFs in the original sequence (locus_ids ACICU_02938 to ACICU_02946), as noted previously [17]. In the revised genome sequence, the bap gene is 22.2 kbp (locus_id DMO12_08904), the difference being mainly due to a large number of short repeated sequences that were previously missing. Hence, some of the variation in the length of bap reported elsewhere [17] may be due to sequencing and assembly issues rather than genuine length variation in the A. baumannii population. The blp1 gene is 9510 bp in the original sequence (locus_id ACICU_02910) and 9813 bp in the revised genome (locus_id DMO12_08811) (Table 2). […] Fig. 1(a), highlighting a substantial population of CDSs annotated in the old assembly that have lengths well below those of homologous proteins in UniProt.
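The length-ratio test behind Fig. 1 can be sketched in a few lines. This is a simplified illustration and not the Ideel pipeline itself: it assumes that the length of each CDS and of its best database hit are already available as pairs of numbers, the values below are invented, and the 0.9 cutoff used to flag a CDS as truncated is a hypothetical choice.

```python
def length_ratios(pairs):
    """Return CDS length / hit length for each (cds_len, hit_len) pair."""
    return [cds_len / hit_len for cds_len, hit_len in pairs if hit_len > 0]


def fraction_truncated(ratios, cutoff=0.9):
    """Fraction of CDSs that are clearly shorter than their best hit."""
    if not ratios:
        return 0.0
    return sum(ratio < cutoff for ratio in ratios) / len(ratios)


# Toy (CDS length, best-hit length) pairs in amino acids.
old_assembly = [(300, 300), (150, 300), (298, 300), (120, 400)]
new_assembly = [(300, 300), (300, 300), (298, 300), (400, 400)]

print(fraction_truncated(length_ratios(old_assembly)))  # 0.5
print(fraction_truncated(length_ratios(new_assembly)))  # 0.0
```

An assembly with few indel errors should give ratios clustered tightly around 1.0, whereas premature stop codons introduced by indels push ratios well below 1.0, as in the first toy data set.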
ACICU carries two plasmids (Table 1), pACICU1 and pACICU2 [27], which encode the RepAci1 and RepAci6 replication initiation proteins [39]. The original pACICU1 sequence (GenBank accession no. CP000864) is 28 279 bp long and contains two copies of the carbapenem-resistance gene oxa58, while the revised pACICU1 (GenBank accession no. CP031381) is 24 268 bp long and includes only a single oxa58 copy. The revised sequence lacks the region between the two IS26 copies, together with one of the IS26 copies, present in the original sequence. The IS26-mediated duplication may have been generated during growth in selective media. The original and revised pACICU1 sequences also differed by three SNDs, six single bp insertions, and one single bp and two 2 bp deletions. We previously used a PCR mapping strategy [34] to show that the aphA6 gene and an additional ISAba125, as well as a 4.7 kb long backbone segment, located between two copies of a ~420 bp repeated segment, are missing from the original sequence of pACICU2, the larger plasmid of ACICU [34]. Here, the long-read sequences generated for pACICU2 (GenBank accession no. CP031382) confirmed this. The revised plasmid sequence differs by six SNDs from pAb-G7-2 (GenBank accession no. KF669606.1), a previously reported conjugative plasmid from a GC1 isolate recovered in Australia in 2003 [40].
Fig. 1. […] 1 (revised). The x-axis shows the ratio of CDS length to the length of the closest hit in the UniProt TrEMBL database. The y-axis shows gene frequency and is truncated at 100 (the centre bar extends to ~3000 genes). A tight distribution around 1.0 indicates that the assembly's CDSs match known proteins, supporting few indel errors in the assembly. A left-skewed distribution is characteristic of an assembly with indel errors that lead to premature stop codons.

Revised genome of AB307-0294
The AB307-0294 genome was also sequenced using a combination of Illumina (MiSeq, 63× depth) and ONT (MinION, 120× depth) technologies. The hybrid assembly resulted in a single 3 759 495 bp chromosome (GenBank accession no. CP001172.2) compared with 3 760 981 bp in the original genome (GenBank accession no. CP001172.1), making the revised genome 1486 bp shorter (Table 1). As with AB0057, the majority of differences were found to be additions or deletions of 1-3 bases, usually of 'A' or 'T' in homopolymeric runs of these nucleotides. […] (Fig. 1b).
The bap gene was 25 863 bp (locus_id ABBFA_00771), the same length as reported originally [15] but 1067 bp shorter than the 26 930 bp bap gene in the original genome sequence where it is split into two ORFs (locus_id ABBFA_000776 and locus_id ABBFA_000777). The revised genome was found to contain a 10 089 bp blp1 gene (ABBFA_00802), only 18 bp longer than that in the original sequence (Table 2). Interestingly, both the original and revised genomes appear to be devoid of any insertion sequences.

Revised genomes affect many predicted protein sequences
To date, six early A. baumannii genome sequences, including AB307-0294 and ACICU reported here, have been corrected, and the revisions have corrected ~600 CDS features per genome on average [20,22] […] Fig. 1c). Hence, the original assembly was substantially flawed and should not be used in future. However, although the original study reported that ATCC 17978 contains two cryptic plasmids, pAB1 (13 kb; GenBank accession no. CP000522.1) and pAB2 (11 kb; GenBank accession no. CP000523.1) [41], the revised genome does not include either of these plasmids. This may be due to an assembly parameter that filters out small contigs, which would remove pAB1 and pAB2 from the final assembly.
Given the large effects of long-read data on the length of bap and blp1 observed in ACICU, the sizes of these genes in the original and revised genomes of the remainder of the first set of 10 sequenced A. baumannii strains (Table 1) were compared, and significant differences were observed only where long-read data were used in the revision. In the GC2 strain MDR-ZJ06 (GenBank accession no. CP001937), blp1 (locus tag ABZJ_03096) is 9812 bp in the revised genome (CP001937.2) versus 9134 bp in the original sequence (locus tag ABZJ_03096; Table 2). Further, bap, which is 7946 bp in the revised genome (locus_id ABZJ_03955), was split into three ORFs, ranging in size from 2 to 2.5 kb, in the original sequence. In ATCC 17978, the blp1 gene is not present in either the original or the revised genome. However, the bap gene, which was split into two ORFs (locus_id A1S_2696, 6306 bp; and A1S_2724, 1161 bp) separated by 41 kbp in the original sequence, is now a single ORF (locus_id ACX60_04030; 6225 bp) in the revised genome and is 842 bp shorter than in the original genome (Table 2).

Conclusions
The revised genome sequences of AB307-0294 and ACICU will underpin more accurate studies of the genetics and genomic evolution of related A. baumannii strains belonging to GC1 and GC2. This work highlights the need to review and revise early bacterial genomes sequenced using short-read data and assembled with (or sometimes without) PCR to join contigs. Special attention needs to focus on the genomes determined using the 454 pyrosequencing technology in order to correct predicted protein sequences.
Long-read data, such as those generated by PacBio and ONT (MinION) technologies, allow for complete genome assembly without manual intervention. While assembling long-read data alone can result in sequence errors and failure to detect small plasmids, hybrid assembly (using both short and long reads) can produce assemblies that are both complete and accurate. However, repetitive sequences in the genome, such as the genes encoding Bap and Blp1, are difficult to perfect even with hybrid assembly, so variations in these regions should be interpreted with caution.
Finally, as the original GenBank entries are replaced by revised genomes, there is a need to eliminate non-existent and incorrect predicted protein sequences in order to simplify the already complex task of protein sequence searches. This problem is unlikely to be limited to A. baumannii genomes, as many bacterial species have been sequenced using 454 pyrosequencing technology.