Development of a novel streamlined workflow (AACRE) and database (inCREDBle) for genomic analysis of carbapenem-resistant Enterobacterales

In response to the threat of increasing antimicrobial resistance, we must increase the amount of available high-quality genomic data gathered on antibiotic-resistant bacteria. To this end, we developed an integrated pipeline for high-throughput long-read sequencing, assembly, annotation and analysis of bacterial isolates and used it to generate a large genomic data set of carbapenemase-producing Enterobacterales (CPE) isolates collected in Spain. The set of 461 isolates was sequenced with a combination of Illumina and Oxford Nanopore Technologies (ONT) DNA sequencing technologies in order to provide genomic context for chromosomal loci and, most importantly, structural resolution of plasmids, which are important determinants of the transmission of antimicrobial resistance. We developed an informatics pipeline called Assembly and Annotation of Carbapenem-Resistant Enterobacteriaceae (AACRE) for the full assembly and annotation of the bacterial genomes and their complement of plasmids. To explore the resulting genomic data set, we developed a new database called inCREDBle that not only stores the genomic data but also provides unique ways to filter and compare data, enabling comparative genomic analyses at the level of chromosomes, plasmids and individual genes. We identified a new sequence type, ST5000, and discovered a genomic locus unique to ST15 that may be linked to its increased spread in the population. In addition to our major objective of generating a large regional data set, we took the opportunity to compare the effects of sample quality and sequencing methods, including R9 versus R10 nanopore chemistry, on genome assembly and annotation quality. We conclude that converting short-read and hybrid microbial sequencing and assembly workflows to the latest nanopore chemistry will further reduce processing time and cost, truly enabling the routine monitoring of resistance transmission patterns at the resolution of complete chromosomes and plasmids.

Figure S2. AACRE pipeline performance. The distribution of running times by number of threads used for assembly of each sample is shown.

Figure S3. The distributions of Illumina versus ONT R9 genome coverage.

Figure S4. Distribution of scaffold lengths of SPAdes short-read assemblies and Unicycler hybrid assemblies. The comparison was done using scaffolds longer than 292 bp to discard the chaff from both assembly sets (CRE, this paper, and PRJEB10018).

Figure S5. Annotation comparison between Unicycler hybrid and SPAdes Illumina assemblies. Histograms show the difference in the number of annotated genes between the Illumina-only and hybrid assemblies of each sample (A) and the difference in the number of annotated resistance genes (RGI) between the Illumina-only and hybrid assemblies of each sample (B). Boxplots of the RGI gene completeness ratios for Illumina-only and hybrid assemblies are shown in (C).

Figure S6. Effect of lower ONT coverage on samples that assembled well at high coverage. We downsampled the ONT read data for 195 well-assembled samples (circularity ratio = 1) to 10x-100x in 10x increments while maintaining the original Illumina coverage, and then assembled them in bold mode with the AACRE pipeline. Here we show (A) the circularity ratio versus ONT coverage obtained with R9 (Spearman's rank correlation: rho = 0.282, p-value < 2.2e-16) and (B) the percentage of fully circularized assemblies for each R9 coverage slice (Spearman's rank correlation: rho = 0.9636364, p-value < 2.2e-16).
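The downsampling series described for Figure S6 can be illustrated with a minimal sketch. The caption does not state which tool or sampling strategy was used, so the function below only shows the arithmetic under two assumptions: reads are subsampled uniformly at random, and the genome length is taken as the median assembly length of 5,640,631 bp reported for Table S1. The names `coverage_slices` and `downsample_fraction` are hypothetical.

```python
# Assumption: genome length approximated by the median assembly length (Table S1).
GENOME_LENGTH_BP = 5_640_631


def coverage_slices(start: int = 10, stop: int = 100, step: int = 10) -> list[int]:
    """Target coverages from 10x to 100x in 10x increments, as in the caption."""
    return list(range(start, stop + 1, step))


def downsample_fraction(total_bases: int, target_coverage: float,
                        genome_length_bp: int = GENOME_LENGTH_BP) -> float:
    """Fraction of sequenced bases to keep so that depth is about target_coverage.

    Assumes uniform random subsampling; the fraction is capped at 1.0 when the
    sample already has less data than the target requires.
    """
    if total_bases <= 0 or genome_length_bp <= 0:
        raise ValueError("total_bases and genome_length_bp must be positive")
    target_bases = target_coverage * genome_length_bp
    return min(1.0, target_bases / total_bases)


if __name__ == "__main__":
    # Hypothetical sample sequenced to exactly 200x of the assumed genome length.
    total = 200 * GENOME_LENGTH_BP
    for cov in coverage_slices():
        print(f"{cov}x -> keep {downsample_fraction(total, cov):.3f} of the bases")
```

The resulting fraction could be handed to whichever read-subsampling tool is in use; that choice is left open here because the source does not name one.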

Figure S7. Circularity ratio distribution by ONT chemistry. These 76 samples had been initially assembled and annotated with the AACRE pipeline using the ONT reads obtained with the R9.4.1 chemistry. We then ran the AACRE pipeline combining the Illumina reads with either just R10.3 read data or both R9.4.1 and R10.3 data.

Figure S8. ONT scaffold circularization by ONT coverage. To compare the effect of coverage between the two chemistries on hybrid assembly, we downsampled both the R9 and R10 read data to equivalent coverages from 100x down to 10x. Of the 76 samples, 33 had at least 100x coverage of both R9 and R10. Note that this dataset contained samples that were more problematic and difficult to assemble. Shown are the number of circular scaffolds (A), the total number of scaffolds (B), and the circularity ratio (C) versus ONT coverage obtained with both R10 and R9 (N=33). Median values (dark bars) and quartiles (white boxes) are shown.

Figure S11. Comparison of consensus quality for different ONT chemistries. Shown are QV value distributions determined by MUMmer alignment of long-read-only assemblies (Flye plus Racon and Medaka polishing) of either ONT R9 reads or R10 reads to the Unicycler hybrid assembly (R9 plus Illumina paired-end reads). Error rates were calculated as the total number of single nucleotide and indel differences divided by the number of aligned bases in 1-to-1 alignments as reported by dnadiff. The error rates were converted to QV values using the Phred scale. QV40 = error rate of 1 in 10,000. QV30 = 1/1000.
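The error-rate-to-QV conversion described for Figure S11 follows the standard Phred relation QV = -10 log10(error rate). A minimal sketch is given below; the function names are hypothetical, and the three counts are assumed to have been read from a dnadiff report beforehand (parsing that report is not shown, and whether indels are counted as events or as bases is not specified in the caption).

```python
import math


def error_rate(snps: int, indels: int, aligned_bases: int) -> float:
    """Differences per aligned base: (SNPs + indels) / aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return (snps + indels) / aligned_bases


def phred_qv(rate: float) -> float:
    """Convert an error rate to a Phred-scaled quality value."""
    if rate < 0:
        raise ValueError("rate must be non-negative")
    if rate == 0:
        return math.inf  # no observed differences
    return -10.0 * math.log10(rate)


if __name__ == "__main__":
    # Hypothetical counts: 400 SNPs and 100 indels over 5,000,000 aligned bases.
    rate = error_rate(snps=400, indels=100, aligned_bases=5_000_000)
    print(f"error rate = {rate:.1e}, QV = {phred_qv(rate):.1f}")
```

With these definitions an error rate of 1 in 10,000 maps to QV40 and 1 in 1,000 to QV30, matching the reference points in the caption.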

Figure S12. Comparison of gene annotations in long-read assemblies and hybrid assemblies. (A) The distribution of differences in annotated gene number between the R10 and R9 long-read Flye assemblies (33 samples each). A negative value on the x-axis indicates that the R9 assembly had more genes annotated than the R10 assembly of the same sample. (B) The same but comparing the number of genes in hybrid assemblies versus their R9 long-read counterpart. (C) Hybrid versus R10. (D) Boxplots showing the difference in gene lengths among R9, R10 and hybrid assemblies. (E) The difference in the number of RGI gene annotations between hybrid and R9 assemblies. Note that a positive value indicates fewer RGI genes were found in the R9 Flye assembly. (F) The difference in the number of RGI gene annotations between hybrid and R10 assemblies.

Table S1. Assembly and sequencing statistics. The median assembly length of 5,640,631 bp was used to standardize the coverage estimate across samples regardless of the assembly efficiency.
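The standardized coverage estimate mentioned for Table S1 can be sketched as total sequenced bases divided by a fixed reference length, so that a fragmented or incomplete assembly does not inflate or deflate a sample's apparent depth. The function name below is hypothetical, and the simple ratio is an assumption about how the estimate was computed; only the 5,640,631 bp constant comes from the text.

```python
# Median assembly length reported in the Table S1 caption.
MEDIAN_ASSEMBLY_LENGTH_BP = 5_640_631


def standardized_coverage(total_sequenced_bases: int,
                          reference_length_bp: int = MEDIAN_ASSEMBLY_LENGTH_BP) -> float:
    """Estimated depth of coverage against a fixed reference length.

    Using the same denominator for every sample keeps estimates comparable
    even when individual assemblies differ in length or completeness.
    """
    if reference_length_bp <= 0:
        raise ValueError("reference_length_bp must be positive")
    if total_sequenced_bases < 0:
        raise ValueError("total_sequenced_bases must be non-negative")
    return total_sequenced_bases / reference_length_bp


if __name__ == "__main__":
    # Hypothetical sample with 564,063,100 sequenced bases.
    print(f"{standardized_coverage(564_063_100):.1f}x")
```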

Table S2. Summary of scaffold lengths per assembly set.

Table S3. Assembly and annotation metrics of SPAdes Illumina assemblies versus Unicycler hybrid assemblies for the 461 samples included in this study. Mean and standard deviation are shown for each metric.

Table S4. List of plasmids identified inside bacterial chromosomes.

Table S5. Species diversity of 71 samples for which both ONT R9 and R10 sequencing was performed.