Discordance between different bioinformatic methods for identifying resistance genes from short-read genomic data, with a focus on Escherichia coli

Several bioinformatics genotyping algorithms are now commonly used to characterize antimicrobial resistance (AMR) gene profiles in whole-genome sequencing (WGS) data, with a view to understanding AMR epidemiology and developing resistance prediction workflows using WGS in clinical settings. Accurately evaluating AMR in Enterobacterales, particularly Escherichia coli , is of major importance, because this is a common pathogen. However, robust comparisons of different genotyping approaches on relevant simulated and large real-life WGS datasets are lacking. Here, we used both simulated datasets and a large set of real E. coli WGS data (n=1818 isolates) to systematically investigate genotyping methods in greater detail. Simulated constructs and real sequences were processed using four different bioinformatic programs (ABRicate, ARIBA, KmerResistance and SRST2, run with the ResFinder database) and their outputs compared. For simulation tests where 3079 AMR gene variants were inserted into random sequence constructs, KmerResistance was correct for 3076 (99.9 %) simulations, ABRicate for 3054 (99.2 %), ARIBA for 2783 (90.4 %) and SRST2 for 2108 (68.5 %). For simulation tests where two closely related gene variants were inserted into random sequence constructs, KmerResistance identified the correct alleles in 35 338/46 318 (76.3 %) simulations, ABRicate identified them in 11 842/46 318 (25.6 %) simulations, ARIBA identified them in 1679/46 318 (3.6 %) simulations and SRST2 identified them in 2000/46 318 (4.3 %) simulations. In real data, across all methods, 1392/1818 (76 %) isolates had discrepant allele calls for at least 1 gene. In addition to highlighting areas for improvement in challenging scenarios, (e.g. identification of AMR genes at <10× coverage, identifying multiple closely related AMR genes present in the same sample), our evaluations identified some more systematic errors that could be readily soluble, such as repeated misclassification (i.e. naming) of genes as shorter variants of the same gene present within the reference resistance gene database. Such naming errors accounted for at least 2530/4321 (59 %) of the discrepancies seen in real data. Moreover, many of the remaining discrepancies were likely ‘artefactual’, with reporting of cut-off differences accounting for at least 1430/4321 (33 %) discrepants. Whilst we found that comparing outputs generated by running multiple algorithms on the same dataset could identify and resolve these algorithmic artefacts, the results of our evaluations emphasize the need for developing new and more robust genotyping algorithms to further improve accuracy and performance.

Simulated data: ART parameters For all simulations, any non-depth parameters required for running ART were estimated from metrics derived from sequencing data from 100 randomly selected samples from the 1,818 real world E. coli sequences evaluated as part of this study.

Simulated data: Multiple allele identification
The choice of 31-mers to identify "similar" AMR gene alleles for each construct was based on an evaluation of several different k-mer lengths (11, 17, 21 and 31).Jaccard's similarity as estimated using 31mers was highly correlated with that estimated from the other k-mer sizes tested (Fig. S3).

Supplementary Tables and Figures
Table S1: Bioinformatic AMR genotyping tools considered for the performance evaluation.Green -tools evaluated, orange -tools not evaluated, with reasons in the "Usable March 2018" column (when the study was originally designed).Figure S3.Association between Jaccard index derived from 31-mers from AMR gene variants vs 11mers (top panel), 17-mers (middle panel) and 21-mers (bottom panel).Note that only AMR gene alleles that were correctly identified by all four bioinformatic methods were included in the multiple allele simulations.

Figure S1 .
Figure S1.Schematic of four different bioinformatics algorithms evaluated.

Figure S2 .
Figure S2.Distribution of coverage depth for beta-lactam AMR genes in 1,818 real E. coli isolate sequences.

Figure S4 .
Figure S4.Correlation of 31-mer-based Jaccard similarity index for ResFinder AMR gene allele pairs and percentage identity derived using the Needleman-Wunsch algorithm.

Figure S5 .
Figure S5.Distribution of Jaccard similarity indices for ResFinder AMR gene variant sequence pairs characterized as similar based on 31-mer sharing (see Methods).Note that only AMR gene alleles that were correctly identified by all four bioinformatic methods were included in the multiple allele simulations.

Figure S6 .
Figure S6.MIC distributions for common antimicrobials for isolates evaluated in this study.Values censored at upper and lower limits.

Figure S7a .
Figure S7a.Genotyping calls for multiple allele simulations by sequence similarity of the pairs (ABRicate) Note that only AMR gene alleles that were correctly identified by all four bioinformatic methods were included in the multiple allele simulations.

Figure S7b .
Figure S7b.Genotyping calls for multiple allele simulations by sequence similarity of the pairs (all programs, Jaccard's similarity) Note that only AMR gene alleles that were correctly identified by all four bioinformatic methods were included in the multiple allele simulations.

Figure S8 .
Figure S8.SRST2 cluster size for common beta-lactamase gene families by CD-hit nucleotide percentage identity cluster threshold.NB -x-axis is on a log scale.

Table S2 .
Summary of antimicrobial susceptibility phenotypes across the datasets for commonly evaluated antimicrobials."Non-susceptible" denotes isolates in either "I" or "R" categories.

Table S3 .
Number of isolates containing each of the commonest non-ubiquitous resistance genes (at gene level as opposed to allelic level) identified by each method.Percentage values indicate the proportion of each dataset (UKHSA N=337 isolates, APHA N=497 isolates, Oxford N=984 isolates) Figure S9.Allele assignation discrepancies by method for beta-lactam resistance genes Ab = ABRicate, Ar = ARIBA, Km = KmerResistance and Sr = SRST2.Discrepancy pattern indicated by left panel with white indicating gene found by this method, grey indicating not found.E.g. the first line represents 23 isolates which KmerResistance only identified as containing partial/low coverage genes.