Reconciling the Potentially Irreconcilable? Genotypic and Phenotypic Amoxicillin-Clavulanate Resistance in Escherichia coli

Resistance to amoxicillin-clavulanate, a widely used beta-lactam/beta-lactamase inhibitor combination antibiotic, is rising globally, and yet susceptibility testing remains challenging. To test whether whole-genome sequencing (WGS) could provide a more reliable assessment of susceptibility than traditional methods, we predicted resistance from WGS for 976 Escherichia coli bloodstream infection isolates from Oxfordshire, United Kingdom, comparing against phenotypes from the BD Phoenix (calibrated against EUCAST guidelines).

Information on heterozygous hits and disruption ARIBA assemblies were not used in any resistance prediction outside of porin genes (so as not to require complete assemblies/consensus sequences for genes found, a known issue when looking for resistance features in gram negative organisms [7]). However, these were fully investigated on discrepancy checking, including analysing the resistance profile of all predicted heterozygous alleles present in each isolate.
To identify potential blaTEM promoter sequences associated with increased blaTEM expression, we searched each isolate's sequence data for sequences similar to the Lartigue blaTEM P3 promoter. [8] Sequences found were then compared to each of the Lartigue promoter types, and the closest match was selected. Resistance was inferred if the closest match was any non P3 promoter.

Extended algorithm: ampC promoter mutations.
Similarly to the blaTEM promoter, we searched each isolate's sequence data for sequences similar to the ampC promoter present in ATCC 25922. Sequences found were inspected for variants known to increase chromosomal ampC expression in E. coli. [9] Isolates with any of these variants were predicted resistant by the extended algorithm. In addition, as ampC and its promoter should be universally present among E. coli, we noted where we were unable to find ampC promoter sequences for discrepancy analysis.

Extended algorithm: DNA copy number
Generating the metric For all transmissible genes, we generated a DNA copy number metric from ARIBA output, defined as where gene coverage was defined as the coverage of the longest contig in the ARIBA assembly. For example, a DNA copy number of 1 suggests the gene is present in the same quantities as the MLST genes (i.e. 1 per cell) and a DNA copy number greater than 1 suggests the gene is present in higher quantities than the MLST genes (i.e. > 1 per cell).

Reliability of the DNA copy number metric
Measurement error in the copy number was estimated to be small (standard deviation = 0.17 (95% CI 0.14, 0.21)) based on copy numbers estimated for 46 elements from 9 isolates sequenced in duplicate as part of quality control. Therefore the absolute estimated DNA copy number was used in subsequent analyses.

Choice of DNA copy number threshold for the extended resistance prediction
This cut-off was chosen based on a Receiver Operating Characteristic (ROC) analysis of all 328 isolates containing a single beta-lactamase as the only potential cause of resistance (Supplementary Figure 3A). Maximal Youden index, maximal Liu index and minimal distance to (0,1) all selected cut-offs between 2.3 and 2.4 (bootstrap 95% confidence intervals contained within 1.7-3.0). We rounded to the nearest half number for easier interpretation, defining 2.5 as the threshold. The fact that the threshold is not an integer suggests the gene may be present in varying copies in different cells (e.g. some cells containing 2, and others containing 3).

Extended algorithm: porin loss of function
To investigate potential loss of function of porin mutations, we searched each isolate's sequence data for sequences similar to reference ompC and ompF sequences (RefSeq: NC_000913.3). Given these sequences should be ubiquitous in E. coli, we defined being unable to find and assemble a complete coding sequence for either as likely signifying porin loss. The presence of any of the following factors were used to determine this • Absence of any of "unique contig", "complete gene found" or "assembled into one contig" ARIBA flags • Presence of any of "hit both strands: "region assembled twice", "scaffold graph bad" ARIBA flags • If length of the found sequence != length of the reference sequence • If percentage identity of protein alignment to reference sequence < 90% • If the found sequence contains a mutation causing truncation (i.e. the entire found sequence must be coding) Given reduced permeability is generally thought to contribute to multi-mechanism resistance in isolates [10], suspected porin loss was only used to predict amoxicillinclavulanate resistance when the isolate additionally contained a beta-lactamase.

Sequencing quality control
WGS data quality was ensured by obtaining and assessing the following metrics from the data.
• Sequencing data for each isolate needed to contain > 1000000 reads • The top species (as identified using Kraken [11]) • The de-novo assembly had to meet the following criteria o total length had to be between 4 and 6 megabases long o the N50 (minimum contig length needed to cover 50% of the genome) was greater than 50000 o the assembly contained less than 250 contigs greater than 500bp long o the assembly contained less than 800 contigs • The mapping of WGS data to to E. coli CFT073 (AE014075.1) had to meet the following criteria o > 60 % of reads mapped to reference o > 60 % of reference bases called as "A", "T", "C" of "G" • There were no signs of mixture identified by in-silico MLST typing by ARIBA Then for each isolate, it was only included in the agar dilution subsample if we obtained 2 or more MICs in essential agreement for all of amoxicillin, fixed-based amoxicillin-clavulanate and ratio-based amoxicillin-clavulanate. Any test with less than two "passing" sets of agar dilution results were re-tested (for all of amoxicillin, fixed and ratio tests) and included if it then had two or more MICs in essential agreement for all of amoxicillin, fixed-based amoxicillin-clavulanate and ratio-based amoxicillin-clavulanate. The reasoning behind this protocol of repeats and quality control requirements was to obtain MICs with at least the support of a duplicate test (suggesting that the sample was not mixed), but also to avoid creating selection bias by dropping samples with greater variability in MIC since this could select against some mechanisms of resistance with variable expression. Of note while the phenotype was repeated a minimum of three times, our quality control and retesting procedure resulted in some isolates having even numbers of tests for one or more antibiotics, necessitating the use of the "upper median MIC" (choosing the higher MIC when the median lay between two MIC readings) as described in the main text.

Categorization of resistance mechanisms identified from WGS for modelling
Supplementary Tables S2 and S3 detail the beta-lactamase genes (Supplementary Table S2), ampC promotors (Supplementary Table S3) and blaTEM promotors (Supplementary Table S3) found in the study isolates. In a subset of isolates undergoing agar dilution MICs, we estimated the individual effect of these resistance mechanisms. Some mechanisms occurred in very few isolates in this agar dilution subsample, and so could not be considered individually. Beta-lactamases found as "complete genes" by ARIBA were categorised by family and Bush-Jacoby [13] classification (Supplementary Table S5). bla genes in categories with fewer than 10 isolates and non-complete bla genes present in the agar dilution subsample were reclassified as an "other" group. This resulted in 5 categories of betalactamases • blaTEM:2b containing the class 2b blaTEM genes • blaCTXM:2be containing the class 2be blaCTX-M genes • blaOXA:2d containing blaOXA-1 genes (a class 2d enzyme) • blaSHV:2b containing blaSHV-11 genes (a class 2b enzyme) • Other_bla: containing the "other" beta-lactamases. In the agar dilution subsample this group contained 10 inhibitor resistant beta-lactamases, and 4 beta-lactamases with unknown impact on amoxicillin-clavulanate resistance.
Given the large number of individually relatively rare ampC promoters, they were grouped and classified as "ampCpr", equalling 1 if containing a resistance conferring mutation, 0 if otherwise. Likewise, blaTEM promoters were grouped and classified similarly as "blaTEMpr". Isolates with both a suspected loss of function of either ompC or ompF (see above) and a beta-lactamase were classified as having a significant permeability effect (NFOMP(nonfunctioning omp)=1, 0 otherwise).

Model selection: variance components
Random-effects models were constructed to relate each MIC (modelled on the log2 scale, so a 1 unit increase represents a doubling) to various genetic elements and were fitted using maximum likelihood estimation. Each test result was included as an observation, accounting for repeated tests of the same isolate using isolate-level random effects. Left censoring of co-amoxiclav MICs at 2/2 (EUCAST) and 2/1 (CLSI) mg/L, and right censoring at >128/2 mg/L (EUCAST only) in a small proportion of isolates was dealt with by dividing or multiplying the observed lower limit by half respectively (i.e. subtracting or adding 1 on the log2 scale [14]). Sets of agar plates containing varying concentrations of amoxicillin:clavulanate (fixed clavulanate 2mg/L (EUCAST) and 2:1 ratio clavulanate (CLSI)) were prepared in batches, which were then used to test 55 isolates for each batch. One source of variation in MICs is therefore the test batch. Another source of variation is the isolate itself. As not all batches contained the same isolates due to repeat testing, isolates were not nested within the same batch. The random effects structure was therefore chosen by taking the variance components structure with the lowest Akaike Information criterion (AIC) from random effects for isolate only, batch only, separate and crossed random effects for isolate and batch. The best fit was cross classified, with each individual MIC nested within both isolate and batch, and neither of isolate/batch nested in the other. Including the main 3 MLSTs individually or together as fixed effects did not improve model fit (p>0.14) and so MLSTs were not considered further.

Model selection: main fixed components
Given the need to control for the effects of the different clavulanate ratios intrinsic to the two test methods (EUCAST vs CLSI), models a-priori included a main fixed effect for test method and each genetic element, initially including all genetic elements as binary presence/absence (classical interpretation; multivariable model). We then considered whether there was evidence for additional effects of gene dosage. Modelling considered each beta-lactamase in turn, with the most common first (i.e. blaTEM, then blaCTX-M, blaOXA, blaSHV), for each of the following: • Binary (presence/absence, classical interpretation) (default) • Binary (presence/absence) + linear effect of increasing copy number when present (each absolute unit higher copy number has the same effect on log2 MIC) • Binary (presence absence) + log2-linear effect of increasing copy number when present (each doubling of copy number has the same effect on log2 MIC) The best fitting model for each beta-lactamase gene was chosen based on the AIC. To reduce the influence of outliers, we truncated one extreme copy number (54.8) for SHV2b (in rest of isolates median 1.77, IQR (1.70,9.05)) to the 95 th centile (16.27). There were no other clear copy number outliers.

Model selection: interactions
As defined, the blaTEMpr and NFOMP (porin loss) factors already represented an interaction with other beta-lactamases, since by definition they can only have an impact in the presence of a beta-lactamase; therefore these were omitted from further investigation of interactions. Given the importance of test method, we included all first order interactions between test method and all other main genetic factors in the model. All individually were p<0.05 on a likelihood ratio test comparing multivariable models with and without each specific interaction, supporting this choice, with the exception of the interaction between test method and NFOMP or blaCTX-M:2be which may be due to the smaller number of isolates with these mechanisms (Supplementary Table S5). Other first order interactions between included genetic elements were assessed if they were found together in 5 or more isolates, and were selected by forward selection using likelihood ratio tests with p<0.05. These included interactions all reduced MIC when multiple elements were found together, reflecting saturation effects whereby the combined effect of having two genetic elements which individually both increase MIC is not the additive effect of both, but slightly less than this (see equation below).
Finally, we considered whether the random effects (by batch and isolate) varied according to test methodology. Including heteroskedastic residual errors (i.e. allowing the residual error of each test MIC to vary across test methodology), and allowing the random error associated with each isolate's MIC and the random error associated with batch effects to vary according to test method improved model AIC and so were included, resulting in a final model as shown below. Covariance between test method was only estimable on isolate level random errors due to small numbers of clusters at batch level.
Where for an isolate i, batch j, and test k, and test methods 0 (ratio-based, CLSI), and 1 (fixed-based, EUCAST) (Note names in equations above represent model components)

Model-based MIC prediction
The model above was developed to reflect associations between resistance elements and agar-dilution MIC as accurately as possible, and hence considered potential non-linearity of associations and interactions. However, having developed the model, an obvious question is whether the fixed effect estimates could provide accurate MIC predictions. Cross-validation is a standard method for estimating internal error (i.e. in the same dataset on which models are developed, predicting the observed outcome from the factors included in the model). However, this "training" set was multi-level, including multiple replicates per isolate, potentially with different phenotypes for each of two different testing methodologies (EUCAST and CLSI). To stratify the "training" set for random selection into folds for crossvalidation, we therefore first divided the agar dilution subsample isolates into four strata To provide an external estimate of the accuracy of MIC prediction we compared the rounded prediction based on EUCAST fixed estimates with the observed BD Phoenix MIC. As the essential agreement between BD Phoenix and EUCAST agar dilution was 92% we focussed on non-subsample (715/976) isolates for this assessment of external error. Comparisons included all 715 isolates and 704/715 isolates that did not contain either 'incomplete' beta-lactamases, or beta-lactamases not present in the agar dilution subsample. Of note, in the 261 subsample isolates EUCAST MIC predictions agreed with BD Phoenix MIC for 175/261 (67%) isolates and were in essential agreement (within ±1 doubling dilution) for 243/261 (93%). Figure S1: Sampling frame and sample selection A: Overall

B: Stratified random sampling strategy for detailed agar dilution phenotyping
Note: samples included in subsample analysis if in addition to being selected, they passed additional quality control checks (see Supplementary Methods). Orange indicates discordance between WGS-predicted resistance using the basic algorithm (i.e. based on gene presence/absence alone) and AST observed phenotype discordance.

Supplementary Figure S2: Isolate STs, resistance mechanisms and amoxicillin-clavulanate phenotyping for a) the full dataset N=976) and b) the agar dilution subsample (N=261)
Note: Isolate ST and resistance mechanisms for a) the main dataset (N=976) and b) the agar dilution subsample (N=261). By sampling method, the agar dilution subsample was enriched for amoxicillin-clavulanate resistance. Single horizontal lines represent each isolate.     o: Other beta-lactam resistance mechanisms includes any other beta-lactamase, ampC promoter mutations suggestive of ampC hyper production and functional porin loss -: no isolates in this category *: "like" indicates inexact match with parent reference sequence, "unknown" indicates inability to identify a complete gene (ARIBA flag),

B: Frequencies and combinations of multiple beta-lactamases seen in isolates
Beta-lactamase combination Number of isolates with combination blaCTX -M-15, blaOXA-1  27  blaCTX-M-15, blaOXA-1, blaTEM-1  10  blaCTX-M-14, blaTEM-1  8  blaCTX-M-15, blaTEM-1  6  blaOXA-1, blaTEM-1  3  blaCMY-2, blaTEM-1  2  blaSHV-11, blaTEM-1  2  blaSHV-12, blaTEM-1  2  blaCMY-2, blaLAT*, blaTEM-1  1  blaCMY-2, blaCMY*, blaTEM-1  1  blaCTX-M-3, blaTEM-    *: Isolates predicted co-amoxiclav resistant due to non-functional porin in the presence of (any) beta-lactamase o: Truncations occur when frameshift mutations leading to premature stop codons are found in the sequence. Assembly issues are when our search method (ARIBA) failed to completely and uniquely assemble the gene (see Methods) or the gene was only found at <10x coverage (which was typically associated with multiple frameshift mutations, suggesting uncertainty of the sequence). Note: Genetic data for each isolate are available as part of PRJNA540750 at NCBI. Isolate identifiers are as above. *: Isolates predicted co-amoxiclav resistant due to non-functional porin in the presence of (any) beta-lactamase o: Truncations occur when frameshift mutations leading to premature stop codons are found in the sequence. Assembly issues are when our search method (ARIBA) failed to completely and uniquely assemble the gene (see Methods) or the gene was only found at <10x coverage (which was typically associated with multiple frameshift mutations, suggesting uncertainty of the sequence). Note: Genetic data for each isolate are available as part of PRJNA540750 at NCBI. Isolate identifiers are as above.