Novel Clinical mNGS-Based Machine Learning Model for Rapid Antimicrobial Susceptibility Testing of Acinetobacter baumannii

ABSTRACT Multidrug-resistant (MDR) bacteria are important public health problems. Antibiotic susceptibility testing (AST) currently uses time-consuming culture-based procedures, which cause treatment delays and increased mortality. We developed a machine learning model using Acinetobacter baumannii as an example to explore a fast AST approach using metagenomic next-generation sequencing (mNGS) data. The key genetic characteristics associated with antimicrobial resistance (AMR) were selected through a least absolute shrinkage and selection operator (LASSO) regression model based on 1,942 A. baumannii genomes. The mNGS-AST prediction model was accordingly established, validated, and optimized using read simulation sequences of clinical isolates. Clinical specimens were collected to evaluate the performance of the model retrospectively and prospectively. We identified 20, 31, 24, and 3 AMR signatures of A. baumannii for imipenem, ceftazidime, cefepime, and ciprofloxacin, respectively. Four mNGS-AST models had a positive predictive value (PPV) greater than 0.97 for 230 retrospective samples, with negative predictive values (NPVs) of 100% (imipenem), 86.67% (ceftazidime), 86.67% (cefepime), and 90.91% (ciprofloxacin). Our method classified antibacterial phenotypes with an accuracy of 97.65% for imipenem, 96.57% for ceftazidime, 97.64% for cefepime, and 98.36% for ciprofloxacin. The average reporting time of mNGS-based AST was 19.1 h, in contrast to the 63.3 h for culture-based AST, thus yielding a significant reduction of 44.3 h. mNGS-AST prediction results coincided 100% with the phenotypic AST results when testing 50 prospective samples. The mNGS-based model could be used as a rapid genotypic AST approach to identify A. baumannii and predict resistance and susceptibility to antibacterials and could be applicable to other pathogens and facilitate rational antimicrobial usage.

To ensure the high quality of the collected Acinetobacter baumannii (A. baumannii) genomes, a customized criterion formulated by reference to the NCBI genome exclusion rules was applied to discard the low-quality genomes. The customized criteria for excluding genomes with low quality were as follows: a) multiple but inconsistent AST conclusions; b) genome assembly N50 < 5000 bp or number of contigs > 2000; c) assembled genome length was 1.5 times longer or 0.5 times shorter than the average length of the target species; d) the number of predicted genes was 0.5 times lower than the average number of the target species or not satisfied (0.5 ≤ predicted number of CDS/(genome length/1000) ≤ 1.5); e) average nucleotide identity (ANI) < 0.95 or the taxonomy annotation by aligning genome contigs to the NCBI NT database was not consistent with the target species; and f) 16S rDNA was not detected.

Section 2: Phylogenetic analysis and multilocus sequence typing (MLST)
Phylogenetic trees were constructed based on the single-copy core genes of A.
baumannii to explore genetic diversity within the population. The coding genes for each strain were first predicted using GeneMarkS software (Version 4.17) (http://topaz.gatech.edu/GeneMark) [1] . Subsequently, core and specific genes were analyzed by CD-HIT (Version 4.6.1) for rapid clustering of similar proteins, with a threshold of 50% pairwise identity and 0.7 length difference cutoff in amino acids [2] .
Multiple sequence alignment of single-copy genes was performed using MUSCLE software, and phylogenetic trees inferred with the maximum likelihood model were constructed using Treebest (Version 1.9.2) based on transformed CDS [3] .
All pairwise genes in the reference database were aligned using MUSCLE software (Version 3.8.31) to calculate the similarity: similarity = total number of correctly matched bases/total global length of alignment. The cluster analysis was conducted based on the sequence similarity between any two members within the ADC family, and a cluster tree was plotted with the PPV annotation of CAZ, CPM, and IPM for A. baumannii (Supplementary Figure 1). Hence, ADC subfamilies, such as ADC-30-like and ADC-240-like, were defined.

Section 4: Assembly-based ARG test using WGS data
Assembled genome contig sequences were aligned to the curated ARG reference database using BLASTN software (Version: ncbi-blast-2.9.0+, Parameters: -evalue 1e-5 -outfmt 0 -num_alignments 10000) to test the presence or absence and SNPs/InDels of antibiotic resistance genes by parsing the m0-format alignment result with a self-built Perl script program. Only hits with identity ≥ 90% and subject coverage ≥ 60% were retained, in which the test performance was verified with a consistency of above 0.95 by comparison to annotation records from the NCBI NDARO database. For each contig's alignment regions, the hit with the highest value (hit_score*subject_coverage) was selected as the best and was annotated. Meanwhile, the copy number of detected ARGs was counted with the following formula: ARG copy number = (coverage of target ARG* depth of cover regions)/(coverage of strain genome * depth of genomic cover regions).

Section 5: ARG annotation and selection for the read-based ARG test pipeline
The ARG annotation and subtyping pipeline was developed to correctly characterize ARG subtypes by directly aligning unassembled read sequences to the ARG reference database using BLASTN software (parameters: -evalue 1e-5 -outfmt 0 -num_alignments 10000) based on mNGS. Briefly, for each read's annotation, a greedy algorithm and specific alignment strategy were first applied to identify the true or false positive of ARG, and the false positive hits were filtered out. The remaining ARGs detected were then annotated using the LCA approach. In addition, due to the randomness of mNGS, ARGs with uneven coverage of the read sequence were also considered false positives and filtered out. For the ARG belonging to a variant model such as gyrA, the variant site of < 0.2 frequency was presumed to be a false result and filtered out (Supplementary Figure 2).

Section 6: Simulation experiment
Short read sequences were simulated using ART software (Version 2.5.8, parameters:-ssNS50 -l 75 -f 5 -nf 0 -rs 1) to conduct read-based ARG tests. For screening key ARG subtypes, Illumina SE75 reads of different data amounts were simulated to evaluate and summarize the normal range of uniformity coverage. The threshold of uniformity coverage was determined to filter out false positive results.

Simulation test for mNGS-AST reporting rules.
For all strains in the training set, Illumina SE75 reads of a gradient genome sequencing depth (0.01x, 0.02x, 0.03x, 0.05x, 0.1X, 0.2x, 0.3x, 0.4x, 0.5x, 0.6x, 0.7x, 0.8x, 0.9x, 1x, 2x, 3x, 5x, 10x, 30x) were simulated to evaluate the effect of the amount of sequencing data on the performance of the prediction model and to determine the lowest sequencing depth in each model. The test performance of read-based ARG detection was assessed under 30x sequencing depth using sensitivity, specificity, and accuracy calculated by referring to the assembly based ARG test 7 result. The AUC value was calculated, and a line chart was drawn to assess the performance under different sequencing data sizes for each prediction model. The lowest sequencing depth could be determined when the AUC value became stable first with increasing data size for each model.
Given the data size of sequencing and performance of the prediction model, the thresholds could be set for reporting "resistant" and "susceptible", respectively. The threshold for reporting "resistant" was determined under sufficient sequencing data size when the Youden index was max (recorded as R_ cutoff). The threshold of reporting "susceptible" was set by two criteria. On the one hand, R_cutoff was used as the threshold of reporting "susceptible" (recorded as S_Cutoff2), and the NPV must be large (usually >= 0.9). Otherwise, "susceptible" cannot be reported. On the other hand, the maximum score value with higher NPV (usually >= 0.9) was recorded as the threshold of reporting "susceptible" (recorded as S_cutoff1) under the minimum data amount of sequencing depth (recorded as gf_msd1), which must be less than or equal to S_cutoff2. Whether S_cutoff2 was suitable for the other simulated data sizes between gf_msd1 and 30X also needs to be assessed. Another minimum data size that met the designated NPV threshold (such as 0.9) was located and recorded as gf_msd2.
In summary, the reporting rules for predicting resistance and susceptibility to an antibiotic were as follows: a) If the score value was >= R_cutoff, it was reported as "resistant".
b) If the genome coverage of the pathogen was >= gf_msd2 and the score value was less than S_cutoff2, it was reported as "susceptible". c) If the genome coverage of the pathogen was >= gf_msd1 && < gf_msd2 and the score value was < S_cutoff1, it was reported as "susceptible". d) If the genome coverage of the pathogen was >= gf_msd1 && < gf_msd2 and the score value was >= S_cutoff1 && < S_cutoffs, it was reported as "/", namely, "not predicted". Specifically, for a purely variation model such as CIP with a higher AUC value (usually >= 0.9) in the WGS-AST model, if all variable sites were 8 covered and the score value was < S_cutoff1, it was also reported as "susceptible". e) If the genome coverage of the pathogen was < gf_msd1 and the score was < S_cutoff1, it was reported as "/". Specifically, for a purely variation model such as CIP with a higher AUC value (usually >= 0.9) in the WGS-AST model, if all variable sites were covered and the score value was < S_cutoff1, it was also reported as "susceptible".

Simulation testing of the species attribution of ARGs.
A total of A. baumannii strains with the OXA-23 gene (without NDM) and 30 Klebsiella pneumoniae strains with the NDM gene (without OXA-23) were selected from the training set and downloaded from the NCBI genome database. Illumina SE75 reads of a gradient genome sequencing depth (0.01x, 0.02x, 0.03x, 0.05x, 0.1x, 0.2x, 0.3x, 0.4x, 0.5x, 0.6x, 0.7x, 0.8x, 0.9x, 1x, 2x, 3x, 5x, 10x) were simulated based on each strain genome and were mixed together between any two A. baumannii and K. pneumoniae strains. Then, the read-based ARG test pipeline was conducted to evaluate the accuracy of ARG species attribution. 9

Supplementary Figures
Supplementary Figure 1. Cluster analysis to define ADC subgroups associated with resistance to CAZ, CPM, and IPM. On the left, the cluster tree was plotted based on sequence identity between any two members within the ADC family. The PPV of each ADC subtype to CAZ, CPM, or IPM is displayed on the right panel, with the major ADC subgroups outlined with black dashed lines, such as the ADC-30-like family.