Identification of a Specific Biomarker of Acinetobacter baumannii Global Clone 1 by Machine Learning and PCR Related to Metabolic Fitness of ESKAPE Pathogens

ABSTRACT Since the emergence of high-risk clones worldwide, constant investigations have been undertaken to comprehend the molecular basis that led to their prevalent dissemination in nosocomial settings over time. So far, the complex and multifactorial genetic traits of this type of epidemic clones have allowed only the identification of biomarkers with low specificity. A machine learning algorithm was able to recognize unequivocally a biomarker for early and accurate detection of Acinetobacter baumannii global clone 1 (GC1), one of the most disseminated high-risk clones. A support vector machine model identified the U1 sequence with a length of 367 nucleotides that matched a fragment of the moaCB gene, which encodes the molybdenum cofactor biosynthesis C and B proteins. U1 differentiates specifically between A. baumannii GC1 and non-GC1 strains, becoming a suitable biomarker capable of being translated into clinical settings as a molecular typing method for early diagnosis based on PCR as shown here. Since the metabolic pathways of Mo enzymes have been recognized as putative therapeutic targets for ESKAPE (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) pathogens, our findings highlight that machine learning can also be useful in knowledge gaps of high-risk clones and provides noteworthy support to the literature to identify relevant nosocomial biomarkers for other multidrug-resistant high-risk clones. IMPORTANCE A. baumannii GC1 is an important high-risk clone that rapidly develops extreme drug resistance in the nosocomial niche. Furthermore, several strains have been identified worldwide in environmental samples, exacerbating the risk of human interactions. Early diagnosis is mandatory to limit its dissemination and to outline appropriate antibiotic stewardship schedules. A region with a length of 367 bp (U1) within the moaCB gene that is not subjected to lateral genetic transfer or to antibiotic pressures was successfully found by a support vector machine model that predicts A. baumannii GC1 strains. At the same time, research on the group of Mo enzymes proposed this metabolic pathway related to the superbug's metabolism as a potential future drug target site for ESKAPE pathogens due to its central role in bacterial fitness during infection. These findings confirm that machine learning used for the identification of biomarkers of high-risk lineages can also serve to identify putative novel therapeutic target sites.

ABSTRACT Since the emergence of high-risk clones worldwide, constant investigations have been undertaken to comprehend the molecular basis that led to their prevalent dissemination in nosocomial settings over time. So far, the complex and multifactorial genetic traits of this type of epidemic clones have allowed only the identification of biomarkers with low specificity. A machine learning algorithm was able to recognize unequivocally a biomarker for early and accurate detection of Acinetobacter baumannii global clone 1 (GC1), one of the most disseminated high-risk clones. A support vector machine model identified the U1 sequence with a length of 367 nucleotides that matched a fragment of the moaCB gene, which encodes the molybdenum cofactor biosynthesis C and B proteins. U1 differentiates specifically between A. baumannii GC1 and non-GC1 strains, becoming a suitable biomarker capable of being translated into clinical settings as a molecular typing method for early diagnosis based on PCR as shown here. Since the metabolic pathways of Mo enzymes have been recognized as putative therapeutic targets for ESKAPE (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) pathogens, our findings highlight that machine learning can also be useful in knowledge gaps of high-risk clones and provides noteworthy support to the literature to identify relevant nosocomial biomarkers for other multidrug-resistant high-risk clones. IMPORTANCE A. baumannii GC1 is an important high-risk clone that rapidly develops extreme drug resistance in the nosocomial niche. Furthermore, several strains have been identified worldwide in environmental samples, exacerbating the risk of human interactions. Early diagnosis is mandatory to limit its dissemination and to outline appropriate antibiotic stewardship schedules. A region with a length of 367 bp (U1) within the moaCB gene that is not subjected to lateral genetic transfer or to antibiotic pressures was successfully found by a support vector machine model that predicts A. baumannii GC1 strains. At the same time, research on the group of Mo enzymes proposed this metabolic pathway related to the superbug's metabolism as a potential future drug target site for ESKAPE pathogens due to its central role in bacterial fitness during infection. These findings confirm that machine learning used for the identification of biomarkers of high-risk lineages can also serve to identify putative novel therapeutic target sites.
ML can deal with large and diverse data sets to extract relevant information (57). Given the wide use of ML in biology, and the absence of accurate identifiers for early detection of A. baumannii GC1 strains, our study aimed to assess whether ML could be applied to process thousands of genomes to identify a suitable A. baumannii GC1 biomarker. We also wanted to analyze whether ML could be combined with other techniques such as PCR and/or quantitative PCR with high-resolution melting (HRM) assays to provide a molecular typing method capable of being translated into clinical settings. For these reasons, we built predictive models for typing A. baumannii GC1 genomes by training SVM and SCM classifiers. From these classifiers, we identified a new and specific genomic biomarker for the early detection of A. baumannii GC1 strains by a PCR technique not subjected to the selective pressure of antibiotics nor to lateral genetic transfer.

RESULTS
To identify new genomic biomarkers that uniquely identify strains belonging to A. baumannii GC1, we applied the SVM and SCM algorithms to data set 1 (500 genomes) and data set 2 (4,799 genomes) (see Tables S1 and S2 in the supplemental material). First, we applied both algorithms to data set 1, which was composed of 200 A. baumannii GC1 genomes and 300 A. baumannii non-GC1 genomes. We aimed to predict whether a particular genome in data set 1 belonged to A. baumannii GC1 or not by using SVM and SCM algorithms. Data set 1 was used as input for both algorithms during the training and testing of the models. Once we obtained accurate models that predicted putative biomarker sequences, we used the second A. baumannii genome collection, data set 2 (Table S2). Data set 2 was composed of 312 A. baumannii GC1 genomes and 4,487 A. baumannii non-GC1 genomes. By using blastn searches, we analyzed whether the predicted putative biomarker sequences were also found in data set 2 genomes, maintaining the same pattern found in data set 1. This analysis was done to perform an external validation of the predictions made by both algorithms. The study workflow is summarized in Fig. 1.
Obtaining unique A. baumannii GC1 predictive sequences with SVM. The Pasteur scheme for MLST has the potential to identify isolates belonging to A. baumannii GC1, providing a neat demarcation of sequence types (STs) composing A. baumannii GC1 and non-GC1 clonal complexes (5,(58)(59)(60)(61). Considering this fact, we annotated the ST of each genome in data sets 1 and 2 and categorized them as "GC1" or "non-GC1" (see "MLST classification" in Materials and Methods). We then ran the DBGWAS program using data set 1 genome sequences as input; we obtained a total of 1,622,573 distinct unitigs that represented sequences of diverse length of data set 1, taking into account the genomic variation (62). The length of the unitigs obtained was between 31 and 32,759 bp. We used the variant matrix built by DBGWAS to create the binary matrix used as input for the SVM algorithm. Unitigs that were found in fewer than 50 genomes from data set 1 were discarded; we kept only unitigs that were 31 to 385 bp in size in the input binary matrix. Parameter tuning and model validations were performed using a 5-fold cross-validation and a grid search over a range of given values to determine the SVM kernel and hyperparameters that generated the best area under the curve (AUC). Once we obtained the SVM model that best fitted the data set 1 genomes, we extracted the first 100 unitigs that contributed most to A. baumannii GC1 strain prediction (Table 1; Table S3) according to the values of the features weight vector. By using blastn, we corroborated if the unitigs predicted by SVM were specific for A. baumannii GC1 detection. For this purpose, we searched for the unitig sequences in the genomes of data set 2 to assess which unitigs had the greatest number of matches within each genome class (A. baumannii GC1 or non-GC1) ( Table 2; Table S4). We also annotated the genome location and gene product related to the 100 unitigs (Tables 1 and 3; Tables S3 and S5).
Building SCM models for A. baumannii GC1 identification. In an additional approach, we applied the SCM algorithm implemented in Kover (29). We used as input the k-mers of the 500 A. baumannii genomes contained in data set 1, considering the class label for each genome (A. baumannii GC1 or non-GC1). We ran Kover using k-mer sizes ranging from 31 to 127 nucleotides. Although output rules could be conjunctions (logical-AND) or disjunctions (logical-OR), we obtained 49 simple rules without conjunctions or disjunctions in the models of our study (Table 3; Table S5). Rules depicted the presence (n = 10) or absence (n = 39) of k-mers in data set 1 genomes. In addition, we registered the number of matches of the k-mer sequences against A. baumannii GC1 and non-GC1 genomes contained in data set 1 (Table 4; Table S6); we also annotated the k-mer sequences (Table 4; Table S6).
We observed that the rules obtained by the SCM models selected fragments of different lengths that matched the loci ACICU_02924 (n = 23), ACICU_02095 (n = 5), ABAYE3455 (n = 5), ACICU_01506 (n = 2), and ABAYE2468 (n = 2) ( Table 3; Table S5). This fact could be caused by point mutations contained in the loci that the SCM models associated with A. baumannii GC1 prediction as previously reported by Kover developers (29).
Selection of candidate biomarkers for rapid detection of A. baumannii GC1. We were interested in obtaining the longest DNA sequences shared by all A. baumannii GC1 genomes and, at the same time, without matches among A. baumannii non-GC1 genomes. For this purpose, unitigs and k-mer sequences obtained as candidate biomarkers for A. baumannii GC1 using the SVM and SCM algorithms were sorted in descending order according to sequence length. Then, we considered the number of A. baumannii GC1 genomes that matched the unitig sequence or the SCM rule and The diagram indicates the steps we followed during the present work related to the genome collections used, data set preparation, ML analysis, and the design of a method for A. baumannii GC1 strain identification.
Identification of Acinetobacter baumannii GC1 mSystems  Fig. 2 and also in Tables S3 and S4 in the supplemental material. We named unitigs from U1 to U100. We observed that the unitigs named U1 to U12 were found in 100% of A. baumannii GC1 genomes while they were not found in A. baumannii non-GC1 genomes from data set 1. However, U1 was the only unitig found in 100% of the A. baumannii GC1 genomes within data sets 1 and 2 and absent in A. baumannii non-GC1 genomes of both data sets, being a specific biomarker for A. baumannii GC1 identification. Despite the fact that the U8 sequence was found in 100% of A. baumannii GC1 genomes within data sets 1 and 2, it was also found in 1/ 4,487 of A. baumannii non-GC1 genomes from data set 2. The U8 sequence matched the A. baumannii AYE genome in locus tag ABAYE1412 between coordinates 651 and 691 (Table 1; Table S3). The fragment is part of a gene that encodes a putative acyl coenzyme A (acyl-CoA) dehydrogenase protein (acdB-like). As the U8 sequence was not found exclusively in A. baumannii GC1 genomes, we discarded it as a possible biomarker of A. baumannii GC1. Among these 12 unitigs, the U4 sequence was the one that had a higher number of matches (43/4,487) with A. baumannii non-GC1 genomes from data set 2.
The SCM results are shown in Tables 3 and 4 and Fig. 3 as well as in Tables S5 and S6. We observed that 43/49 rules targeted 100% of A. baumannii GC1 genomes from data set 1 but also targeted between 1 and 4 A. baumannii non-GC1 genomes from data set 1. Within these 43 rules, two sets of rules targeted only 1 A. baumannii non-GC1 genome from data set 1; as an example of this, 26/49 rules targeted a genome with accession no. (AN) GCF_000248195.1 (ST 69) and 1/49 rules (R49) targeted a genome with AN GCF_000453745.1 (ST 2) (Fig. 3). Concerning data set 2, 12/49 rules targeted 100% of A. baumannii GC1 genomes but also targeted several A. baumannii non-GC1 genomes from data set 2. Since no rule obtained can uniquely identify A. baumannii GC1 genomes contained in our data sets, we were unable to obtain a putative biomarker from the results of the SCM models.
When using data set 2 to validate the results obtained by the SVM and SCM models from data set 1, we observed that in the case of the SVM classifier, there was no statistically significant difference (P . 0.05) in the number of matches of the unitigs regardless of the genome data set (data set 1 or 2) except for unitig U69 ( Table 2; Table S4). In the case of U69, we found a significant difference between the numbers of matches baumannii GC1 and non-GC1 genomes typed by MLST that matched the unitig within data sets 1 and 2 using blastn, and the P values of Fisher's exact test using a significance level of 0.05. Fisher's exact test was calculated in R by considering the nominal variables "data set source" (data set 1 or 2) and "matched" (yes or no). The total number of A. baumannii GC1 and non-GC1 genomes that matched/did not match the unitigs in each data set was used for calculation. The total data of the 100 unitigs predicted by SVM are detailed in Table S4 in the supplemental material.
with A. baumannii non-GC1 genomes from data sets 1 and 2 (P = 0.01077). Concerning the SCM classifier, we observed that there was no statistically significant difference in the number of A. baumannii GC1 and non-GC1 genomes from data sets 1 and 2 that matched the rules (P . 0.05) ( Table 4; Table S6). As we mentioned, the SVM model predicted the U1 sequence as a specific biomarker for A. baumannii GC1 genomes. The U1 sequence had 367 nucleotides and matched the A. baumannii AYE genome in locus tag ABAYE1552 between coordinates 558 and 924 (Table 1; Table S3). This region corresponds to a fragment of the moaCB gene, which encodes a bifunctional protein that includes molybdenum cofactor biosynthesis protein C and protein B (63). In particular, the U1 sequence matched the region of the moaCB gene, which encodes MoaB protein. The molybdenum cofactor (Moco) is an essential component of a large family of enzymes involved in carbon, nitrogen, and sulfur metabolism whose biosynthetic pathway is evolutionarily conserved. The MoaC protein, together with the MoaA protein, is , and the gene name corresponding to the genome region matched and the gene product of the 10 larger k-mer sequences targeted by the SCM rules. A. baumannii AYE and ACICU genomes were used as A. baumannii GC1 and non-GC1 references, respectively, to locate the k-mer sequences. The prefix "REGION" was used when the match occurred either in an intergenic region or in a combination of an intergenic region and a gene. The complete list of the 49 rules obtained by the SCM models is detailed in Table S5 in the supplemental material. ND, no data.
involved in the first step of Moco biosynthesis (63). Interestingly, various studies have linked in-host survival of prevalent pathogenic bacteria such as Mycobacterium tuberculosis, Escherichia coli, and Salmonella enterica to the presence of functional molybdoenzymes (64)(65)(66)(67). In our study, we found that the U1 sequence was 100% conserved in A. baumannii GC1 genomes (no single nucleotide polymorphisms [SNPs]), while it was a variable region with 1 to 74 SNPs in a total of 4,987 A. baumannii non-GC1 genomes from data sets 1 and 2, rendering 94 allelic variants in A. baumannii non-GC1 genomes. According to these results, the sequence of the moaCB gene comprising between 558 and 924 nucleotides considering the nominal variables "data set source" (data set 1 or 2) and "matched" (yes or no). The total number of A. baumannii GC1 and non-GC1 genomes that matched/did not match the rules in each data set was used for calculation. The total data of the 49 SCM rules are detailed in Table S6 in the supplemental material. Identification of Acinetobacter baumannii GC1 mSystems allowed accurate discrimination for A. baumannii GC1 and non-GC1 genomes, becoming a biomarker that differentiates the two groups. Moreover, we observed that the U1 sequence is conserved in all the A. baumannii GC1 genomes, but this is not the case for A. baumannii GC2. For this reason, the U1 sequence could not be used as a biomarker for accurate detection of A. baumannii GC2 strains. Since 95 variants within 5,299 A. baumannii strains were identified with multiple SNPs along the entire sequence of U1, we proposed that a strategy based on PCR amplification would allow us to accurately differentiate A. baumannii GC1 from non-GC1 strains. For this reason, we designed the primer pair BioM_GC1_ABA F (59-TATTCATAGCCTCCTGGATGC-39) and BioM_GC1_ABA R (59-CCAGATGAAGCGGATACTTTG-39), with coordinates 559 to 914 from the ABAYE1552 locus tag representing 356 bp of the U1 sequence (positions 2 to 357). By a blastn search, we identified that this primer pair amplified only the U1 sequence recognized in A. baumannii GC1 strains among all the variants detected so far in non-GC1 genomes. The closest variants showed a mismatch of 1 nucleotide at the 39 end of the reverse primer in A. baumannii ST163, ST411, and ST976 (AN GCF_015537765, GCF_000453725, and GCF_010500415, respectively).
Experimental analysis, with a total of 35 A. baumannii strains, i.e., 10 A. baumannii GC1 strains and 25 A. baumannii non-GC1 strains, including 4 ST2 (GC2), 15 ST79, 5 ST119, and 1 ST404, was performed to test the primer pair BioM_GC1_ABA F and BioM_GC1_ABA F (Fig. S1). A. baumannii GC1 strains that had been isolated more than 25 years apart, such as A144 (1997), A155 (1994), and HAX25Aba (2021), were tested (68, 69). The pair of primers designed in this study amplified by PCR only the A. baumannii GC1 strains, and as baumannii GC1 strain A144 and water were used as positive and negative controls, respectively. The PCR cycling conditions consisted of an initial denaturation at 94°C for 5 min, followed by 30 cycles of denaturation at 94°C for 45 s, annealing at 58°C for 45 s, and extension at 72°C for 30 s, and then a final extension at 72°C for 5 min. Alternatively to this PCR to identify A. baumannii GC1, TaqMan assays could also provide a molecular typing method capable of being translated into clinical settings to differentiate A. baumannii GC1 and other relevant GCs or STs.
Evaluation of the performance of the SVM and SCM models. To avoid overfitting, a 5-fold cross-validation was performed with the training data sets used as input in both the SVM and SCM algorithms to select the best hyperparameter values with the highest AUC. Performance during testing was evaluated using the best hyperparameters obtained in terms of sensitivity, specificity, accuracy, precision, and F1 score (Tables 5 and 6; Table S7). The sensitivity of the SVM model was 1 6 0.00, indicating that 100% of A. baumannii GC1 genomes were correctly identified within the testing data set. Also, the SVM model achieved a high specificity (1 6 0.00) when predicting A. baumannii non-GC1 genomes from the testing data sets. The precision value was 1 6 0.00 when the model predicted that a genome was within A. baumannii GC1, being correct 100% of the time. Accuracy was 1 6 0.00, indicating that 100% of A. baumannii GC1 and non-GC1 genomes were correctly predicted. No false positives and no false negatives were predicted by the SVM model. Regarding the SCM models, the mean values of sensitivity, specificity, precision, and accuracy were 1 6 0.01, 0.98 6 0.01, 0.97 6 0.01, and 0.99 6 0.01, respectively. The mean rate of false positives was 1.11%, while the mean rate of false negatives was 0.097%. All the rules obtained by the SCM models predicted false positives within the testing data set ( Table 6).
The training and the testing accuracies of the SVM and SCM models were above 0.99 (Fig. 4), indicating that the SVM and SCM models were not overfitted. The F1 score, which is the harmonic mean of precision and recall that is commonly used to compare different classification algorithms, was similarly high in the SVM and SCM models (1.00 and 0.99 6 0.01, respectively). We used Fisher's exact test to evaluate the performance of the SVM and SCM predictions, comparing the actual genome classes (A. baumannii GC1 and non-GC1) typed by MLST and the predicted classes obtained by the models. In both cases, we obtained a P of ,2.2e 216 (Tables 5 and 6; Table S7), indicating that the SVM and SCM models could significantly classify A. baumannii GC1 and non-GC1 strains. Despite these results, since the aim of this work was to find a biomarker that uniquely identifies A. baumannii GC1 genomes, the SVM model performed better than the SCM models since it did not predict false positives or false negatives.

DISCUSSION
High-risk clones, also called "superbugs," are dangerous clonal complexes with epidemic behavior equipped with exceptional resources both to infect the host and to evolve to extreme drug resistance phenotypes over time in the nosocomial niche (1,(70)(71)(72)(73)(74)(75)(76). A molecular understanding of the genetic and/or transcriptomic traits that lead to these capabilities is still unknown (77,78). Our study showed that ML applied to the study of high-risk clones can not only help in the identification of thoroughly accurate biomarkers but also contribute to disentangling molecular pathways that lead to epidemic lineages in the nosocomial niche which have not yet been completely deciphered. Accordingly, these findings could be used as therapeutic targets to reduce the dissemination of lineages with epidemic behavior. This is the case of the U1 sequence identified in the present study, which corresponded to 367 bp of the moaCB gene encoding a bifunctional protein that includes molybdenum cofactor biosynthesis protein C and protein B (63). Mononuclear molybdoenzymes (Mo enzymes) occur in organisms in all domains of life, where they mediate essential cellular functions such as energy generation and detoxification reactions (79). It has been shown that in bacterial pathogens, several processes such as molybdate uptake, cofactor biosynthesis, and the activities of Mo enzymes affect fitness in the host as well as virulence (79). In addition to many studies on Mo enzymes that identified their crucial role in pathogenic species such as E. coli, S. enterica, Campylobacter jejuni, and Mycobacterium tuberculosis, some reports recently identified that these enzymes also contribute to the survival of ESKAPE pathogens (65,(80)(81)(82)(83)(84). Experimental studies must be undertaken to investigate the role of the moaCB gene in the virulence and fitness of A. baumannii, which is also included in the ESKAPE group. Based on our results and previous experimental data in other pathogenic species (79), we can hypothesize that the U1 fragment of the moaCB gene in A. baumannii GC1 may be related to an essential metabolic pathway that plays a vital role in the maintenance of epidemic clones in the hospital environment. Since it has been found that most of the Mo enzymes belong to groups that are unique to prokaryotes, these have been proposed as promising targets for the development of new antibiotic agents (79). Given the variability observed in biomarker U1 between A. baumannii GC1 and the 94 variants in A. baumannii non-GC1 strains, we propose a simple molecular biology strategy of one-step PCR amplification to accurately differentiate A. baumannii GC1 from non-GC1 strains without performing MLST, or WGS plus comparative genomics. The strategy proposed here, which was experimentally tested in this study, is accessible to a wide range of clinical and/or research laboratories. In addition, as the methodology consists of a single PCR, the detection of A. baumannii GC1 strains can be performed from the colony or directly from the clinical sample, giving the possibility of an early and simple diagnosis of this lineage.
Due to the increasing availability of bacterial WGS data, very active research has emerged on the use of this tool for genotype-phenotype prediction of antibiotic susceptibility (29,32,33,43,53,(85)(86)(87); however, there are no data available concerning the identification of high-risk clones based on WGS data excluding MLST. Previous PCR-based studies used bla OXA-51 as one of the three targeted genes to discriminate strains of A. baumannii GC1, GC2, and GC3 (18,19), and the detection of a deletion of 108 bp in the 59 conserved segment (59-CS) of the class 1 integron has also been proposed for identification of A. baumannii GC1 strains (19). Since bla OXA-51 has been found in other species (88)(89)(90) and the deletion of 108 bp as a biomarker partially differentiates two lineages within A. baumannii GC1 (7) and considering that both can be subjected to lateral genetic transfer events, the analysis provided by ML in the present study supports a more accurate and solid tool to evaluate the presence of A. baumannii GC1 strains. Accordingly, U1 is part of an essential gene in the A. baumannii GC1 genome (moaCB). This suggests that the optimization of metabolic genes from the core genome may be related to the exceptional abilities of high-risk clones. Interestingly, Identification of Acinetobacter baumannii GC1 mSystems our data suggest that the accessory genome such as genomic islands or transposons involved in pathogenicity or antibiotic resistance, at least in A. baumannii GC1 strains, would not have played a causal role in the adaptation of this lineage to the hospital niche over time (1). In our work, we applied two ML algorithms that differ substantially in methodology. The SVM algorithm used the number of unitigs occurring in A. baumannii GC1 and A. baumannii non-GC1 genomes to learn and predict clonal membership of the strains, while the SCM algorithm used a greedy approach to construct conjunction or disjunction rules to find the most concise set of k-mers that allows for accurate A. baumannii GC1 or A. baumannii non-GC1 genome prediction. Previously, methods that combined the use of the SVM and SCM algorithms and the representation of genomic data as kmers were used to find genomic biomarkers to identify antibiotic resistance (29) or to predict antibiotic resistance from WGS data (53,85). We ran the SCM algorithm through the Kover program with k-mer lengths between 31 and 127 nucleotides to be able to analyze all the possible rules obtained from these k-mer lengths. Also, we ran the SVM algorithm using unitigs that usually corresponded to a longer sequence than that of the individual equivalent k-mers. Unitigs are defined as the longest sequences that can be obtained when k-mers overlap by exactly k-1 nucleotides (62,87,91). In k-mer-based genome representations, the main downside is that the representation contains a lot of redundancy, since many k-mers are always present or absent simultaneously (e.g., gene deletion/insertion). In this sense, it has been proposed that k-mers be replaced with unitigs (62,87,91).
Although the best model that fitted our data was a linear SVM model, we evaluated linear, polynomial, radial basis function, and sigmoid SVM kernels. Other genome-wide association studies (GWAS) tools that apply ML techniques to the prediction of phenotypes from genotype data, such as Pyseer (92) or PhenotypeSeeker (93), use different approaches to find the best models to solve classification problems. Pyseer uses generalized linear models (GLiM), a linear regression model, and PhenotypeSeeker uses a logistic regression (LR) model. While GLiM performs a more simplistic linear regression using a set of observed values to predict its response variable, SVM deploys much more sophisticated techniques (57,94,95). SVM can perform kernel tricks that can handle nonlinear data, thus making the nonlinear data appear to be linear. These tricks cannot be done by GLiM or LR (39,48,96). Moreover, it has been shown that when it is of interest to predict the group to which a new observation belongs, based on a single variable, SVM models are a feasible alternative to LR since SVM models require fewer variables to perform better than or as well as LR (97,98). In addition, the risk of overfitting is less in SVM, while LR is vulnerable to overfitting (99). In comparison with Pyseer and PhenotypeSeeker approaches, our methodology considered the testing of different SVM kernels and allowed the possibility of using linear or complex nonlinear functions to find the best model. On the other hand, Pyseer and PhenotypeSeeker perform weighting of strains by using a distance matrix of the strains to account for population structure (92,93). Although the implementation of our algorithm did not include prior knowledge of population structure, we could successfully find a specific biomarker for A. baumannii GC1 strains.
We also proved by using statistical methods that the SVM and SCM models could significantly classify A. baumannii GC1 isolates (P , 0.05). While the SVM classifier predicted the U1 sequence as a specific biomarker for A. baumannii GC1 genomes, none of the rules obtained with the SCM models was able to uniquely identify A. baumannii GC1 genomes. All the rules obtained by the SCM models matched in A. baumannii GC1 and A. baumannii non-GC1 genomes from both data sets 1 and 2. Due to this result, it was not possible to obtain a sequence that could be used as a specific biomarker for A. baumannii GC1 strains from the rules obtained by the SCM models. A key step for the successful implementation of ML algorithms is the preparation of the input data sets (57, 100-102). In our study, we faced two issues related to the preparation of data sets 1 and 2. On the one hand, a limitation in the program DBGWAS during the preparation of the matrix used in the SVM algorithm meant that only a total of 500 genomes could be included in the training data set. Since our goal was to use the same training data set for the SVM and SCM models, the 500 genomes were also used as input in the SCM algorithm. On the other hand, the scarcity of A. baumannii GC1 genomes available in GenBank in comparison to A. baumannii non-GC1 genomes caused data set 2 to have visibly more genomes representing the A. baumannii non-GC1 class. Despite this fact, the results obtained from data sets 1 and 2 remained consistent in 100% (P . 0.05) of the cases for the SVM model and more than 91.83% (P . 0.05) of the cases for the SCM models. Perhaps the limitation in the number of genomes in the training data set is one of the reasons why the SCM models did not have enough samples to learn from A. baumannii GC1 genomes and therefore could not find a rule that uniquely identifies them. Conversely, the numbers of isolates predicted to be A. baumannii GC1 or non-GC1 by the SVM model using unitig U1 were the same as the result obtained by the MLST technique. This fact indicated that the SVM model obtained was excellent in the classification of A. baumannii GC1 and non-GC1 genomes. It will be interesting to study in future works how the data set 1 splitting strategy (unitigs or k-mers), the number of genomes of each class (A. baumannii GC1 and non-GC1) in data set 1, and the total number of A. baumannii genomes in data set 1 impact the SVM and SCM model predictions and performances. One possible approach could be using as input the binary matrix obtained from the unitigs representing data set 1 genomes generated by the program DBGWAS in the program Kover and then to analyze the SCM model results. In the same way, we could obtain k-mer profiles (k-mer sizes ranging from 31 to 127 nucleotides) from data set 1 genomes and the k-mer matrixes associated with each profile using the DSK k-mer counter (103). DSK is used by Kover to internally compute k-mer profiles from the input genomes (104). Then, we could use the k-mer matrixes as input in the SVM algorithm. As result, we would obtain new putative sequence biomarkers from both approaches, and we could compare them with the ones obtained in the current work.
In conclusion, these results suggest that the application of ML to identify biomarkers for high-risk clones or superbugs can also be used at an exploratory level of great precision since it can be useful for novel understandings related to bacterial adaptation within the nosocomial niche. In turn, these data can contribute to experimental work with the possibility of further translation to clinical settings. The SVM algorithm made genetic predictions based on the presence or absence of short genomic sequences in both A. baumannii GC1 and non-GC1 genomes. It detected a biomarker, U1, which is unrelated to lateral genetic transfer, accessory genome, or antibiotic pressures, and can uniquely identify the CG1 strains. The identification of this biomarker by the SVM algorithm, in agreement with previous experimental works on the group of Mo enzymes, showed that the application of ML could be a powerful tool to discover new therapeutic targets for the development of new antibiotic agents.
Data collection. To perform an accurate ML analysis to identify an A. baumannii GC1 biomarker, we defined two data sets to do our studies.
Data set 1 was composed of 200 A. baumannii GC1 genomes and 300 A. baumannii non-GC1 genomes obtained from GenBank and typed by MLST as previously described. These genomes were retrieved from the GenBank assembly database by filtering with "Acinetobacter baumannii" in the search-by-organism option (https://www.ncbi.nlm.nih.gov/assembly/organism/; last accessed July 2021). A. baumannii GC1 genomes included 18 genomes used in a previous study (1) and 182 A. baumannii genomes as scaffolds and contigs (Table S1). A. baumannii non-GC1 genomes included five genomes belonging to other high-risk epidemic clones such as ACICU (CP031380.1) as representative of GC2, Naval-13 (AMDR01000001.1) as representative of GC3, AB33405 (NZ_JPXZ00000000.1) as representative of local epidemic clone CC113, and both ATCC 17978 (CP018664.1) and A118 (AEOW01000000) as Identification of Acinetobacter baumannii GC1 mSystems sporadic clones. We also included 205 A. baumannii non-GC1 genomes as scaffolds and contigs (Table S1). Data set 2 was composed of 312 A. baumannii GC1 genomes and 4,487 A. baumannii non-GC1 genomes (Table S2) retrieved from the GenBank assembly database (last accessed July 2021). The number of genomes in these data sets was limited by the A. baumannii GC1 and non-GC1 genomes available in the GenBank database at the time of the query. We used data set 2 to validate using blastn searches the results obtained by the SVM model (Table S2). STs were numbered according to the Pasteur scheme for MLST (Tables S1 and S2).
Machine learning analyses. The SVM classifier is based on the maximization of the margin around the hyperplane (w T x 1 b) separating samples or instances of the different classes (56,105). Each instance i = 1, . . ., m consists of an N-dimensional feature vector x i and a class label, y i 2 11; 21 f g . The maximization of the margin corresponds to the following minimization: In this soft-margin SVM equation, j i is a penalty for misclassification or classification within the margin. Parameter C sets the weight of this penalty. The resulting weight vector w* encodes the contributions of all features to the classifier (56). b* refers the resulting bias term. The bias term shifts the hyperplane away from the origin and allows the SVM to fit a hyperplane that is not necessarily passing through the origin. j * refers to the resulting slack variable. We created a Python script using scikit-learn (106) to run the SVM algorithm. The script evaluated the classifier through a 5-fold cross-validation. In detail, the data were split into five consecutive folds (without shuffling) using the Python scikit-learn (sklearn) KFold function (https://scikit-learn.org/stable/ modules/cross_validation.html#k-fold), and five models were built. Each fold was used once as a test set, while the four remaining folds formed the training set. During each of the five iterations, hyperparameter tuning was done using a 5-fold cross-validated grid search using the GridSearchCV function implemented in the sklearn.model_selection package (https://scikit-learn.org/stable/modules/generated/sklearn .model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to find the best hyperparameters. We evaluated linear, polynomial (with a default degree of 3), radial basis function (RBF), and sigmoid kernels. We considered values between 0.01 and 100 for the penalty parameter of the error term (C) and values between 0.000001 and 10 for the gamma parameter. The predictions of all five iterations were compared using the AUC score (https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics). Finally, we built up the classifier from the entire training set using the best hyperparameters (with the highest AUC) identified through cross-validation and applied the best model to the test set. The highest AUC values were obtained using kernel = linear, gamma = 0.000001, and C = 0.01 (code available on GitHub at https://github.com/vealvarez/SVM_GC1).
The SCM (41) is a learning algorithm that produces models that are conjunctions (logical-AND) or disjunctions (logical-OR) of boolean-valued rules r: R d ! {0,1}. Let us use h(x) to denote the output of model h on genome x. When h consists of a conjunction (i.e., a logical-AND) of a set R Ã of rules r Ã , we have whereas, for a disjunction (i.e., a logical-OR) of rules, we have: given a set R of candidate rules, the SCM algorithm attempts to find a model that minimizes the empirical error rate The function I is defined as I condition ½ ¼ 1, if the condition is true; I[condition] = 0, if the condition is false, when using the smallest number of rules in R (29).
We used the program Kover, which implements the SCM algorithm (29,104). Kover combines the SCM algorithm with the k-mer representation of genomes, which reveals uncharacteristically sparse models that explicitly highlight the relationship between genomic variations and the phenotype of interest (29). We ran the program using data set 1 (see "Input data set preparation for ML models" below) for k-mer sizes from 31 to 127 nucleotides (taking only the odd numbers between them). The smallest value of k was set to 31 since extensive testing has shown that this size is optimal for bacterial genome assembly and has been employed for studies based on reference-free bacterial genome comparisons (87,107). The greatest k-mer size was set to 127 since it is the maximum value accepted as a parameter in Kover. We chose only odd values of k to avoid the formation of palindromes (108). For each k-mer size, we split data set 1 into a training data set (two-thirds of the genomes) and a testing data set (one-Identification of Acinetobacter baumannii GC1 mSystems third of the genomes). Then, we trained the 49 models corresponding to each k-mer by using a conjunction/disjunction model type. The best conjunctive and/or disjunctive model for each k-mer was selected using 5-fold cross-validation to determine the optimal rule scoring function with default parameters. Input data set preparation for ML models. We split the 500 genomes included in data set 1 into unitigs by using the program DBGWAS (62). Unitigs are stretches of DNA shared by the strains in a data set. The DBGWAS method proposes connecting the overlaps of k-mers in a compressed de Bruijn graph (DBG) so that k-mers are extended using the adjacent sequence information in the population, forming unitigs present in the same set of samples as their constituent k-mers. During the first step of the DBGWAS process, the program built a variant matrix, where each variant is a pattern of the presence/absence of unitigs in each genome present in data set 1 (62). We wrote a Python script (code available on GitHub at https://github.com/vealvarez/SVM_GC1) to format the variant matrix and create a presence/ absence (coded with the values 1 and 0, respectively) binary matrix with unitigs as columns (features) and the accession numbers of the genomes as rows (instances). We used the binary matrix as input for the SVM algorithm. In this matrix, we discarded the data about low-frequency unitigs (unitigs found in fewer than 50 A. baumannii GC1 and non-GC1 genomes). We also integrated the MLST data corresponding to data set 1 genomes and created a two-column matrix used as input for the SVM algorithm. The first column of the matrix contained the accession number of the genomes, and the second column contained a binary variable that indicated whether each genome was typed or not typed as A. baumannii GC1 (21 = A. baumannii non-GC1 genome and 1 = A. baumannii GC1 genome) according to MLST typing.
For the SCM approach, we first packaged data set 1 sequences stored in FASTA files into a Kover data set using the create from contigs command. This command also received a tab-separated value (TSV) with the data set 1 genome classification according to MLST typing (A. baumannii GC1 or non-GC1). The first column of the TSV file described the genome accession number, and the second column had the value 1 to indicate that the genome belonged to A. baumannii GC1 or the value 0 to indicate otherwise. We ran Kover for k-mer sizes from 31 to 127 nucleotides. For each k-mer size, Kover constructed a reference-free input matrix based on k-mer profiles generated with the DSK k-mer counting software. A k-mer presence/absence binary matrix based on data set 1 genomes was then created and used as input for the SCM models of each k-mer size.
Unitig selection using SVM for putative biomarker analysis. After obtaining the SVM model with the best hyperparameters, the values of the features weight vector (referred to as the hyperplane normal vector wÃ in the first equation in "Machine learning analyses" above) were accessed through the attribute sklearn.model_selection.GridSearchCV.best_estimator.coef_. The values were sorted from highest to lowest and used to decide the relevance of each unitig sequence (associated with each weight value) during the model prediction (109,110). It is worth mentioning that in our study, sequence unitigs were used as features in the models. The positive sign of a feature weight value indicates that the feature contributes to A. baumannii GC1 class prediction (represented by the value 1) and the negative sign indicates that the feature contributes to A. baumannii non-GC1 class prediction (represented by the value 21) (111). Considering the above-mentioned, 100 unitig sequences with the highest weight values were selected to be analyzed as putative biomarkers of A. baumannii GC1 genomes.
Machine learning performance metrics. The performances of the SVM and SCM models were evaluated in terms of sensitivity, specificity, accuracy, precision, and F1 score. They were defined as follows: sensitivity = TP/(TP 1 FN), specificity = TN/(TN 1 FP), accuracy = (TP 1 TN)/(TP 1 FP 1 TN 1 FN), precision = TP/(TP 1 FP), and F1 score = 2 Â ((precision Â sensitivity)/(precision 1 sensitivity)), where TP (true positives) was the number of A. baumannii GC1 strains predicted to be A. baumannii GC1, TN (true negatives) was the number of A. baumannii non-GC1 strains predicted to be A. baumannii non-GC1, FP (false positives) was the number of A. baumannii non-GC1 strains predicted to be A. baumannii GC1, and FN (false negatives) was the number of A. baumannii GC1 strains predicted to be A. baumannii non-GC1.
BLASTN searches. BLASTN searches (112) were done using data set 1 and data set 2 as subjects, and the unitigs/k-mers that contributed most to A. baumannii GC1 genome prediction according to the SVM/SCM models were used as queries. We identified whether unitigs/k-mers matched known genes or intergenic regions and provide their putative function when possible. AYE (AN CU459141.1) and ACICU (AN CP000863.1) genomes were used as references to annotate the genome location and gene product related to A. baumannii GC1 and non-GC1 genomes, respectively. In cases where the unitig was not found in the AYE genome, the AB0057 (CP001182.2) genome was used instead. Also, we counted the number of A. baumannii GC1 and non-GC1 genomes matched by each unitig. We considered a cutoff E value of E 210 , 100% of identity, and 100% of query cover. To analyze the target of the primers designed below, BLASTN searches (112) were done by using the primer sequence as the query and the nucleotide collection (nr/nt) database of GenBank or data set 1 and data set 2 as the subject. Finally, BLASTN searches (112) were also done by using the fragment of the U1 genomic biomarker of A. baumannii GC1 amplified with those primers (excluding the sequence of the primers) as the query and data set 1 and data set 2 as the subject.
Primer design. The primers to amplify the U1 genomic biomarker of A. baumannii GC1 were designed by using Oligo Primer Analysis software version 6.22 (113,114).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. FIG S1, TIF
We are grateful for the technical assistance of Gabriela Camicia and Nicolás Donis from CONICET.