Predicting Antimicrobial Resistance Using Partial Genome Alignments

ABSTRACT Antimicrobial resistance (AMR) is an important global health threat that impacts millions of people worldwide each year. Developing methods that can detect and predict AMR phenotypes can help to mitigate the spread of AMR by informing clinical decision making and appropriate mitigation strategies. Many bioinformatic methods have been developed for predicting AMR phenotypes from whole-genome sequences and AMR genes, but recent studies have indicated that predictions can be made from incomplete genome sequence data. In order to more systematically understand this, we built random forest-based machine learning classifiers for predicting susceptible and resistant phenotypes for Klebsiella pneumoniae (1,640 strains), Mycobacterium tuberculosis (2,497 strains), and Salmonella enterica (1,981 strains). We started by building models from alignments that were based on a reference chromosome for each species. We then subsampled each chromosomal alignment and built models for the resulting subalignments, finding that very small regions, representing approximately 0.1 to 0.2% of the chromosome, are predictive. In K. pneumoniae, M. tuberculosis, and S. enterica, the subalignments are able to predict multiple AMR phenotypes with at least 70% accuracy, even though most do not encode an AMR-related function. We used these models to identify regions of the chromosome with high and low predictive signals. Finally, subalignments that retain high accuracy across larger phylogenetic distances were examined in greater detail, revealing genes and intergenic regions with potential links to AMR, virulence, transport, and survival under stress conditions. IMPORTANCE Antimicrobial resistance causes thousands of deaths annually worldwide. Understanding the regions of the genome that are involved in antimicrobial resistance is important for developing mitigation strategies and preventing transmission. Machine learning models are capable of predicting antimicrobial resistance phenotypes from bacterial genome sequence data by identifying resistance genes, mutations, and other correlated features. They are also capable of implicating regions of the genome that have not been previously characterized as being involved in resistance. In this study, we generated global chromosomal alignments for Klebsiella pneumoniae, Mycobacterium tuberculosis, and Salmonella enterica and systematically searched them for small conserved regions of the genome that enable the prediction of antimicrobial resistance phenotypes. In addition to known antimicrobial resistance genes, this analysis identified genes involved in virulence and transport functions, as well as many genes with no previous implication in antimicrobial resistance.

A ntimicrobial resistance (AMR) threatens global health by preventing effective treatments against bacteria, parasites, viruses, and eukaryotic pathogens. AMR usually arises as a natural consequence of genetic alterations; however, misuse or overuse of antimicrobials accelerates the selection of resistant variants. In 2019, the United States Centers for Disease Control and Prevention reported that antimicrobial-resistant infections occur annually in over 2.8 million people in the United States, with at least 35,000 deaths (1), and the European Centre for Disease Prevention and Control (ECDC) reported 33,000 deaths due to antimicrobial-resistant infections in the European Union and the European Economic Area each year (2). In addition to causing disease and mortality, AMR also causes major economic burdens to health care systems because of longer hospital stays, additional tests, and use of more expensive drugs (3).
In bacteria, AMR mechanisms can be grouped in three general categories, i.e., intrinsic, acquired, and adaptive. Intrinsic resistance involves all of the inherent features of a microorganism that prevent antibiotic effects, such as outer membranes with low permeability in some of the Gram-negative bacteria (4). Acquired resistance refers to the acquisition of mutations or genes on mobilizable elements such as plasmids, transposons, integrons, or to the transformation of naked DNA (4,5). Adaptive resistance is triggered by environmental factors, such as antibiotic or nutrient stress, and can be revertible. Alterations in gene expressions and duplications of existing genes are known consequences of adaptive resistance (4,5). These acquired or genetically encoded mutations and cell responses can cause resistance through enzymatic deactivation of antibiotics, reduction in the amount of antibiotic in the cells by efflux mechanisms, and modifications of the cell surface that make the cell less permeable (4).
To provide effective treatments and prevent rapid AMR spread, it is essential to know the resistance phenotypes of the pathogen (6). In the clinic, this is usually done using traditional antimicrobial susceptibility testing (AST), in which a bacterial culture is subjected to various antibiotics (7). As increasing numbers of genome sequences paired with AST phenotypes have become available, several studies have published machine learning (ML) models for predicting AMR phenotypes based on sequence data (8). Different ML algorithms have been used with good effect for making predictions; these include adaptive boosting (9), random forest (10)(11)(12), extreme gradient boosting (13,14), set-covering machines (15), support vector machines (16,17), and neural networks (10,18). The choice of features also differs, and studies have been published where the authors used the AMR genes (10,16,19), k-mers based on the whole-genome sequence (9,13,14), whole-genome alignments (11), and alignments of the entire pangenome (17,20). Two previous studies have also shown that the phylogeny of the isolates can provide a predictive signal (18,21).
Although many studies have shown that ML methods can yield accurate models, the robustness of these models is highly dependent on the quality of the training set. For instance, training set size, phylogenetic diversity, and the various drug susceptibility testing methods can impact model accuracies (22)(23)(24). Ideally, ML models should be built from data sets that are balanced by class (e.g., resistant versus susceptible), and the number of samples should be greater than the number of features (24,25). In practice, large and well-balanced training sets of genomes paired with antimicrobial susceptibility test data can be difficult to obtain.
In previous work, Nguyen et al. demonstrated that AMR phenotypes could be predicted using sets of conserved genes that are held in common among the members of a species, even if they are not known to be involved in AMR (26). The reasons relatively high accuracies are observed in the non-AMR genes are not entirely clear. Based on previous studies, it is likely that phylogeny plays a role in providing a predictive signal in these core gene models (18,21). However, the authors also showed several cocorrelating virulence factor genes with high feature importance values, and other ML studies have been used to identify previously unrecognized epistatic mutations in the genome (17). It is also possible that some core genes could have a previously unrecognized function relating to AMR.
Knowing which conserved parts of the genome provide signal for AMR prediction models could potentially help to enhance our understanding of AMR mechanisms and the compensatory changes that result from the acquisition of resistance. In this study, we generate AMR prediction models from short chromosomal subalignments and use them to systematically identify regions of the genome with predictive power and potential links to AMR.

RESULTS
Chromosomal subalignments can be used to predict AMR phenotypes. In this study, we wanted to gain a more systematic understanding of the conserved chromosomal regions that are useful for predicting AMR phenotypes. To do this, we selected three species, Klebsiella pneumoniae, Salmonella enterica, and Mycobacterium tuberculosis, which at the time of writing, all had rather large collections of AST data paired with whole-genome sequences (see Fig. S1 to S3 in the supplemental material).
The collection of genomes and the AST data were downloaded from the Pathosystems Resource Integration Center (PATRIC) ( Table 1; see also Tables S1A to S1C in the supplemental material) (27), and the contigs of each genome were aligned against the chromosomal sequence of a high-quality reference genome. This resulted in a single global nucleotide alignment for the chromosomes of each species. The global chromosomal alignment was then used to build a matrix for machine learning. Susceptibility or resistance to each antibiotic was predicted by building a random forest classifier (28) for each species and antibiotic. For K. pneumoniae, the average area under the receiver operating characteristic curve (AUC) of the test set genomes in the 5-fold cross-validation is 0.878 6 0.023 (Table 1; see also Fig. S4 in the supplemental material). The AUC values of each antibiotic range from 0.733 6 0.044 for cefepime to 0.974 6 0.006 for levofloxacin. For M. tuberculosis, the average AUC for all antibiotics is 0.778 6 0.028, and the values for individual antibiotics range from 0.676 6 0.026 for streptomycin to 0.833 6 0.013 for rifampin. For S. enterica, the AUC for all antibiotics is 0.804 6 0.029, with individual values ranging from 0.740 6 0.027 for gentamicin to 0.841 6 0.033 for sulfisoxazole. When the Salmonella models are recomputed using the chromosome of Salmonella enterica serovar Typhi CT18 as the reference, we also observe nearly identical results (see Table S1D in the supplemental material), indicating that the choice of the reference chromosomes has little impact. Overall, these AUCs and corresponding model statistics, including F1 scores and error rates (see Tables S1E to S1G in the supplemental material), are consistent with previous studies that have used k-mers and AMR genes as input (14,15,26).
In order to assess how much sequence data are required to make an accurate AMR phenotype prediction, we randomly sampled smaller regions of each chromosomal alignment to generate subalignments of various sizes and built models for each subalignment using the same algorithm and parameters. For all three species, the AUC increases as the subalignment length increases (Fig. 1). This effect is most pronounced in M. tuberculosis, perhaps because of the high similarity between strains of this species (Fig. S2). The accuracies, F1 scores, and Matthews correlation coefficients (MCCs) follow the same trend as the AUCs, and error rates trend downward as alignment length increases (Tables S1E to S1G). These results indicate that small conserved chromosomal regions of only a few thousand bases in length contain a predictive signal that can be identified by the machine learning algorithm.
Chromosomal regions yielding high-and low-accuracy models. In order to understand which regions of the chromosome provide high-and low-accuracy AMR predictions, we plotted the AUCs of the subalignment-based models based on their alignment positions. To do this, we chose a subalignment size of 5 kb for K. pneumoniae and S. enterica, and an alignment size of 10 kb for M. tuberculosis. The longer subalignment size was chosen for M. tuberculosis since the increase in accuracy starts to become less dramatic at 10 kb (Fig. 1).
When the AUCs for each small subalignment-based model are plotted based on their chromosomal positions, we observe a consistent pattern of relatively high AUCs across the reference chromosome (Fig. 2), although the average AUCs are generally 5 to 12% lower than those predicted by the corresponding model based on the whole chromosomal alignment (Table 1 and Fig. S4). For each species, there are several peaks that are higher than the background. For example, three dramatic peaks at the approximate positions 770,000, 2,160,000, and 4,250,000 in M. tuberculosis correspond with subalignments containing the rpoB, katG, ermA, and ermB AMR genes (see Fig. S5 and Table S1H in the supplemental material). The rpoB, katG, ermA, and ermB genes encode proteins that can confer resistance to rifamycin, isoniazid, and macrolide antibiotics, respectively (29)(30)(31).
On the other hand, we also observe several valleys where subalignment models fail to predict AMR phenotypes. For example, these include the approximate positions 1,310,000, 3,440,000, 4,035,000, and 4,535,000 for K. pneumoniae. These valleys often correspond with the locations of mobile elements ( Fig. S5 and Table S1I in the supplemental material). Indeed, when we plot the alignment conservation for each column, many of the regions with poor alignment conservation corresponded with poor accuracies in the subalignment models (Fig. 2). Overall, these results indicate fairly stable AUCs over the reference chromosome, with higher-than-average predictive power for regions containing chromosomally encoded AMR genes and lower than average AUCs in regions of poor conservation.
Sequence similarity influences subalignment model performance. The subalignment-based machine learning models can detect signal in most conserved regions of the chromosome, even though many of these small regions do not contain annotated AMR genes (Fig. 2). Since the machine learning models learn from nucleotide similarity that has been observed across samples in the training set, some of the high accuracies are likely due to similar sequences occurring in both the training and testing sets, which may obscure the more broadly conserved nucleotide signatures relating to, or correlating with, resistance. This has been observed previously, and several studies have tried to balance strains based on phylogeny to reduce this effect in the ML models (10,11,21,26). In essence, having related genomes in the training and testing sets can improve accuracies, but may also potentially reduce the generalizability of the models if the overall phylogenetic distribution of the training set is biased.
To observe how strain similarity affects the predictions, we clustered each strain using the nucleotide k-mer similarity from the whole chromosomal alignment and evaluated the subalignment-based model performances by changing the clustering threshold. All strains were used in the models; however, strains that were members of the same cluster were restricted to either the testing or the training set in each fold of the 5-fold cross-validation. The similarity thresholds that were used differed depending on the diversity of the sequences for each species. For K. pneumoniae and S. enterica, clustering was evaluated from 99% to 75% k-mer similarity, and from 99% to 95% for M. tuberculosis (see Fig. S6 in the supplemental material).
As expected, when the clustering thresholds decrease in k-mer similarity, the model performances also begin to decrease (Fig. 3). This happens because as the clusters become larger and more inclusive, the genetic distance between strains in the training and testing sets are increasing. Going from no clustering to clustering with the lowest similarity threshold, the average AUCs drop approximately 10% for K. pneumoniae and S. enterica and approximately 2% for M. tuberculosis. This trend is also observed when the analysis is repeated using a k-mer similarity that is based on the individual sequences of each subalignment, rather than the entire chromosomal alignment (data not shown). These results indicate that the random forest models can learn the underlying phylogeny and use this information to aid the phenotype prediction.
As the clustering becomes more inclusive, the decrease in performance is not uniform across species and antibiotics (see Fig. S7 in the supplemental material). For instance, in K. pneumoniae, ciprofloxacin and levofloxacin have the largest number of subalignments with AUCs of .0.80 at 75% clustering, for M. tuberculosis pyrazinamide  Table S1P in the supplemental material. A separate set of 5-fold cross-validated models was computed for each subalignment and antibiotic.
Predicting AMR Using Partial Genome Alignments retains the largest number of informative subalignments at 95% clustering, and for S. enterica, chloramphenicol and streptomycin have the largest number of informative subalignments at 95% clustering (Fig. 4). The higher accuracies for certain antibiotics, such as ciprofloxacin and levofloxacin in K. pneumoniae, may be due to chromosomally encoded AMR genes, epigenetic effects, or the time at which the AMR gene or mutation became fixed in the population. For instance, previous work in Neisseria has demonstrated larger epistatic effects relating to ciprofloxacin relative to those related to other antibiotics (32).
Some subalignments contain broadly conserved AMR signals. When similar strains are prevented from occurring in both the training and testing sets, we observe an expected decrease in subalignment model performances. However, we also observe large standard deviations in the average AUCs for the subalignment models, even at the lower similarity thresholds (Fig. 3). This implies that some of the subalignments that retain high AUCs contain sequence signatures that are conserved and are being learned even when the strains are less closely related. We reasoned that these subalignments are likely to contain sequence signatures that are more phylogenetically widespread within the species and could be regions that cooccur with AMR or that are involved with AMR-related functions. We chose to examine these highly performing subalignments in greater detail.
To attempt to distinguish the subalignments containing genes encoding proteins with well-characterized functions from those that are poorly characterized, we plotted the subalignments with AUCs of .0.80 at each clustering threshold based on their  Fig. S5 and Tables S1H and S1I in the supplemental material. (Bottom) The alignment conservation for each whole chromosomal alignment. The y axis depicts the percentage of sequences with a nucleotide in each column, and the x axis depicts chromosomal alignment position. chromosomal alignment positions (see Fig. S8 in the supplemental material). We searched within each subalignment for genes that are known to encode functions that are involved in AMR, virulence, or membrane transport. This was done by comparing the genes to the PATRIC (33), Comprehensive Antibiotic Resistance Database (CARD) (34), Virulence Factor Database (VFDB) (35), National Database of Antibiotic Resistant Organisms (NDARO) (36), and Transporter Classification Database (TCDB) (37) resources. We chose to look for virulence factors and transporter genes because these functions often correlate or cooccur with AMR (38)(39)(40). The subalignments were then colored in the plot based on whether they contain one of these genes ( Fig. 5 and Fig. S8). Overall, there is an enrichment in AMR, virulence factor, and transporter genes in this set of subalignments (see Tables S1J to S1L in the supplemental material). There are also many subalignments that do not have an annotated AMR, virulence, or transporter gene (see Tables S1M to S1O in the supplemental material). M. tuberculosis has the fewest highly predictive subalignments, and most of these have AMR or virulence factor matches. In all three species, there are cases where subalignments are predictive for several antibiotics, possibly indicating a protein function relating to a class of antibiotics. The informative subalignments do not tend to cluster based on their coordinates in the reference chromosome and instead appear to be spread out over the reference chromosome.
Many of the subalignments that retain high AUCs in spite of the clustering do not contain annotated AMR, virulence factor, or transporter genes (Tables S1M to S1O). These regions could have a previously unrecognized role in AMR or may be important cocorrelates with AMR in the evolution of resistant strains. For example, in both K. pneumoniae and S. enterica, we observed high-AUC subalignment models that contained the gene encoding the chaperone protein HscA. HscA is a member of the Hsp70 family and is known for its role in maturation of iron-sulfur-cluster-containing proteins (41,42). Although hscA is not known as an AMR gene, a study tracking the acquisition of resistance mutations in Salmonella has documented single-nucleotide polymorphisms (SNPs) in hscA (43), so a role in AMR could be plausible. In comparison, DnaK, which is another Hsp70, has a role in survival under unfavorable conditions such as exposure to oxidative stress, heavy metals, and antibiotics (44). Additionally, we detected many AMR-related mutations in metabolic genes, including in genes encoding sulfur acceptor protein, cysteine desulfurase, and 3-mercaptopyruvate sulfur transferase. Cysteine desulfurase is known for its role in protection from oxidative stress, which might also be important for protecting the bacterium from antimicrobial stress (45). Recently, Collins and colleagues detected metabolic mutations that confer AMR for streptomycin, ciprofloxacin, and carbenicillin in pathogenic bacteria using in vitro analyses. Overall, our analysis identifies 10 of the 17 metabolic genes originally found by Collins et al., including cycA, gltB, mltA, rsxC, sucA, dppA, yqiK, bcsC, mdfA, and rpoB (46). Taken together, these detected genes point to a relationship between stress conditions and AMR and provide potential targets for further phenotypic characterization at the bench.

DISCUSSION
In this study, we built models for predicting AMR phenotypes for K. pneumoniae, M. tuberculosis, and S. enterica using short chromosomal subalignments that were 5 to 10 kb in length (approximately 0.1 to 0.2% of the length of the reference chromosomes). Overall, these models have average AUCs that are 5 to 12% lower than the AUCs for models that are based on the whole chromosomal alignments. As the subalignment length increases, the performance of the models also tends to increase. These results FIG 4 The number of subalignments and corresponding to a given model AUC for each antibiotic. Genomes were clustered based on similarity, and samples belonging to the same cluster were restricted to either the testing or the training sets in each fold of the 5-fold cross-validation. A clustering threshold of 75% k-mer similarity is shown for K. pneumoniae, and a clustering threshold of 95% is shown M. tuberculosis and S. enterica. Results for additional thresholds are shown in Fig. S7 in the supplemental material.
are consistent with previous work by Nguyen et al. that showed that AMR phenotypes can be predicted using small sets of conserved genes (26). Using the short subalignments, we plotted the model performances, based on their coordinates on the reference chromosome, with a fairly high degree of resolution. In all three organisms, the data show relatively consistent AUCs across the chromosome, with specific regions yielding high-and low-performing models. For instance, in K. pneumoniae and S. enterica, we observed that the poorly performing subalignments included insertions, such as phage and mobile elements. These also tended to correspond with areas of low alignment conservation. In M. tuberculosis, most of the regions with dramatically high prediction performances contain known AMR genes, such as those encoding DNAdirected RNA polymerase beta subunit (rpoB), catalase-peroxidase (katG), and integral membrane indolylacetylinositol arabinosyltransferase (embA, embB, and embC), which FIG 5 Protein-encoding gene functions for subalignments with high AMR prediction accuracies. Each panel includes predictive subalignments with AUCs of .0.8 after clustering the strains at a given k-mer similarity threshold. Points represent each subalignment, and they are plotted based on the corresponding position on the reference chromosome. Subalignments are colored according to hits for AMR (green), virulence (blue), and transporter genes (orange), respectively. If a subalignment contains multiple gene categories, the color will appear as a mixture. Subalignments that do not produce hits in any of the known AMR-related genes are colored in pink. Results for high-scoring subalignments at other clustering thresholds are shown in Fig. S8 in the supplemental material.
Predicting AMR Using Partial Genome Alignments confer resistance to rifampin, isoniazid, and macrolides, respectively (47)(48)(49). Although it is unsurprising that the subalignments containing the AMR genes have high model performances, we were surprised by the large number of subalignments with residually high AUCs, many of which contained no annotated functions relating to AMR.
In order to understand how genome similarity influences the accuracy of these models, we clustered the strains based on the k-mer similarity using the whole chromosomal alignments. We then prevented similar sequences from existing in both the training and testing sets for each fold of the 5-fold cross-validation. The AUCs dropped gradually as the clustering became more inclusive, indicating that the models rely on sequence similarity. This is consistent with previous studies that have performed similar experiments to control for the effects of related strains in ML models (10,11,21,26). However, we also found that there were many subalignments that retained high AUCs in spite of the clustering, which indicated the presence of sequence signatures that were more strongly conserved relative to their background similarity. These regions were enriched in AMR, virulence, and transport-related functions. Although the cooccurrence between AMR, virulence factors, and mutations in transporters is well known (38)(39)(40), we also observed several regions that did not have an obvious role in AMR, virulence, or transport, which merit further analysis.
The ability to predict AMR phenotypes over the genome clearly differs by antibiotic, with the subalignment models for some antibiotics having better overall performances than those of others. This can be seen in the color banding pattern of the plots in Fig. 2. For example, in K. pneumoniae, the AUCs for the 5-kb amikacin models were only 3% lower than those for the whole chromosomal alignment-based model, whereas for trimethoprim-sulfamethoxazole, there was a 20% drop. This difference can also be seen in the subalignments that remained predictive after k-mer similarity was used to prevent similar sequences from occurring in the training and testing sets. For example, for K. pneumoniae, a large number of the subalignments that were accurate after this filtering step had high AUCs for predicting ciprofloxacin and levofloxacin phenotypes, followed by those for amikacin and tobramycin. On the other hand, only two predictive subalignments were obtained for trimethoprim-sulfamethoxazole at 99% k-mer similarity. Previous studies have shown that certain antibiotics may have a larger epistatic impact on the genome than others (50), which might result in the signal that is being detected by the models. However, we note that filtering the subalignments based on sequence similarity may not have entirely eliminated other potentially nonrandom effects, such as biased strain sampling or linkage disequilibrium, which could also cause this difference (51,52).
Using short chromosomal subalignments to predict AMR phenotypes presented a few drawbacks that are worth noting. First, we used a clustering approach that was based on the k-mer similarity of the whole chromosomal alignment rather than on the k-mer similarities of each individual subalignment. When we clustered based on subalignment similarity, we obtained similar results, but balancing the sets became more difficult due to a lack of diversity in many subalignments. Another drawback of using an alignment-based modeling approach is that it has the potential to result in very large matrix files as the diversity of the training sequences increases. This could eventually limit the number of strains or species that could be used in an alignment-based model, although we did not encounter this problem in this study. In comparison, the size of a k-mer-based matrix is usually scoped to the number of possible k-mers of a given length. Overall, the benefit of using alignment-based models is that they clearly preserve feature importance down to the nucleotide position in the alignment.
The results of this study suggest that a complete genome sequence may not be necessary for predicting AMR phenotypes. However, the antibiotic, subalignment size, and the region of the genome sequence used in the model have clear impacts on the accuracies that can be obtained. This may eventually inform the development of bioinformatic workflows that can make predictions using incomplete genome sequence data. For instance, using read mapping against a known region of a reference genome sequence could be a potentially fast and appealing way to predict AMR in the incomplete genomes that are often found in metagenomic samples, although there would be potential drawbacks, including the reduced accuracy of partial genome models and the need to ensure that similar strains can be adequately differentiated. Nevertheless, the approach outlined in this study provides a potential way forward, and AMR may be only one of many phenotypes that might be predicted in this way.
In conclusion, this study offers a unique approach for identifying AMR-predictive regions in bacterial chromosomes and may eventually provide a means for understanding and predicting the chromosomal changes that accompany the evolution of resistance.

MATERIALS AND METHODS
Data sets. Three bacterial species were selected for this study, K. pneumoniae, M. tuberculosis, and S. enterica (nontyphoidal serovars). Genomes for 1,664 K. pneumoniae, 16,906 M. tuberculosis, and 5,268 S. enterica isolates were downloaded from the Pathosystems Resource Integration Center (PATRIC) in October 2019 (27,33). Metadata information, including susceptibility and resistance determinations (48) and MICs, was downloaded from the PATRIC FTP site (ftp://ftp.patricbrc.org/RELEASE_NOTES/PATRIC _genomes_AMR.txt). MIC data were converted into sensitive (S) and resistant (R) phenotypes for K. pneumoniae, using Clinical and Laboratory Standards Institute (53) guidelines (53). For M. tuberculosis, interpretations were based on World Health Organization (WHO)-defined critical concentrations from the original studies (54,55). For S. enterica, CLSI and United States Food and Drug Administration (56) guidelines were used to interpret MIC values (56).
The large number of M. tuberculosis and S. enterica isolates were downsampled to ;2,500 and ;2,000 diverse genomes, respectively, to keep the number of genomes per antibiotic relatively consistent and to reduce computational overhead. Downsampling was performed using whole-genome k-mer distances using overlapping 8-mer oligonucleotides, as in Nguyen et al. (14). KMC (version 2.3.0) (57) was used to compute k-mers, and clustering was performed using the cluster.AgglomerativeClustering function from the Python Scikit-learn library (version 0.19.2) (58) by setting affinity to "l1" and linkage to "complete." The list of isolates and AMR phenotypes is provided in Tables S1A to S1C in the supplemental material. Due to low counts, the intermediate phenotypes were not modeled in this study. Poor-quality genomes were removed from the analysis based on low chromosomal alignment coverage (,50%), defined as having extreme genome lengths (two times longer than the reference genome or shorter than the half of the reference genome) or poor genome quality scores based on the PATRIC genome quality tool (59).
Alignment generation. The assembled genomes for each isolate were globally aligned to the chromosome of a relevant high-quality reference genome. In this case, we chose Klebsiella pneumoniae subsp. pneumoniae HS11286 (PATRIC identified [ID] 1125630.4; GenBank accession number CP003200.1), Mycobacterium tuberculosis H37Rv (PATRIC ID 83332.12; GenBank accession no. AL123456.3), Salmonella enterica subsp. enterica serovar Typhimurium strain LT2 (PATRIC ID 99287.12; GenBank accession number AE006468.2). Reference genomes were based on the NCBI reference genome collection (60). Global alignments were generated using KMA (version 1.2.0) (61) with the following parameters: "-dense -ref_fsa -ca -mem_mode -mrs 0 -Mt1 1 -e 1.0." The KMA tool produces output files that include the mapping information, alignment statistics, and the consensus alignment against the template. The "dense" option prevents the insertion of gaps into the consensus sequence. This preserves contiguity of the reference chromosome and prevents the subsequent matrix files from becoming sparse.
Subalignments were generated by randomly sampling continuous regions of the global chromosomal alignment by choosing an arbitrary starting point. When experiments used many subalignments, the total length of the sampled subalignments was not allowed to exceed the length of the reference chromosome, to help prevent oversampling of the same regions. In practice, this means that a larger number of short subalignments than long subalignments could be generated (see Table S1P in the supplemental material).
Model generation and cross-validation. For each species, a matrix was generated from either the global chromosomal alignment or the subaligned regions where the rows represent each isolate, the columns represent the positions of the alignment, and the entries are DNA nucleotides for each position. To decrease the matrix size, alignment columns having no variation were removed from consideration, since they provide no discriminative information. We used one-hot encoding to convert nucleotide sequence data into combinations of one(s) and zero(s) as described by Aytan-Aktug et al. (10). The utils. to_categorical function in the Python Keras library (version 2.2.4) (62) was used to perform the one-hot encoding.
Unless otherwise stated, the models generated in this study are binary classifiers that predict susceptible or resistant phenotypes for a single antibiotic. If genomes had intermediate or unknown phenotypes for a given antibiotic, they were excluded from the training, testing, and validation sets for the corresponding model. Random forest (63) was chosen for the machine learning algorithm in this study because it is a robust tree-based method that is relatively easy to interpret and has been used with good effect in in many studies for regression and classification purposes (10)(11)(12)28). The Python Scikitlearn package (version 0.19.2) ensemble.RandomForestClassifier was used for generating the models (58). Unless otherwise stated, we chose to use the random forest parameters defined previously (10), which set the number of trees to 200, and used default settings for the remaining parameters. Additionally, the class weight was set to "balanced" in this study. This adjusted the class weights according to the class Predicting AMR Using Partial Genome Alignments frequencies in the input data and was intended to help prevent biased predictions caused by class imbalances. The chosen parameters are intended to be optimal for the majority of the models, although it may be possible to find a more ideal set of parameters for any given subalignment.
Standard 5-fold cross-validations were performed for each model using a training, a testing, and a held-out validation set to monitor models for overfitting. Unless otherwise stated, the average test performances are reported as area under the receiver operating characteristic curve (AUC) values with a standard deviation. In experiments where a subalignment was sampled multiple times, we report the average of all tests with a standard deviation. Model performances were assessed using the AUC, macro F1 scores, and Matthews correlation coefficient (MCC) using the Scikit-learn package (version 0.19.2) (58). Major error (ME) and very major error (VME) rates were also used as metrics. Major errors are defined as susceptible isolates that are classified as being resistant, and very major errors are resistant isolates that are classified as being susceptible.
Clustering subalignments. Machine learning models can learn AMR phenotypes based on the genome similarity or correlations between training, test, and validation subsets (21,26). To explore how genome similarity effects the machine learning predictions, we clustered the aligned sequences using the KMA index (version 1.3.7) (61). KMA uses the 16-mer oligonucleotide k-mers and the Hobohm-1 algorithm to generate clusters (63). Whole alignments that are similar to each other within a certain similarity threshold (between 75 and 99% k-mer similarity) were clustered, and the corresponding subalignments were kept in the same partition of the 5-fold cross-validation. Thus, the isolates sharing similarity greater than the given threshold were only used in either the training or testing sets. To help prevent biased predictions due to imbalances between cluster sizes, each subalignment was weighted in inverse proportion to the cluster size. No hold-out set was used in this analysis in order to maximize the number of clusters that could be tested.
The genes encoded within subalignments that had high model performances were evaluated for a potential role in AMR. These subalignments were aligned to the PATRIC (33), CARD (34), and NDARO (36) databases to identify AMR genes, VFDB (35) and Victors (64) to identify virulence factors, and TCDB (37) to identify transporter genes. Furthermore, to explore the precise AMR-related subalignment positions, we calculated feature contributions to the AMR predictions using random forest's feature importance implementation. The most informative 10 features were considered per fold.
Tree generation. Distance matrices were computed for the whole chromosomal alignments using KMA (version 1.3.7) using the k-mer distance option. KMA calculates input distances using 16-mers and accepts inputs in single indexed file. Input sequences were indexed using "-NI," "-Sparse TG," and "-nbp" parameters. Distance trees were constructed using CCPhylo (version 0.0.15) (65), which generates a neighbor-joining tree, and visualized using iTOL (version 4) (66).
Data availability. The Python 2.7.15 scripts used for this project are available on Bitbucket (https:// bitbucket.org/deaytan/aligned-fragments/). All genomes and metadata can be accessed through the PATRIC resource (https://www.patricbrc.org) using the genome identifiers given in Tables S1A to S1C.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.