Utilizing co-abundances of antimicrobial resistance genes to identify potential co-selection in the resistome

ABSTRACT The rapid spread of antimicrobial resistance (AMR) is a threat to global health, and the nature of co-occurring antimicrobial resistance genes (ARGs) may cause collateral AMR effects once antimicrobial agents are used. Therefore, it is essential to identify which pairs of ARGs co-occur. Given the wealth of next-generation sequencing data available in public repositories, we have investigated the correlation between ARG abundances in a collection of 214,095 metagenomic data sets. Using more than 6.76∙108 read fragments aligned to acquired ARGs to infer pairwise correlation coefficients, we found that more ARGs correlated with each other in human and animal sampling origins than in soil and water environments. Furthermore, we argued that the correlations could serve as risk profiles of resistance co-occurring to critically important antimicrobials (CIAs). Using these profiles, we found evidence of several ARGs conferring resistance for CIAs being co-abundant, such as tetracycline ARGs correlating with most other forms of resistance. In conclusion, this study highlights the important ARG players indirectly involved in shaping the resistomes of various environments that can serve as monitoring targets in AMR surveillance programs. IMPORTANCE Understanding the collateral effects happening in a resistome can reveal previously unknown links between antimicrobial resistance genes (ARGs). Through the analysis of pairwise ARG abundances in 214K metagenomic samples, we observed that the co-abundance is highly dependent on the environmental context and argue that these correlations can be used to show the risk of co-selection occurring in different settings.


Introduction
Antimicrobial resistance (AMR) has become a global health threat, with ramifications not only for modern medicine but also for agriculture and environmental health [1][2][3] .It is widely acknowledged that the misuse of antimicrobials has accelerated the dissemination and prevalence of antimicrobial resistance genes (ARGs) 4 .Most attempts to reduce the burden of AMR have focused on reducing the use of single classes of antimicrobial agents considered of critical importance 5 .Despite considerable efforts in various settings, such as livestock, these regulations have not significantly reduced the spread of ARGs 6,7 .It has been shown that after banning the use of specific antimicrobials, resistance to that antibiotic will still be prevalent in the environment 8 .Indirect co-occurrence of ARGs has also been shown to happen under selective antibiotic pressure 7,[9][10][11] .We have also recently observed that changes in the selective pressure of even a single antimicrobial agent influence several ARGs in pig metagenomes 11 .However, there is also evidence that adding mobile ARGs to a microbiome increased its stability, whereas if the ARGs were encoded in the chromosome, the stability decreased 12 .The origin of the interplay between the microbiome and resistome, however, remains largely unresolved, as there are many ecological, functional, and evolutionary properties unaccounted for.It is unclear whether the co-selection of ARGs happens due to genetic linkage or inter-species interactions.
Studies focused on studying the link between antimicrobial usages (AMUs) and the prevalence of AMR have typically only focused on the effects of using one antimicrobial agent and the impact on developing resistance to that agent, ignoring the overall effect on the abundance of ARGs in the environment.This effect on the resistome must be understood better to prevent these collateral damage effects from happening 13,14 .Cooccurrence of microbes has been evaluated in soil 15,16 and marine environments 17,18 , whereas ARG co-occurrences have been studied in sewage sludge 19 , freshwater 20 , marine 21 , swine 8 , and cattle 22 .However, most of these studies have only been evaluated on a smaller scale and not always using the same methods.There is currently a large collection of nextgeneration sequencing datasets from metagenomic samples available in public repositories, providing an excellent resource to quantify the prevalence of ARGs by analyzing the read abundances [23][24][25][26] .
In this study, we have analyzed the co-abundance of sequencing reads aligned to acquired ARGs to assess how resistance to one specific antimicrobial agent is linked to the abundance of another class of antimicrobial.With a collection of 214,095 metagenomic datasets from public repositories 27 , we examined the correlation between pairwise ARG read abundances with a compositional approach 28,29 using SparCC (Sparse Correlations for Compositional data) 30 .Our results demonstrated that many ARG pairs interact but are highly specific to the environment.We believe that these interactions provide another way to study how ARGs are being co-selected independently of the microbial context, and the findings can be used to design targeted interventions to limit the spread of AMR.

Data collection and pre-processing
We have previously described in detail the process of downloading and analyzing 214,095 metagenomic samples 25,27 , but in brief: We downloaded raw sequencing reads corresponding to 442 Tbp from metagenomic samples deposited in the European Nucleotide Archive 31 that were uploaded between 2010-01-01 and 2020-01-01 and had at least 100,000 reads and were shotgun sequenced.The raw sequencing reads were quality-checked with FASTQC v.0.11.15 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and trimmed with BBduk2 36.49 32.The trimmed reads were then globally aligned using KMA 33

Homology-based clustering of ResFinder sequences
With the possibility of detecting up to (3085,2) = 3085!(2!(3085−2)!)= 4,757,070 pairwise interactions between the 3,085 ARGs of ResFinder, we decided to cluster similar reference sequences together to reduce the number of possible combinations.Using USEARCH 36 11.0.667 and 90% nucleotide identity, we did a homology-based clustering that grouped the ARGs into a total of 716 groups.For each of these groups, the read counts for the genes in that group were first adjusted by the length of the gene and then aggregated together.These new group counts are used in the correlation analysis described in the next section.For the rest of the manuscript, we refer to these 716 ARG clusters as simply ARGs.

Calculating relative abundances
For a category label, we calculated the relative abundance of fragment counts assigned to different genes or classes as: where  is the label,   is the count of read fragments assigned to gene , and  = 100 is a scaling constant.

Inferring pairwise correlations with SparCC
The SparCC algorithm 30 was used to obtain correlations using pairs of log-ratio transformed ARG read counts to infer linear Pearson correlations.SparCC obtains linear Pearson correlations and p-values through an iterative approach that adjusts for spurious correlations and lowers the false discovery rate.The ARG-ARG correlations were inferred as the average over 50 iterations, and one-sided pseudo p-values were obtained through a bootstrapping procedure of 100 rounds.In each bootstrapping round, the input count matrix was shuffled, and correlations were averaged over 10 iterations to infer one-sided pvalues to test whether the correlation for the observed data was statistically significant.
Correlations values ≥ 0.6 with p-values ≤ 0.01 were selected for further analysis.We implemented SparCC to run on GPUs on the Danish National Supercomputer for Life Sciences (https://www.computerome.dk).
We ran SparCC on the entire dataset of the 214K metagenomic samples and subsets of samples grouped by sampling host and environment, where at least 800 samples existed.
Due to inconsistent labeling of the sampling sources, we made new source groups, as shown in Table 1.We only consider genes for SparCC analysis that are present in at least 10 samples with a minimum read fragment count of 50.In total, 11 different correlation matrices were made, i.e., one for each of the source groups listed in Table 1.The correlation networks were visualized in R 4.1.0 37with packages igraph 38 , qgraph 39 , and ggraph 40 using the Fruchterman-Reingold layout algorithm 41 .

Network comparisons
The topology of the different networks is described using different metrics: the number of nodes () and edges (), the global clustering coefficient, network density, edge density, and the number of components.The global clustering coefficient, or the graph transitivity, measures the density of node triplets in the network 42 .The network density is calculated as 2/( − 1) as given in Parente et al. (2018) 43 .Edge density is the number of edges over the number of possible edges 44 .The average correlation between ARGs of two antimicrobial classes was calculated using Fisher's z-transformation on the correlation values, averaging the z-scores and converting it back to a correlation score with the inverse Fisher transformation 45 .

Data and code availability
The matrix of raw ARG read counts is available at https://doi.org/10.5281/zenodo.6519843 .
The code used to run the analysis and create figures and the output files of SparCC are available at https://github.com/hmmartiny/global_resistome_correlations.Classifications of antimicrobial importance were retrieved from the 6 th revision of critically important antimicrobials for human medicine from https://www.who.int/publications/i/item/9789241515528,accessed 2022-10-10.

Results
This study investigated the correlation of pairwise ARG abundances across a set of 214,095 metagenomic sequencing datasets.This collection represents a highly diverse set of publicly available metagenomic samples collected between 2000 and 2020 from locations all over the world, where the sequencing datasets consist of short-read reads produced on Illumina platforms (Figure S1).We observed that not all samples contained sequence read fragments aligned to ARGs, so we only included those that did in the correlation analyses (n = 119,206; Table 1).Looking at the various sampling sources, read fragments aligned to ARGs conferring resistance to tetracycline and beta-lactams were very common in hostassociated sources, whereas phenicol resistance was more frequently observed in environmental sources (freshwater, marine, and soil) (Table 2).catA1 was the most dominant gene in the environmental samples, especially in marine samples, whereas various tet genes had high relative abundances in livestock or human samples: tet(W) in chicken and pig samples; tet(Q) in cow, human, and pig metagenomes.blaTEM-52B accounted for more than 30% of the read fragments assigned to ARGs in air metagenomes (Figure S2).

Balancing the sparsity and network complexity
Initially, we inferred correlations with SparCC on all samples that had at least one read fragment aligned to an ARG.We observed that most ARGs were found to correlate with each other in all environments, even in cases where there was only one read fragment per reference (Table S1).Across all samples, we observed template hits ranging from 1 to 252 with a median of 20 and fragment counts spanning 1 to 4.510 7 , with a median of 880 fragment counts.To balance the sparsity and network complexity, we introduced a filtering step.This step determined whether an ARG should be included or excluded based on a minimum number of fragment counts and a minimum number of samples supporting hits to that ARG template.Among the 716 ARGs considered, we noticed that without any filters, SparCC indicated correlations even with very low read fragment counts.Therefore, we required that for an ARG to be included, it had to have a minimum fragment count of 50 across at least 10 samples (Table S1).

Analysis of large-scale metagenomic correlation networks
Based on our filter settings, we constructed a global network using the correlation coefficients for the entire collection of metagenomes, with each node representing an ARG and each edge representing a pairwise ARG connection (correlation ≥ 0.6, p-value < 0.01, Figure S3).The global network, nicknamed 'all', contained 225 ARGs connected through 2,344 correlation edges (Figure 1a).As this all-network was hard to interpret due to the many highly interconnected ARGs, we also inferred pairwise ARG correlations in specific sampling groups (Table 1).The genes that were part of these networks were found to correlate with varying degrees of strength (Figure 1a-b).For example, the human network contained many correlation coefficients, but most were less than 0.8 (Figure 1c).Another example is the marine network, where only a few ARGs were found to correlate, but with values above 0.9 (Figure 1b-c).Despite the networks reflecting the composition of the various environments, we still observed overlaps between which ARGs were found to correlate.One example is that all the correlations inferred from the pig metagenomes also existed in the human metagenomic network (Figure 1d).
There was a limited number of correlations for ARGs encoding resistance to fluoroquinolones, steroid antibiotics (fusidic acid), colistin, and rifampicin.On the other hand, beta-lactam, tetracycline, and aminoglycoside ARGs had many correlations with each other and with other classes (Figure S4, Figure S6).Despite resistance to some antimicrobial classes being the most abundant, the ARGs did not always correlate to many others.For example, tetracycline ARGs were the most abundant in dog metagenomes, but no correlations were inferred for these ARGs.Similarly, in the human samples where aminoglycoside and beta-lactam ARGs were less abundant than tetracycline ARGs, aminoglycoside and beta-lactam resistance genes had a higher number of correlation coefficients were reported (Table 2, Figure S4).
On the level of ARG abundance, we observed that just because an ARG was highly abundant in a sampling group, it did not automatically mean that it correlated to many other ARGs.
The highly abundant catA1 gene in marine (97.4% of all ARG reads), freshwater (55.1%), and soil samples (84.7%) (Figure S2) did only correlate with one or two other genes in the water environmental networks, and none in the soil network.On the other hand, catA1 did seem to be correlated with 15 other genes in the pig correlation network despite not being highly abundant in that group of samples (Figure S5a).mef(A)_1 accounted for 15.9% of the reads aligned to ARGs in cow samples and 6.63% in pigs (Figure S2) and was also strongly correlating with other genes, which mainly conferred resistance to aminoglycosides, (fluoro)quinolones and tetracyclines (Figure S5b).tet(L)_4 only accounted for 4.01% of the read fragments aligned to ARGs in metagenomes of chicken origins (Figure S2) but was shown to correlate in its abundance with the abundance of 8 other ARGs, e.g., with a correlation of 0.77 with lnu(A)_1 (Figure S5c).

The hidden signals between ARGs profile the potential risk of co-selection in different environmental contexts
Using the correlations between ARGs in the various environments (Figure 1), we calculated the average correlation between ARGs of different antimicrobial classes (Figure S6).These average correlations can then serve as profiles to assess the risk of indirectly selecting ARGs that confer resistance to different ARGs through co-and cross-resistance.These risk profiles can then be used to judge the strength of interactions upon using one antimicrobial in each setting.Upon constructing these profiles, we observed that the strength and the number of correlations highly depend on the antimicrobial classes and the environmental context (Figure S6).We hypothesize that if an important (IA) or highly important antimicrobial (HIA) is used, first, the resistance to the antimicrobial class will likely flourish, and, secondly, through co-and cross-resistance, so will ARGs conferring resistance to other classes, including those that are CIA.
Figure 2 shows two risk profiles for ARG correlations for glycopeptide and macrolide, two highly critically important antimicrobials.Correlations between ARGs of glycopeptide resistance to other resistance classes were much rarer than those connected with macrolide ARGs (Figure S6) and those correlations that were observed were relatively low (correlation<0.8,Figure 2).Different vancomycin resistance cassettes were responsible in different environments, namely VanHAX, VanC2, and VanX_bc in human samples and VanHDX and VanC1XY in pig samples (Figure S7a).VanHAX only has one correlation, whereas the remaining five correlate with many different ARGs.On the contrary, ARGs conferring resistance to macrolide were much more interactive with other classes of resistance genes in all networks and had strong correlations with specific classes (correlation > 0.9, Figure 2, Figure S6).The macrolide ARGs co-abundant with other ARGs were many but separated into distinct network clusters (Figure S7b).For example, mef(A) and msr(D) were usually found together in different environments, both in small and large clusters.
While Figure 2 highlights how CIA ARGs interact, it is just as important to investigate what ARGs of less critically important antimicrobials correlate with.As seen in Figure 3, ARGs for pleuromutilin resistance (IA) and for tetracycline resistance (HIA) were found to interact with many other classes, including those that are critically important.For example, pleuromutilin ARGs are few but well connected (Figure 3, Figure S6), as seen with the connections with lsa(E) and cfr(C) (Figure S8a).Tetracycline ARGs correlated with the abundance of multiple ARGs, such as those conferring resistance to lincosamides, macrolides, and phenicols (Figure 3).While there were many ARGs for tetracycline resistance, they often correlated to the same ARGs (Figure S8b).

Discussion
Considering the complexity of microbiomes, studying how microbial composition shapes the distribution of ARGs is a challenging task but one that could shed light on how ARGs indirectly select one another.However, with the high-throughput sequencing technologies and many metagenomic datasets available in public repositories, it is now much more feasible to extract the patterns of how ARGs co-occur without knowing their microbial origin.Using our recently published collection of 214K metagenomic datasets 27 , we have inferred correlations of pairwise ARG abundances to profile which types of resistance influence the shape of resistome and which genes are the key players.To the best of our knowledge, this is the first study to relate ARG abundances on such a large and broad scale.
Our correlation networks revealed that not all ARGs are connected, as the pairwise ARG interactions were largely shaped by the composition of the environmental resistome.We observed cases where a highly abundant ARG had no correlations and the opposite where a sparse ARG had many correlations, highlighted by our effort to apply appropriate filters to ensure robust correlations (Table S1).However, by filtering on the individual abundance of ARGs in each resistome, there is a risk of removing important links between ARGs, especially in environments with very low abundant ARGs.We continued with settings that ensured that sparse ARGs in complex environments were removed but that ARGs in overall sparse environments could still be included, such as for the air resistome.More sensitive filtering could be applied to further improve the robustness of the inferred ARGs for each resistome.
The network showing correlations for all metagenomic samples was the most complex, which we speculate is due to both the wide variety of sampling sources and the overrepresentation of human metagenomes (Table 1).We found the differences in the animal and environmental networks much more interesting, as they reflect the dynamics of ARG abundances in more localized settings (Figure 1).On the level of which antimicrobial class an ARG confers resistance to, we could also observe that some resistance classes had more and stronger correlations than others (Figure S6).We found several cases of strongly correlated ARG pairs that have also been reported in other studies.These cases were often due to the ARGs being present in the same microbial genome.For example, the gene cassette vanHAX has been found together with msr(C) and aac (6') in genomes of human isolates 46 and mef(A) linked with tet(O) 47 and mdf(A) with blaTEM, aph (6), sul2, and, tet(A) 48 (Figure S7).
As highlighted in Figure 2 and Figure 3, we argue that the correlations can serve as a way to profile the collateral effects occurring in a resistome when the abundance of different ARG mechanisms changes.Antimicrobials have been classified differently to reflect their importance to human medicine, of which glycopeptides and macrolides are CIA with the highest priorities.Glycopeptide and macrolide resistance have previously been linked genetically 7,9 in pigs, where we also can report the presence of correlations between ARGs of glycopeptide and macrolide resistances not only in pigs but also in human and mouse environments (Figure 2).Pleuromutilin and tetracycline antimicrobials are classified as being less critically important 49 , but SparCC reported many correlations of ARGs for these two classes in various networks (Figure 3), which suggests that there are still risks associated with the use of these two.Tetracycline resistance has been found to occur together with resistance to macrolides [50][51][52] , aminoglycosides 50,53 , folate pathway antagonists 54,55 , lincosamide 11 , and beta-lactams 53 , to name a few studies.This high connectivity of tetracycline ARGs seems in line with our results, as this specific group of ARGs was connected to almost all classes of antimicrobials in most of our networks (Figure 3, Figure S8b).The variety of connections to antimicrobials of less importance to human health should be more in focus, as our results show that there are risks of critical antimicrobial resistances emerging from the enrichment of less essential resistance genes.
Following this line of thought of focusing on ARGs that give resistance to less critically important antimicrobials, a recent study by Tarek and Garner (2022) 56 proposed to create a monitoring framework based on isolated components, or clusters, in correlation networks.
They constructed a correlation network encompassing ARG abundances in samples from wastewater treatment plants and argued that representative ARG members from each cluster in their network should be monitoring targets.Our networks suggest that limiting to only a few ARGs would fail to capture the complete picture in some environments, such as human microbiomes (Figure 1).Instead, we propose to limit the set of monitoring targets to only what happens to ARG abundances during exposure to one type of antimicrobials In order to use correlation networks for surveillance of AMR, more work is needed to confirm that the observed interactions do indeed exist in nature 57 .We have defined interactions as being indirect since SparCC does not indicate which way the interaction occurred.One way to investigate this could be to investigate physical linkages or by using other analytical methods.For example, to determine whether vanHAX influences msr(C) or the other way around (Figure S7a), the SPIEC-EASI 58 method could be used to infer such directional dependences.A directional correlation could be included in a risk profile.
Our choice of only studying the co-abundances of ARGs does not capture the causality and directionality of the interactions.We have not disentangled whether the interactions are due to co-selection or cross-selection, and neither investigating genetic linkages, co-exposure, and the presence of other genetic elements such as mobile genetic elements.
Understanding some of these factors could be achieved by expanding our analysis with bacterial read counts and counts of mobile genetic elements.We have speculated that the resulting correlations could reflect the community's response to external stressors, such as different usages of antimicrobials.While there has been work on linking the ARG prevalence with AMUs in various settings 7,9,11 , incorporating AMU values, other non-antimicrobial stressors 59 and microbial signatures 12 in our analysis would likely help us better understand the microbiomes' response to selective pressures.
By utilizing the wealth of information on ARG abundances available in a collection of 214K metagenomic datasets, we have studied the co-abundance of ARGs to discover how these interactions shape the prevalence of resistances in different environments.The inferred correlation networks provide insights into how two resistance types indirectly and speciesindependent select for each other in different habitats.Our results further highlight that there are instances of genes of one type of resistance often co-occurring with many other types of resistance and that the environmental context plays an important role, revealing them as important targets in surveillance programs to limit their impact on global health.
(2022).The relative abundance is the number of read fragments aligned to the group out of all fragments aligned to ARGs.For each resistance class, the parenthesis shows the number of ARGs belonging to that category.

Figure 3 ,
FigureS7Figure S8).If a monitoring system is implemented, it would need to be updated regularly to show the changes in antimicrobial usage and ARG cooccurrences since the correlations we have inferred in this study only reflect the current and past abundances.

Figure 1 :Figure 2 :
Figure 1: The resistome networks consisting of correlations (edges) between pairs of ARGs (nodes).a.Each correlation network is visualized, where each ARG node is colored by resistance class, and edges are colored by the correlation coefficient value.b.Metrics of the correlation networks reveal how interconnected and complex the networks are.No. stands for number of.c.Distribution of included correlation coefficients in each network.Figure S3 shows the distribution of all inferred correlations and their p-values.d.A heatmap showing how similar the content of the two networks is.In the upper half of the heatmap,

Figure 3 :
Figure 3: Correlation profiles for ARGs conferring resistance to important pleuromutilins (left) and highly important tetracyclines (right).Each column shows the average correlation from, e.g., tetracycline ARGs to ARGs for other antimicrobial classes.The circle is colored by the average correlation, where a white circle indicates no statistically significant correlations of ARGs observed between the two antimicrobial classes.

Figure S1 :
Figure S1: Metagenomic origins colored by the sampling group.a. Overview of sampling year for metagenomic samples colored by their sampling origin.100 samples were taken before 2000, and 84,238 did not have a valid sampling date.b.Overview of the amount of sequencing reads available for each sampling origin.c.Sampling locations of metagenomic samples used in the correlation analysis were split by their sampling source.Number of samples with no coordinates available; All: 83,361; Air: 16; Dog: 3,159; Chicken: 570; Cow: 262; Freshwater: 61; Human: 40,003; Marine: 363; Mouse: 2,842; Pig: 320; Soil: 477.The 'Other' label refers to those that are not in one of the source-specific networks but are included in the 'All' network.

Figure S2 :
Figure S2: Top 10 most abundant ARGs per sampling group.A barplot showing the ARGs having the highest 10 relative abundances per sampling source colored by resistance class.

Figure S3 :
Figure S3: Distribution of correlations and p-values produced by SparCC for each data grouping.Each point is a correlationbetween two ARGs and is colored by whether the point was selected if the p-value < 0.01 and correlation ≥ 0.6.

Figure S4 :
Figure S4: The relative abundance, number of ARGs, and the number of correlations for each resistance class in each sampling group.The relative abundance in green shows the percentage of read fragments for each class.In grey is the number of ARGs for each resistance class.Orange coloring indicates the number of correlation coefficients inferred for ARGs in a resistance class.

Figure S5 :
Figure S5: Networks illustrating how one ARG interacts differently depending on the sampling environments.The three genes catA1_1, mef(A)_1, and tet(L)_4 are selected to illustrate thatthe genes correlate more or less with other genes in their abundance in some habitats.

Figure S6 :
Figure S6: The average correlation between ARGs of different resistance classes in the upper triangle of the heatmaps with the lower half shows the correlation coefficients between the two classes.Note that a correlation coefficient between two ARGs might be present in more than onetile, as some ARGs confer resistance to multiple classes of antimicrobials.

Figure S7 :
Figure S7: Highlights of interactions between ARGs of a. glycopeptide resistance and b. macrolide resistance.The ARG node size shows whether the corresponding ARG gives resistance to the antimicrobial class in focus.Only correlation edges and ARG nodes are colored if they correlate with the highlighted ARGs; otherwise, they are colored grey.The coloring schemes for nodes and edges are given in Figure 1a.

Figure S8 :
Figure S8: Highlights of interactions between ARGs of a. pleuromutilin resistance and b. tetracycline resistance.The ARG node size shows whether the corresponding ARG gives resistance to the antimicrobial class in focus.Only correlation edges and ARG nodes are colored if they correlate with the highlighted ARGs; otherwise, they are colored grey.The coloring schemes for nodes and edges are given in Figure 1a.

Table 1 :
Grouped labels for hosts and environments.The parenthesis after each sampling source label denotes the number of samples assigned to that label if the group consisted of multiple labels.

Table 2 :
Relative abundance of read fragments aligned to each resistance class (row) for each sampling source (column).