In silico discovering relationship between bacteriophages and antimicrobial resistance

Abstract Bacteriophages and their potential contribution to antimicrobial resistance (AMR) have attracted growing attention in the context of medicine and pharmaceutics. A major objective of the CAMDA challenge is to acquire more knowledge about the relationship between viruses, their hosts and AMR genes in determining if AMR indeed can spread through phages. This study is focused on exploring the relationship and possible dependencies between bacteriophages and AMR based on the data collected from urban environments all over the world. The samples in the data are classified into two categories: high and low, according to the observed levels of AMR genes. The approach used in our analyses consists of several different methods which assess the differential abundance of phages, their diversity across samples, the impact on AMR categories and associations with AMR genes. The relationship between phages, their hosts and AMR is also explored by a Bayesian spatial model, using the AMR category (low vs high) as a factor. We found a higher relative risk for phages known to infect Staphylococcus aureus alone or both S. aureus and Acinetobacter baumannii in the high AMR group, which implies that these phages may have a role in the dissemination of antimicrobial genes.


Introduction
Antibacterial resistance poses a serious medical problem that will need new methods in tackling bacterial infections.One main reason for developing such resistance is the overuse of antibiotic agents in human medicine.Antibiotics are often prescribed incorrectly with no obvious benefit.In addition, lack of patient compliance during the antibiotic treatment facilitates the proliferation of drug-resistant bacterial strains.The extensive use of antibacterial agents in agriculture is another major driver of antimicrobial resistance (AMR).Selective pressure due to antibacterial agents leads to acquiring new genetic traits either by spontaneous mutations or by horizontal gene transfer.In both cases the survival rate of the bacteria will increase.
Bacteriophages (or phages) are viruses that infect and replicate within bacteria and, since their discovery in the early 1900s by Tword and d'Herelle, have become a focus of intensive research as potential treatments of pathogenic bacteria as well as possible causes for AMR.Their effect on the horizontal gene transfer via transduction and transformation can contribute to the development of AMR [1][2][3][4].The transfer involves DNA sequences such as prophages, transposons, plasmids and others.General, specialized and lateral transduction occur because of errors in the phage packaging, imprecise excision or delayed excision.However, their role in the dissemination of AMR genes (ARG) is not certain and some studies [5] point out that phages may not necessarily encode such genes.Therefore, understanding of the extent to which phages contribute to the propagation of antibacterial resistance is an important task [6].The main mechanism behind introduction of new resistance factors is transduction, which is a process of transfer of bacterial Antimicrobial resistance; bacteriophages; diversity indices; compositional data analysis; relative risk; Bayesian spatial models DNA between a bacterium infected by bacteriophage and another susceptible bacterium.
The idea of metagenomic studies is based on generation of huge datasets of short read sequences from prokaryotic cells in microbial communities [7].Antimicrobial profiles exhibit regional differences in global urban sewage [8,9] and Subway and urban environments [10].Some studies also explored the relationship between bacteriophage diversity and ARG profiles [11].Many of the existing tools for analyzing metagenomic sequences rely on the so-called k-mer approaches to extract microbial taxa from such data.Some recent methods have been developed based on deep learning approaches to find potential mutated or viral sequences [12][13][14].
The main goal of the CAMDA [15] metagenomics challenge is to explore systematically the co-evolution of the bacteria and their corresponding phages and to potentially be able to predict AMR events from samples collected in strategic monitoring areas.understanding the relationship between viruses and their hosts will be important in elucidating whether phages can spread AMR.In more details, there are four specific tasks: (1) discovery of phages and prophages in metagenomic samples with relevance to AMR; (2) establishing species that may be infected by detected phages, horizontal AMR gene transfers etc.; (3) applications and assessments of relation mining for the occurrence of phages and bacteria in the context of AMR; (4) advancement of algorithms for the identification and discovery of phages and prophages from bacterial genomes; In this work we will focus on analyzing the data and performing the first three tasks.

Materials and methods
In this work, we combine several different methods to explore the relationship between phages and ARGs distributions.We first use compositional data (CoDa) analysis [16] to find specific phages that separate samples based on their ARGs profiles.We then calculate different alpha diversity measures such as Chao1 [17], Shannon [18] and Simpson indices [19] as well as beta diversity based on Bray-Curtis dissimilarity [20].We use penalized lasso regression [21] to find best predictor models for AMR diversity based on the phage abundances.Finally, we use a Bayesian spatial model [22] to estimate relative risk of phages taking into account their host abundances.

Data
The CAMDA [15] challenge provides a set of 124 samples from Subway and urban public transport environments across all continents divided into two groups based on their ARGs profiles (62 high and 62 low AMR samples).Operational Taxonomic unit (OTu) counts of phages taxa normalized as reads per million (RpM) were extracted from the shotgun sequences of the samples using Kaiju -v1.8.2 [23] metagenomics classifier the with virus database of the National Center for Biotechnology Information (NCBI).The classifier uses protein level classification and achieves high sensitivity.The total number of non-zero abundant viruses is 6618 while the total number of non-zero abundant phages is 2687.We also extracted bacteria taxa using the proGenomes2 database [24] and summarized the counts on the level of species.For the AMR abundances we used the ARG profile data provided by CAMDA and pangea which included normalized sets of both ARGs and taxa classes.

CoDa analysis
CoDa are vectors of non-negative components showing the relative weight or importance of a set of parts in a total [25].parts cannot be interpreted isolated from the others.Standard methods such as correlation analysis, which assumes euclidean geometry applied directly to CoDa, may lead to biased results since the sample space for CoDa is simplex.One of the common approaches of analysing CoDa is to apply standard methods to log-ratio transformed data and then use inverse log-ratio transformation to return to the original space [16].Data are considered compositional if they consist of vectors with positive components whose sum is constant (e.g. 1 for proportions or 100 for percentages).More specifically, this defines the sample space as a hyperplane called Aitchison Simplex [26] Metagenomics data (e.g.OTu count data) are CoDa since the sum of all component values depends on the sampling procedure and the data are proportional [27].Taking into account this structure, we perform differential abundance analysis using compositional methods.We compare the phages abundance for high and low AMR samples to find for which phages there is a difference between the two groups.We use the R-package ALDEx2 [28], which was developed using Bayesian methods for detecting ANOVA type of differential expressions specifically for CoDa.We also perform Fisher exact test to find possible overrepresentation of families of phages taxa.

Alpha diversity with Chao1, Shannon, Simpson indices and beta diversity with Bray-Curtis dissimilarity index
Alpha diversity measures variation of microbes in a single sample, whereas beta diversity measures variation across samples.For more detailed information about the applications of diversity measures in metagenomic analysis and the methods for their calculation, see Xia et al. [29].In this work, we use both measures to characterize our data.In particular, we use three indices such as Chao1 diversity [17] index, which is defined as where S chao1 is the estimated number of species, S obs is the total number of species observed in the sample, n 1 is the number of species represented by only one individual and n 2 is the number of species represented by two individuals.Next, we use the diversity index [18], which is defined as where p i is the proportion of the community represented by the ith species and S is the total number of species.Both Chao1 and Shannon indices give more importance to less common species.Finally, we use Simpson index [19], which presents the species heterogeneity and is defined as where p i is the relative abundance of the ith species in the sample.Here Simpson's index ranges from 0 (low diversity) to almost 1.Contrary to the indices described above, Simpson index gives more weight to the more dominant species [29].
The Bray-Curtis dissimilarity index [20] is a statistic measure used to quantify the compositional dissimilarity between two different samples, based on counts at each sample [29] and is defined as follows where X ij , X ik is the number of individuals in species i in each sample (j, k) and n is the total number of species in samples.We use the above methods for estimation of species richness, namely Chao 1, Shannon-Wiener and Simpson diversity indices as implemented in the R package vegan.We perform Kruskal-Wallis test to compare the diversity measures across AMR status and geographical location.
We apply Constrained Analysis of principal Coordinates with Bray-Curtis dissimilarity (capscale from R package vegan).We also use peRMANOVA (function adonis [30] in the vegan package), which partitions sums of squares using semi-metric and metric distance matrices.It performs ANOVA-like tests of the variance in beta diversity which is explained by categorical predictors like AMR status or geographical location.The method is usually applied to large genetic data such as metagenomics OTu data.

Penalized lasso regression
In this study, we apply several types of regression models provided in R package caret [31] and we focused on lasso (Least Absolute Shrinkage and Selection) penalized regression [21,32] for high-dimensional data.This model adds a L1 type penalty term (which is estimated from data by cross-validation technique) for some variables in the model which are highly correlated.As a result, the penalty term will be able to set up some of the model coefficients exactly as zero.Thus, the lASSO is a penalized type of the model selection approach.The estimated model is more parsimonious and interpretable than the models obtained with other regression models like the ridge regression [33].The lASSO model has many applications because it is an efficient method in regression problems that arise in many scientific areas [34].The lASSO model is applied to the phages data to predict the AMR variation within the samples and subsequently to account for the two major classes of AMR -High AMR and low AMR.let us denote the AMR classes by G taking the values 1 and 2. As described before [32,35], the logistic regression model represents the class-conditional probabilities through a linear function of the predictors We split the data into a training and a testing set and use cross-validation as provided in the lasso implementation of package caret.

Bayesian spatial model to estimate relative risk of bacteriophages
As described before, spatial autocorrelation is very common when observations that are close in space have similar values [35,36].A proportion of this spatial autocorrelation is usually modelled by known covariate risk factors in a regression model.However, it is common for spatial structure to remain in the residuals after accounting for these covariate effects.These models include a set of spatially autocorrelated random effects which have a conditional autoregressive (CAR) prior.Spatial models such as Bayesian hierarchical models are then used to augment the linear predictor with a set of spatially autocorrelated random effects depending on the neighborhood structure of geographic areas.These random effects are typically represented with a CAR prior, which induces spatial autocorrelation through the adjacent structure of the areal units.
Bayesian hierarchical models is commonly used in epidemiology to model the relative risk of a disease [37].Recently, they have also been applied to metagenomics data to build networks of taxonomic units [38].We adapt the model to investigate further the relationship between phages and AMR by taking into account the geographical location and host bacteria abundance.In disease mapping the host are humans or animals, while in our context the host are bacteria and the disease is defined as a consequence of the bacteriophages effects.There are two main biological hypotheses with respect to the role of the phages in the dissemination of ARG, namely: the bacteriophages transmit the ARGs through processes such as bacterial transduction or the ARG transmission is accomplished by another biological process such as bacterial conjugation without the involvement of bacteriophages.In the second case, the phages abundance is expected to be proportional to the host abundance.In the second case, the correlation between the abundance of certain phage taxa and ARG is expected to be merely a result of the abundance their hosts, bearing the ARGs.In order to test the two hypotheses, we estimate the relative risk of phages using a Bayesian spatial model which incorporates the spatial information of the samples and controls for the host bacteria abundances.
Since the samples in the CAMDA2021 challenge have spatial information such as latitude and longitude and previous studies [10,39] indicated that metagenomics taxa profiles exhibit spatial correlations we use a Bayesian hierarchical model, in particular Besag-York-Mollie [22], which is a convolution model with a CAR prior.The model uses a spatial neighbourhood matrix W which measures the distance between the samples based on the available geographic information such as latitude and longitude.The model will estimate the phage's risk which is a posterior estimate of the SIR ratio (Observed/expected) of a phage across the samples where the expected abundances are proportional to the corresponding host bacteria taxa counts.The model includes the AMR category (low vs high) as a potential factor.If the phage's relative risk is higher than 1 and the AMR category credible set does not include zero we can conclude that the phage's abundance is higher than what we would expect based on the host bacteria counts and this higher risk can be partially explained by the antimicrobial category status.For example, one AMR group (e.g.high) has an increased risk while the other group (e.g.low) has a decreased risk relative to the expected numbers which are proportional to the phage's host bacteria.The corresponding host bacteria for each phage were queried from https://www.genome.jp/virushostdb/.In case where phages infect more than one bacteria we include the additional host bacteria abundance in the model.For a description of the model in more strict mathematical terms, see lawson [37].
Here ν are spatially unstructured random effects that assume normal distribution while u are the random effects that capture the spatial autocorrelation b e t w e e n t h e s a m p l e s .
T h e n V .We use W D 1 (Normalized Distance) which is the neighboring matrix between samples latitude and longitude positions.The response is assumed to follow poisson distribution and it accounts for overdispersion Var( ) ( ) O E O > and this is an advantage over the pure poisson model.We use the Bayesian setting implementation in R 3.6.3package CARBayes [40], where inference is based on Markov chain Monte Carlo simulation (MCMC) [35,41].The model is fit with the function S.CARbym from the above package.Moran's I test [42] was used to measure the spatial autocorrelation.The model convergence is also checked by Geweke z-scores [43].We further calculate the correlations between phages and specific ARGs.

Results
From the applied CoDa analysis, we obtained 136 phages which differentiate the two categories: High AMR and low AMR levels as it was set up by the sum of relative ARGs counts (pangea.orgwithin the CAMDA challenge).One hundred and fourteen of them had significantly higher abundance (by FDR) in the High AMR group, while 22 had significantly lower abundance (by FDR) in the same High AMR group.The two heatmaps in Figure 1 show the distribution of those phages across the continents.The abundance levels were pooled by phages families and the Myoviridae family was over-represented in the group of differentially abundant phages.
Further, we calculated the correlations between more detailed levels of phages (e.g.species) and ARGs.There were two major bi-clusters of strongly positive correlations (>0.8).Among the top correlated genes in the larger bi-cluster were arcB, parE, rpoB and tufAB, which were correlated mostly with the Myoviridae phage family.The other cluster involved genes such as blaZ, ermX, mphC, tetK, vga and the corresponding correlated phages, mostly from the Siphoviridae family.The phages in one of the clusters were negatively correlated with the ARGs from the other cluster.
Alpha diversity measures (Chao1, Shannon and Simpson) were compared between the groups of High and low AMR samples and also across the continents.All indices showed different diversity distributions across the samples according to their AMR status with Kruskal-Wallis test p-values <0.001 for Chao1 and Shannon indices and p = 0.05 for Simpson index.pairwise Kruskal-Wallis test was performed to compare differences in alpha diversity measures across continents.Significant differences adjusted for multiple comparisons were found between AF(943) and eu(1209), AF(943) and SA(1356), eu(1209) and OC(685), NA(1165) and OC(685) and between OC(685) and SA(1356) for Chao1 index; AF(5.24) and NA(5.6),AF(5.24) and SA (5.8) and between OC(5.19) and SA(5.8) for Shannon index; AF(0.987) and SA(0.99) and eu(0.988) and SA(0.99) for Simpson index.SA had the highest diversity measures while OC had the lowest Chao1 and Shannon indices.Constrained analysis of principal Coordinates using Bray Dissimilarity index for phages (Figure 2) showed clear differentiation between the High and low AMR samples (Figure 2a) and also within the continents (Figure 2b).From these analyses, the profiles of phages from Africa and Oceania were more distinct than the others and their samples were separate from the other continents.We performed peRmutational MANOVA in R with the function adonis2 using Bray-Curtis dissimilarity and 1000 permutations.The results were significant (p < 0.001) for both the effect of the AMR group and the continent.
Our next analysis step used lasso regression in the context of machine learning (R package caret) to select phages that can differentiate the two AMR categories and can be used as predictors for the AMR status.The model achieved high sensitivity (0.9) and specificity (0.95).The top features with non-zero coefficients were mainly from the two most frequent family classes Myoviridae (Escherichia phage RCS47, Synechococcus phage ACG-2014d, Cyanophage Syn30, Synechococcus phage S-SM2, Psychrobacter phage pOW20-A, Acinetobacter phage Ac42) and Siphoviridae (Skunavirus, Bacillus virus SPbeta, Enterobacteria phage YYZ-2008, A. phage Bphi-B1251, Sitaravirus, Enterobacteria phage cdtI) as well as Autographiviridae (Pseudoalteromonas phage RIO-1) and Podoviridae (Escherichia phage TL-2011b).
Bayesian spatial analysis was used in the study mainly to estimate the relative risk of phage abundance compared to their host abundance.We used as examples two bacteria (Salmonella enterica and Staphylococcus aureus) which are recently studied as part of generalized transduction process [44] and their corresponding phages are found in the virus-host database https://www.genome.jp/virushostdb/.Figure 3 shows the relative risk (i.e.adjusted for the host bacteria abundance).AMR category was used as a covariate to test the effect on the phage abundance.Staphylococcus aureus-related phages showed differences among AMR category with the credible set not containing zero and average relative risk higher in the High AMR group (Figure 3a).Salmonella enterica showed similar profiles across continents, where were also not different among the AMR groups with the credible set containing zero (Figure 3c).The higher observed relative risk can also be due to additional host bacteria abundance, which is not accounted for, since phages often affect similar bacteria.
For example, some phages that infect S. aureus also infect A. baumannii.Adding this host will reduce the relative risk but still the average relative risk for the high AMR group will be about 4 times greater than the one for the low AMR group with the creditable set not containing zero (Figure 3b).Due to the relatively small size with available spatial information especially for europe, Oceania and Middle east we need to reconfirm the findings with additional samples.This model will be very useful when we have closely related municipalities and cities so we can track how the phages evolve and if their relative risk changes across borders.This is very similar to the disease mapping model but in this setting the virus is a bacteriophage and the human host are bacteria.Here, to improve the accuracy, we can use several factors as potential covariates: AMR category, continents, cities, climate conditions and other relevant information.
Here we calculated the correlations between phages (on the level of species and families) and ARGs.There were two major bi-clusters between phages and ARGs of strongly positive correlations (Figure 4).Among the top correlated genes in the larger bi-cluster were arcB, parE, rpoB and tufAB correlated mostly with the Myoviridae phage family.The other bi-cluster involved genes such as blaZ, ermX, mphC, tetK, vga and the corresponding correlated phages, mostly from the Siphoviridae family.The phages in one of the bi-clusters were negatively correlated with the ARGs from the other bi-cluster.We also observed strong positive correlations between phages related to S. aureus and the ARGs tetK, blaZ, ermC, which confirms the transduction related events for this bacterial species listed in Hassan et al. [45].

Discussion
The role of bacteriophages in the horizontal spread of ARG has been investigated by many authors with controversial results.enault et al. [5] found that ARGs are rarely encoded in phages' genomes.On the other hand, Rodr 1 guez-Rubio et al. [46] found that armA into phage capsids was up to 10000 times more frequent than in large plasmids.A significant correlation between phages and ARGs was observed [8], indicating that phages may play a role in ARG dissemination.Due to a low abundance of the phages identified, further studies are needed to establish the relationship between phages and ARGs.Some of the findings indicate that phages in sewage environment are widely as free-living phages not always associated with bacteria and persisting longer than their host.In our work, we confirm those findings by observing strong correlations between phages and ARGs as well as increased relative risk (higher than expected phage presence in some samples) which is correlated to the AMR category.
The goal of our study was to examine the relationship between bacteriophages and AMR.We developed a set of different and interlinked approaches to answer this question.Based on the compositional differential abundance analysis, we found that phages which differentiate the AMR category and are highly correlated with the ARGs are overrepresented by the Myoviridae family.Beta diversity measures differentiate between the AMR categories and the continents.Samples from South America have the highest alpha diversity indices and Oceania is among the continents with lowest alpha diversity indices.It is possible to make very good predictions of the AMR categories based on phages abundance using lasso regression.
In our analysis, we used a Bayesian hierarchical model to take into account the geographical location of the samples and estimate the relative risk of the phages by incorporating information about their host bacteria.The bacteriophages can transmit the ARGs through bacterial transduction or the ARG transmission can occur through bacterial conjugation without the involvement of bacteriophages.In the second case, the phages abundance is expected to be proportional to the host abundance.Bayesian spatial analysis can help us to test the hypotheses about the role of the phages in the dissemination of ARG by examining how the phages evolved across spatially correlated samples using both phages and hosts abundances.This model will be very useful when we have closely related municipalities and cities so we can track how the phages evolve and if their relative risk changes across borders.To improve the accuracy of the model, we can further use several additional factors as potential covariates: climate conditions and antibiotic usage.Such analogous approaches are adapted in disease mapping model [37] where the virus is a bacteriophage and the corresponding host are bacteria.

Conclusions
The CoDa analysis identified phages showing different abundances between high and low AMR groups.The majority of these phages belonged to the Myoviridae family.
Diversity measures showed significantly different diversity distribution across samples, with South America and Oceania having the highest and lowest diversity indices, respectively.using machine learning, we identified phages that could be as predictors of AMR status: the identified taxa belonged mainly to the Myoviridae and Siphoviridae families.
The spatial Bayesian model showed a higher relative risk of phage abundances in the high AMR group when controlling for host abundance, which indicates that phages may play a role in ARG dissemination.
In future, we plan to continue our work by: using larger databases for viruses and phages within CAMDA and MetaSuB projects.This will allow to use more samples (e.g.equal number of low and high AMR samples within each continent) and to extend the model by including interactions between AMR category and geographical locations.Such new extension of the work will provide an opportunity to include more relevant covariates in the Bayesian model and to develop additional ways to define the expected (e) phage abundance.

Figure 1 .
Figure 1.Scaled and not scaled log transformed Rpm (reads per million) top differentially abundant phages (FDR, Welchs t test with Benjamini-hochberg < 0.05) pooled by the corresponding families.the phages in the family Myoviridae are overrepresented (p < 0.01) in the set that differentiated high and low amR samples.aF africa, aS asia, eu europe, na north america, me middle east, oc oceania, Sa South america.

Figure 2 .
Figure 2. Distributions of phages.Distribution across category (a), high amR (red) and low amR (green); and distribution across continents (b), aF africa, aS asia, eu europe, na north america, me middle east, oc oceania, Sa South america.

Figure 3 .
Figure 3. Relative risk (log scale) plots based on Bayesian hierarchical model for the pulled abundance for all phages that are associated with Staphylococcus aureus (a), Staphylococcus aureus and Acinetobacter baumannii (b) and Salmonella enterica(c).here, the SiR ratio is observed phage abundance while expected is proportional to their host abundance.aF africa, aS asia, eu europe, na north america, me middle east, oc oceania, Sa South america.

Figure 4 .
Figure 4. correlations among phages and aRgs.the two clusters are between phages from Myoviridae family and genes such as acrB, parE, rpoB and tufAB and Siphoviridae and genes such as blaZ, ermX, mphC, tetK, vga.