Antimicrobial resistance determinants are associated with Staphylococcus aureus bacteraemia and adaptation to the healthcare environment: a bacterial genome-wide association study

Staphylococcus aureus is a major bacterial pathogen in humans, and a dominant cause of severe bloodstream infections. Globally, antimicrobial resistance (AMR) in S. aureus remains challenging. While human risk factors for infection have been defined, contradictory evidence exists for the role of bacterial genomic variation in S. aureus disease. To investigate the contribution of bacterial lineage and genomic variation to the development of bloodstream infection, we undertook a genome-wide association study comparing bacteria from 1017 individuals with bacteraemia to 984 adults with asymptomatic S. aureus nasal carriage. Within 984 carriage isolates, we also compared healthcare-associated (HA) carriage with community-associated (CA) carriage. All major global lineages were represented in both bacteraemia and carriage, with no evidence for different infection rates. However, kmers tagging trimethoprim resistance-conferring mutation F99Y in dfrB were significantly associated with bacteraemia-vs-carriage (P=10-8.9-10-9.3). Pooling variation within genes, bacteraemia-vs-carriage was associated with the presence of mecA (HMP=10-5.3) as well as the presence of SCCmec (HMP=10-4.4). Among S. aureus carriers, no lineages were associated with HA-vs-CA carriage. However, we found a novel signal of HA-vs-CA carriage in the foldase protein prsA, where kmers representing conserved sequence allele were associated with CA carriage (P=10-7.1-10-19.4), while in gyrA, a ciprofloxacin resistance-conferring mutation, L84S, was associated with HA carriage (P=10-7.2). In an extensive study of S. aureus bacteraemia and nasal carriage in the UK, we found strong evidence that all S. aureus lineages are equally capable of causing bloodstream infection, and of being carried in the healthcare environment. Genomic variation in the foldase protein prsA is a novel genomic marker of healthcare origin in S. aureus but was not associated with bacteraemia. AMR determinants were associated with both bacteraemia and healthcare-associated carriage, suggesting that AMR increases the propensity not only to survive in healthcare environments, but also to cause invasive disease.


DATA SUMMARY
New data -submitted to NCBI SRA accession number PRJNA690682.
All sequenced bacterial isolates are deposited in Short Read Archive. Nine isolates included as part of another study are found under accession number PRJNA369475 (https://www. ncbi. nlm. nih. gov/ bioproject/? term= PRJNA369475) and sequences of the remaining isolates under accession number PRJNA690682 (https://www. ncbi. nlm. nih. gov/ bioproject/? term= PRJNA690682). A complete listing of sequences is contained in supplementary material (Table S5, available in the online version of this article).
The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.

BACKGROUND
Staphylococcus aureus is a common coloniser of human mucosal surfaces and skin but also a major human pathogen [1,2]. It is a leading cause of hospital and community acquired infection and one of the leading causes of bloodstream infection worldwide [1][2][3][4]. Over 12000 cases occur each year in England, and the rate continues to increase, despite improving control of methicillin-resistant S. aureus (MRSA) bacteraemia [5]. Mortality following S. aureus bacteraemia (SAB) has not declined over recent years, and remains generally over 20 % at 30 days, even with appropriate antimicrobial therapy [6][7][8].
S. aureus possesses diverse and variable virulence mechanisms facilitating tissue invasion, inflammation and evasion of host immune factors. These include a thick peptidoglycan wall, polysaccharide capsule, toxins [9,10], complement control proteins [11], and bound adhesins [12]. With the exception of specific toxinoses such as toxic shock syndrome [13][14][15], and the role of Panton-Valentine leucocidin (PVL) in skin or soft tissue infections [16] and pyomyositis [17], evidence linking bacterial genetic variability to clinical disease phenotype is inconclusive.
The majority of evidence accrues from case-control studies using candidate gene approaches or microarrays to examine gene presence or absence. Such studies have implicated several specific genes encoding putative virulence factors in invasive S. aureus disease. Secreted enterotoxins [18][19][20][21], haemolysins [18] and leucotoxins [22]; surface proteins which mediate tissue attachment, invasion and immune evasion [18,22], the presence of the intercellular adhesin locus [18,23] and variation in the accessory gene regulator (agr) system [22] have all been shown to co-occur with invasive S. aureus. However, the evidence for these associations is inconsistent, and for every study reporting an association, there is at least one large study that shows no evidence of effect [18,24].
The fact that a few S. aureus lineages account for the majority of S. aureus infections suggest important inter-lineage differences in virulence but again evidence is conflicting for a relationship between CC and invasive disease [22,[24][25][26][27][28][29].
These discrepant results may reflect the variable sensitivity of probes employed, or the inconsistent methods used to control for effects of population structure. In particular, associations cannot be reliably inferred without considering linkage disequilibrium between candidate genes and potential virulence factors elsewhere in the genome. Different genomic associations have been identified for community-associated (CA) or healthcare-associated (HA) SAB [30]. Loss of function mutations in agr, a central controller of S. aureus expression, have been frequently found in HA bloodstream infections [31]. Reduced cytotoxicity and low agr expression were also independent predictors of mortality in a study of nosocomial MRSA pneumonia [32]. There is a direct relationship between mecA expression and agr dysfunction: the altered penicillin binding protein (PBP) expressed by some MRSA can directly reduce agr-mediated toxin production [33]. This dysfunction may be a fitness 'cost' , overcome by the relative advantage of antimicrobial resistance in healthcare settings, and the relatively lower host defences found in hospitalised patients. However, evidence shows MRSA clones traditionally conceived of as HA lineagessuch as CC-22 -are equally capable of transmission in the community, including household settings [34], questioning the notion that MRSA requires the healthcare setting to gain a relative advantage.

Impact Statement
Bacteria like Staphylococcus aureus are part of our normal flora, carried by one in three adults, and are also found in serious, life-threatening disease like bloodstream infection (bacteraemia). Many studies have examined whether any part of the genetic code of S. aureus might increase the chances of these serious infections, and the results of these have been conflicting. In particular lineages (i.e. related strains), the presence of some genes and genetic mutations have all been reported as possibly increasing the risk, but these reports conflict. This may partly be because the population structure of S. aureus makes it difficult to disentangle true associations from false. Here we compare 2001 S. aureus from bacteraemia and carriage, using methods which account for those challenges. We find all lineages are equally likely to cause bacteraemia, but some genes and mutations encoding antimicrobial resistance (AMR) are found more often in bacteraemia, while distinct AMR mutation factors are associated with healthcare-associated carriage. Our study settles persisting questions about the propensity of differing lineages to cause bloodstream infections. It further raises the suggestion that AMR may predispose to serious infection, a possibility which should increase the urgency of our global fight against the growing threat of AMR.
There is growing evidence that the relationship between toxin production and virulence in S. aureus is not straightforward. While superantigen toxins and leukocidins have been linked to certain disease phenotypes, genomic changes associated with reduced bacterial toxicity may actually enhance bacterial survival in the bloodstream, evidenced by lower lymphocyte toxicity and greater fitness in human serum exhibited by bacteraemia isolates compared to those found in nasal carriage or soft tissue infection [35]. agr defective strains have been found in association with persistent bacteraemia [36,37], and associated with higher mortality [38]. Naturally occurring loss of function mutations in the regulatory protein repressor of surface proteins (Rsp) have been documented arising within-host and in bloodstream infections [39]. These mutations were associated with attenuated mortality in a mouse model of disease, but preserved the ability to disseminate and form abscesses [40], and have been shown to alter survival in blood and antibiotic tolerance [41]. A comprehensive survey of withinhost evolution of S. aureus infection demonstrated evidence of bacterial genomic adaption, with protein altering variation in regulatory genes, and the cell surface proteins under control of key regulators [42]. Similar signals of adaptation were found in the genetic changes associated with prolonged bacteraemia [43].
Thus, while conflicting evidence for the role of gene presence in bacteraemia-vs-carriage arises from case-control studies, there are observations supporting the hypothesis that subtle genetic variation -including that of a type traditionally thought to diminish virulence -could increase the likelihood of SAB [35][36][37][38][39][40][41][42][43][44]. Recent developments in bacterial genome wide association studies (GWAS) demonstrate that these powerful tools can help delineate the genomic basis of bacterial infection. A study of bacteraemia caused by the ST-239 lineage investigated associations between bacterial genetic variants, toxin production and severity of disease in a mouse model [45]. Conversely, GWAS of S. aureus lineage CC-45 did not identify genomic predictors of bacteraemia-vscarriage [46]. Genomic variants were integrated with bacterial phenotyping and clinical data in 300 adults with bacteraemia involving the CC-22 and CC-30 lineages, finding that bacterial predictors of mortality varied by lineage [47]. Investigating whether a genomic basis for invasive disease exists more generally at a population level requires careful control for population structure, an otherwise potent confounder. Bacterial GWAS incorporating such controls has recently identified PVL as the key determinant of S. aureus pyomyositis in a paediatric population [17].
Here we present a bacterial GWAS of SAB across bacterial lineages, studying population-representative cases of bacteraemia and nasal carriage controls, integrating clinical data with 2001 bacterial sequences to investigate whether bacterial lineage or genomic variation is associated with bacteraemia-vs-carriage. Within S. aureus carriage, we further examine genomic features associated with HA-vs-CA carriage.

Identification of cases and controls
Cases of S. aureus bacteraemia (SAB) were identified from three UK hospital trusts between 2008-2014: Oxford University Hospitals NHS trust (Oxford UK), Brighton and Sussex University Hospitals NHS trust (Brighton, UK) and University Hospitals Plymouth NHS trust (Plymouth, UK). These sites were part of the UK Clinical Infection Research Group (UKCIRG) which established prospective cohort study of SAB in 2008 [6], and the International Staphylococcus Aureus Collaboration (ISAC) which established a multinational prospective cohort study of SAB in 2006 [7]. Sequential individuals from these studies over 13 years of age with S. aureus on blood culture were included if there was an isolate available for sequencing, with associated clinical data, and the blood culture had not been deemed to be a contaminant on local clinical review. We identified 1203 cases in patients that were not contaminants, after excluding repeat episodes (775, 232 and 196 at each centre respectively). Bacterial isolates were found from 724, 207 and 163 episodes at each centre respectively, and a minimum clinical data set (see below) was available for 674, 187 and 160 cases. We successfully sequenced 1017 of these 1021 cases for inclusion. These included 417 cases from a previously sequenced collection investigating identifying antimicrobial resistance [48] (Fig.  S1).
S. aureus isolates from nasal carriage in individuals without S. aureus infection were identified from two studies of S. aureus carriage in Oxfordshire, UK (Fig. S2).
The first was a study of S. aureus nasal carriage in adults in the community between July 2009 and April 2013 [49]. Of 1123 individuals enrolled, 360 individuals carrying S. aureus at recruitment and 211 swab-negative individuals were invited to supply nasal swabs at two-monthly intervals [49]. Where co-habiting individuals carried the same spa-type, only one individual was considered for inclusion to avoid over-sampling of strains with household transmission. The second was an investigation of nosocomial carriage and transmission at the John Radcliffe Hospital, Oxford between September 2009 and August 2011 [50]. Individuals admitted to three study wards had nasal swabs for S. aureus carriage performed on admission and at fortnightly intervals until discharge. In total, 1146 individuals were found to have S. aureus carriage on any swab during the study (enrolled from Intensive Treatment Unit (ITU; 729 individuals), Trauma unit (352) or one of two elderly care wards (65)). Carriers with S. aureus originally isolated on the first or second day of admission and no overnight stay in the preceding 12 months were classified as community-associated controls (269). Carriers with S. aureus originally isolated more than 2 days after admission and carriers who had been admitted for three or more nights in the preceding 28 days were classified as healthcare-associated controls (335). All carriers with S. aureus isolated from a clinical sample in the previous 12 months were excluded. In total 984 asymptomatic carriage controls were successfully sequenced (Fig. S2). For carriers with multiple positive samples, the latest sample of the longest carried spa-type was selected.

Epidemiological data
For each episode of SAB, the minimum dataset for inclusion was patient gender and age at the time of infection, date of admission, the number of days between admission and first blood culture from which S. aureus was cultured, and the number of days since the most recent discharge from hospital. If available, the clinically-determined focus of infection and vital status 90 days after infection were recorded. Epidemiological data on episodes of SAB was collected as part of on-going service evaluation studies, as part of multi-centre collaborations with the UK Clinical Infection Research Group (UKCIRG) [6] and the International Staphylococcus aureus Collaboration (ISAC) [7]. Further data, including for carriage controls were obtained from the Infections in Oxfordshire Research Database (IORD) which links information about patient attendances with results from pathology services in an anonymised research database [51].
Cases were deemed healthcare-associated (HA) if the first blood culture positive for S. aureus was collected on the third day or later of a hospital admission (healthcare onset cases), or if the patient had an inpatient admission in the previous 90 days (community onset, healthcare-associated cases) [52]. Cases were deemed community-associated (CA) if the first blood culture positive for S. aureus was collected on the first or second calendar day of admission, and there was no inpatient admission in the previous 90 days.

Microbiological methods
S. aureus isolates from blood culture were characterised using standard operating procedures of clinical laboratories at all three centres. Isolates for inclusion were retrieved and frozen in 15 % glycerol stock prior to DNA extraction.
Hospital carriage swabs were collected by research nurses using dry cotton-tipped swabs. Community carriage study participants self-collected swabs, returning these by mail as previously described [49]. All swabs were incubated in 5 % saline enrichment broth (Oxoid LTD, Basingstoke, UK) overnight at 37 °C before subculture onto SaSelect chromogenic agar (Bio-Rad, Watford, UK). Plates were examined after 24 h incubation and potential S. aureus colonies confirmed by catalase, DNase and Prolex Staph Xtra Latex kit (Pro-Lab Diagnostics, Birkenhead, UK). Isolates were stored at −80 °C in 15 % glycerol. Isolates were spa-typed as previously reported [49].

Whole genome sequencing
For each bacterial culture, a single colony was sub-cultured and DNA was extracted from the sub-cultured plate using a mechanical lysis step on FastPrep homogeniser (MPBiomedicals, Santa Ana, CA) followed by extraction with the Quickgene-mini80 device (Autogen Inc, Holliston, MA), and sequenced at the Wellcome Trust Centre for Human Genetics, Oxford. Six hundred isolates were sequenced on the Illumina HiSeq 2500 platform (San Diego, California, USA), with paired-end reads 150 base pairs long. A further 417 isolates were sequenced for an earlier study, 26 on with Illumina HiSeq 2500, and 391 on the HiSeq 2000 platform, with paired-end reads of 99 base pairs.

Variant calling
Following established methods [17], we used Velvet [53] v1.0.18 to assemble reads into contigs de novo. Velvet Optimiser v2.1.7 was used to choose the kmer lengths on a per sequence basis. The median kmer length for assembly was 123, however this was affected by sequencing read length, being significantly lower for assemblies based on 99 bp reads (median k=69) than those based on 150 bp reads (median k=125) (P<10 −5 , Wilcoxon rank sum test).
We used blast [54] to find the relevant loci, and defined multilocus sequence type (MLST) using the online database at http:// saureus. mlst. net/ Strains that shared six of seven MLST loci were considered to belong to the same CC. Antibiotic sensitivity was predicted by interrogating the assemblies for a panel of resistance determinants as previously described [48].

Reconstructing the phylogenetic tree
We constructed a maximum likelihood phylogeny of mapped genomes for visualization using RAxML [60] assuming a general time reversible (GTR) model, and fine-tuned the estimates of branch lengths using ClonalFrameML [59].

Kmer counting
We used a kmer-based approach to capture non-SNP variation [61]. Using the de novo assembled genomes, all unique 31 base haplotypes were counted using dsk [62]. If a kmer was found in the assembly it was counted (i.e. determined to be present for that genome), otherwise it was treated as absent. This produced a set of 23 860 793 variably present kmers, with the presence or absence of each determined per isolate. We identified a median of 2 760 000 kmers per isolate, including variably present kmers and kmers common to all genomes (IQR 2725000-2806000). The number of kmers found per isolate did not differ significantly with sequencing platform (P=0.4, Wilcoxon rank sum test). From a smaller set of 1610 isolates sequenced with 150 bp reads, we identified 22 284 204 variably present kmers.

Calculating heritability
We used the Genome-wide Efficient Mixed Model Association tool GEMMA [63] to fit a linear mixed model for association between a single phenotype (bacteraemia vs asymptomatic nasal carriage, [encoded as 1 and 0, retrospectively]). We calculated the relatedness matrix from biallelic SNP and kmer presence for tests of each allele type. We used GEMMA to estimate the proportion of variance in phenotypes explained by genotypic diversity (i.e. heritability).

Genome wide association testing of SNPs and kmers
We performed association testing using an R package bacte-rialGWAS (https:// github. com/ jessiewu/ bacterialGWAS), which implements a published method for locus testing in bacterial GWAS [64]. The association between each SNP and kmer with the phenotype was tested controlling for population structure and genetic background using the linear mixed model (LMM) implemented in GEMMA [63]. We included healthcare or community origin of case/control status as a fixed covariate in the model when testing for associations with the bacteraemia-vs-carriage phenotype. The parameters of the linear mixed model were estimated by maximum likelihood and a likelihood ratio test was performed against the null hypothesis (that each locus has no effect) using the software GEMMA [63], using a minor allele frequency of 0 to include all SNPs. GEMMA was modified to output the ML log-likelihood under the null and alternative hypothesis and -log 10 P values were calculated using R scripts in the bacterialGWAS package.

Testing for lineage effects
We tested for associations between lineage and phenotype using principal components (PCs) in the R package bugwas (available at https:// github. com/ sgearle/ bugwas), which implements a published method for lineage testing in bacterial GWAS [64]. PCs were computed based on biallelic SNPs using the R function prcomp. To test the null hypothesis of no background effect of each principal component, we used a Wald test [65] against a χ 2 distribution with one degree of freedom.

Kmer mapping and sequence alignment
We used Bowtie [66] to align all 31 bp kmers from short-read sequences to the reference genome MRSA252 [56]. For all 31 bp kmers significantly associated with case-controls status, the likely origin of the kmer was additionally determined by nucleotide sequence blast [54] of the kmers against a database of all S. aureus sequences in GenBank. We used blast [54] to identify the best match for coding sequences of interest in the de novo assembly, and used Jalview [67] to and visualize the assembled sequences.

Multiple testing correction
Multiple testing was accounted for by applying a Bonferroni correction [68]; the individual locus effect of a variant (PC, kmer or SNP) was considered significant if its p value was smaller than α/n p , where we took α = 0.05 to be the genomewide false-positive rate and n p to be the number of PCs, kmer phylopatterns or SNP phylopatterns. We defined each phylopattern to be a unique partition of individuals by the alleles at that kmer or SNP.
The Bonferroni correction represents a conservative approach to controlling for type 1 error. The harmonic mean p-value (HMP) has recently been developed as a method to combine alternative hypotheses against the null hypothesis, without sacrificing power, even when the tests are not independent [69]. The HMP was calculated for coding regions using the R package harmonicmeanp v3.0 (https:// CRAN. R-project. org/ package/ harmonicmeanp). The HMP across a region was then adjusted for the proportion of kmers mapping to that region: HMP adj =HMP/ω (where ω=proportion of kmers mapping to that region, compared to the total number of kmers mapping to coding regions). The adjusted HMP was compared directly to the significance threshold α L , being a nominal threshold α (0.05) adjusted for the number of unique p-values being tested [69].

S. aureus lineages do not differ strongly in their propensity to cause bacteraemia
A phylogeny of 2001 cases and controls demonstrated that a broad diversity of S. aureus lineages among our cases (Fig. 1), with representatives from all clonal complexes (CC) as defined in the MLST scheme [70]. bacteraemia-vs-carriage (P=0.02, Wald test), but this was not statistically significant after adjusting for multiple testing (Figs 1 and S3). The overall sample heritability was predicted to be low (2.1%, 95 % CI 0.0-5.3 %). This comprehensive survey of SAB and carriage indicates that lineages of S. aureus do not differ substantially in their intrinsic propensity to cause bacteraemia.

Antimicrobial resistance determinants are associated with S. aureus bacteraemia
Testing all identified SNPs for association with case/control status in 2001 isolates did not identify any statistically significant associations between individual SNPs and bacteraemia at the genome wide level when controlling for population structure (Fig. S4). However, the SNP coming closest to a statistically significant association (P=10 -5.6 , likelihood-ratio test (LRT)) was an A to T mutation at position 1 497 290 in the MRSA252 reference genome. This SNP encodes a phenylalanine to tyrosine substitution at codon position 99 in dihydrofolate reductase (dfrB); this F99Y mutation confers trimethoprim resistance [48]. It was relatively rare, being found in 41 cases and five controls, and correlated with resistance to other antibiotics: 36/46 (78 %) of isolates with this variant were also methicillin resistant. This variant was found most commonly in isolates from CC-22 (63 %) and ST-36 (22 %).
To further investigate associations between genomic content or sequence variation and the bacteraemia-vs-carriage phenotype, we used a kmer approach [60] to detect variation in the accessory genome, and variants such as small insertions or deletions that are not well captured by mapping SNPs.
When testing kmers found in all 2001 isolates, we found 1214 kmers significantly associated with carriage. These kmers mapped to multiple sites across the genome, most of which were repeat regions, including 16S rDNA and transposon insertion sequences (Fig. S5a). However, the presence of these kmers was strongly affected by the Illumina sequencing read length, being found in isolates sequenced using 150 bp but not 99 bp reads (Fig. S5b). De novo assembly was repeated using a constrained kmer length in assembly (up to 79 bp) to control for the variation in read length, and when these new assemblies were used, 924/1214 (76 %) of the previously identified 31 bp kmers were no longer significantly associated with the phenotype (Fig. S5c). We concluded that varying length of sequencing was a major source of confounding in kmer-based associated estimates.
To avoid false positive results, we therefore restricted the investigation of kmers associated with bacteraemia-vscarriage to isolates sequenced with 150 bp reads ( Table 2). This reduced set of cases showed similar epidemiological characteristics to the larger group (Table 1).
No further individual kmers met the threshold for significance, but there were distinctive peaks enriched for small p-values in the Manhattan plot (Figs 2 and S6). We calculated HMP to perform aggregate kmer-based tests of association across coding regions of the genome. At the whole-genome level, the pooled evidence for association between coding sequence variation and bacteraemia was considerable (HMP=10 -3.3 , LRT). The evidence for kmers associated with bacteraemia, when pooled, was significant in several loci (Fig. 2), including the dfrB locus, SAR1439 (HMP=10 -6.9 , HMP adj =10 -3.3 , LRT)). The presence of kmers mapping across the SCCmec region was also significantly associated with bacteraemia (HMP=10 -4.4 , HMP adj =10 -2.2 , LRT). The strongest evidence within SCCmec was for mecA (SAR0039, HMP=10 -5.3 , HMP adj =0.02, LRT), which encodes PBP2a, a transpeptidase which has low beta-lactam affinity and confers methicillin resistance. When this region was examined in closer detail, high-risk kmers covered the entirety of the SAR0039 locus, encoding PBP2a, a PBP with low affinity for beta-lactams (Fig. 2b). There are no alternate low-risk kmers in this region, suggesting the presence of this gene, rather than variation within it, is associated with bacteraemia. Thus, genomic sequences associated with AMR to both trimethoprim and beta-lactams, but not other antimicrobial classes, were significantly associated with bacteraemia.

Genomic signals of healthcare-associated carriage include antimicrobial resistance factors, as well as variation in a virulence determinant, prsA
Regulatory gene changes have been associated with persistent bacteraemia and increased mortality, and it has been hypothesised these changes either convey or accompany relative survival advantages in healthcare environments [31,32]. We compared the genomic factors associated with healthcare environments by conducting a GWAS for HA-vs-CA among carriers (330 HA, 654 CA). We focused only on carriage isolates because we had more extensive data about hospital admissions in this group, and epidemiological data to further demonstrate that isolates from CA-carriage truly reflected community origin. The MRSA CC-22 lineage showed the strongest association with HA-vs-CA carriage (P=0.06, Wald test, Fig. S7), but this lineage effect was even less significant than for bacteraemia-vs-carriage (P=0.02, Wald test, Fig. S3), suggesting that no lineages were strongly associated with healthcare acquisition of S. aureus carriage.
In total 124 SNPs were associated with HA-vs-CA carriage after controlling for population structure, and adjusting for the 77 597 SNP phylopatterns in the population (Fig. S8, Table S3). The most significant were non-coding (intergenic) variants in a region encoding tRNAgly at 2034022-2034039. However, basecalls at these sites were only made in 380/984 (38. 6%) with calls imputed in the remaining 604/984 (61.4 %).
Excluding isolates with imputed calls at these sites reduced the unadjusted OR for finding these SNPs in HA-vs-CA carriage from 2.1 to 0.87, suggesting that the observed association was a product of SNP imputation. The most significant coding variants included a G to A substitution at 2 417 648 in MRSA242 (P=10 -7.2 , LRT), encoding a proline to leucine substitution at 707 in SAR2345, an AcrB/AcrD/AcrF family protein, which is a multidrug efflux system subunit. They also included a C to T substitution at 7255 in MRSA252 (P=10 -7.2 ), which encodes L84S in gyrA and confers quinolone resistance [48]. A variant at these positions was called in all 984 isolates, being found in 43/330 (13.0 %) HA carriage isolates and 23/264 (3.5 %) CA carriage isolates, with no calls imputed. A significant association was also seen with a band of 91 low frequency SNPs, with shared minor alleles co-inherited predominantly in the MRSA sub-clade of CC-22 (Fig. S9).
The dfrB mutation seen associated with bacteraemia was not significantly associated with HA-vs-CA carriage (P=0.4).
All carriage isolates were sequenced with 150 bp reads, so all were included in a study of kmer associations. We found 188 31 bp kmers in 59 unique patterns were significantly associated with HA carriage (Fig. 3a, Table S4); 123/188 (65.4 %) kmers comprising 52 phylopatterns mapped to a single gene -prsA -and their absence was associated with HA carriage (P=10 -7.2 -10 -19.4 , LRT) (Fig. 3b) While CA carriage isolates showed conservation of the prsA sequence across the population, multiple patterns of variation arising in different lineages were seen in HA carriage isolates (Fig. S10), consisting of both SNPs and deletions (Fig. S11). When a conditional SNP GWAS was performed including the presence or absence of any of the kmers most strongly associated with HA-vs-CA carriage as a covariate in the model, no SNPs reached the threshold for significance, suggesting the signal of association accompanying these SNPs was better explained by prsA variation. Variation in prsA was common in the MRSA sub-clade of CC-22 but was not limited to this lineage. PrsA is a surface bound foldase protein, responsible for post-translational processing of virulence factors, including proteases and cell surface proteins [74,75] which has also been shown to modulate susceptibility for both beta-lactam and glycopeptide antibiotics [76]. No peak of significant kmers occurred at the prsA locus in the bacteraemia-vs-carriage GWAS (Fig. 2a), suggesting that variation at this locus is specifically associated with carriage arising from the healthcare environment, but not with bacteraemia.

DISCUSSION
These findings reflect analysis of a large collection, representing the populations of S. aureus circulating in community and healthcare settings, including the whole genomic content of the population, with control of population structure. In doing so, we address the conflicting results from smaller case-control studies which have found evidence for [26,29] and against [18,24] differing invasiveness between S. aureus lineages. We conclude that S. aureus lineages do not differ in the frequency with which they caused bacteraemia, compared to their frequency in carriage. However, across lineages we found evidence that genetic variants underlying AMR were associated with increased odds of bacteraemia versus carriage. These included some determinants of methicillin resistance, contrasting with previous research indicating that methicillin resistance incurs a fitness cost for S. aureus, reducing its pathogenicity [77,78].
The most obvious possible explanation for association between methicillin resistance and bacteraemia would be survival benefit in the presence of beta-lactam antibiotics. However more than half of our MRSA bacteraemia cases were community-associated, being first detected prior to or within 48 h of hospital admission. While we do not have data about pre-hospital antibiotic treatment, it is unlikely that the majority of patients with CA bacteraemia were on betalactam antibiotics at the time that bacteraemia developed. It is also possible that other between-group differences, such co-morbid illnesses, may also account for the different prevalence of MRSA observed between patients with bacteraemia and asymptomatic carriers, and we have not been able to measure patient factors confounding between MRSA carriage and invasive infection.
However, there is in vitro evidence that mecA -the primary determinant of methicillin resistance -modifies bacterial virulence independently of antibiotic selection pressure, through modulation of agr-mediated toxin expression [33]. Previous evidence has implicated the altered PBP encoded by mecA in persistent and complicated bacteraemia, and our observations concords with those findings. Analysis of SAB over 21 years in the USA showed that within CC-8, the MRSA lineage USA300 was associated with higher rates of metastatic infection compared to the rest of the spa-t008 complex after adjusting for patient and clinical variables [79]. S. aureus strains which are phenotypically methicillin susceptible but contain the mecA gene have been reported to have a higher risk of persistent bacteraemia, even when treated with vancomycin [80]. AMR elements have been implicated as virulence determinants in another Gram-positive species; Streptococcus pneumoniae, where a PBP variant associated with penicillin tolerance (though not resistance) was associated with meningitis among isolates found in invasive pneumococcal disease [81].
Our finding of trimethoprim resistance associated with bacteraemia is not easily explained through direct antibiotic selection since trimethoprim containing antibiotics (e.g. co-trimoxazole) are not advised in therapeutic guidelines for treatment of skin and soft tissue infections in the United Kingdom, where CA-MRSA rates are low. It is plausible that patients may receive trimethoprim for urinary tract infections in the community, and this could reflect an impact on commensal flora, enabling invasion by trimethoprim-resistant S. aureus strains. It is also possible that mutations in dfrB, a metabolic gene important in bacterial DNA synthesis, affect bacterial persistence: trimethoprim resistant S. aureus have shown variable growth rate and survival under environmental stress according to the mechanism of resistance [82].
Overall, we found MRSA was more prevalent in bacteraemia than carriage when sampling the population representatively.
In the UK at this time MRSA was almost exclusively healthcare associated. Consequently, one potential explanation for this finding is unmeasured healthcare exposure among CA bacteraemia cases for which healthcare exposure data was only available for the preceding 12 weeks. However even restricting to HA associated cases and controls MRSA was associated with bacteraemia. Furthermore, by controlling for population structure, we can be confident the association demonstrated between mecA and bacteraemia is not simply a reflection of healthcare adapted lineages, and that such lineage effects could have been detected using the GWAS methods employed here. In fact, while variation within prsA was strongly associated with HA carriage, variation in this gene was not associated with bacteraemia-vs-carriage. Likewise, a quinolone resistance mutation in gyrA was associated with HA carriage but not with bacteraemia. The association with quinolone resistance is consistent with previous reports of quinolone resistance associated with HA S. aureus infections in the absence of quinolone treatment [83]. However these contrasting observations suggest that distinct bacterial factors favour healthcare environment adaptation or transmission compared with those favouring bloodstream infection.
The surface bound foldase protein PrsA has been implicated as a secondary resistance factor for both beta-lactam and glycopeptide antibiotics [84]. In vitro assays have shown that S. aureus can survive despite disruption to prsA, but strains with prsA disruption are more susceptible to oxacillin [84]. PrsA reduces the membrane quantity of PBP2a without altered transcription of mecA, regulating expression of a methicillin-resistant phenotype independently of mecA. Additionally, PrsA plays an important role in the post-translational processing of virulence factors, including proteases and cell surface proteins [74,75], and while the secretion of some virulence factors is decreased when prsA is deleted [74], PrsAdeficient bacteria have enhanced aggregation and adherence [75], changes which might favour survival or transmission in the healthcare environment.
Our study demonstrates some constraints and pitfalls for bacterial GWAS. Firstly, sequencing read length was a strong source of confounding in our data which, without adequate control, produced false positive results. This is an ongoing challenge for studies pooling existing sequencing data where it is not possible to randomize cases and controls across sequencing batches. We dealt with this using the conservative option of excluding 391 cases from kmer analysis, representing a substantial sacrifice of power. A further limitation was that population structure appeared to be incompletely controlled for low frequency variants [85], as exhibited by a set of 91 SNPs in strong linkage disequilibrium (LD) associated with HA-vs-CA carriage. Linear mixed models are able to control for lineages, and cryptic population structure through the use of a relatedness matrix, but they make the assumption that closely related isolates are unlikely to have large differences in phenotype. This assumption means that they can incompletely account for lineage effects arising from closely related strains which vary significantly in frequency between phenotypes [85]. In our study, the SNPs in LD were found in the predominantly HA-MRSA isolates within CC-22. This may represent either a true association with that lineage, or co-carriage with another genomic element in that lineage. In this case, prsA variation was common in CC-22, but not detected in the SNP based study (as the variation was a deletion rather than a nucleotide substitution), and the signal of association accompanying these SNPs was better explained by prsA variation. Overall, the kmer-based methods were more fruitful in our study, both in their ability to detect non-SNP based variation (such as deletions in prsA, and the presence of mecA), and retaining the ability to identify significant SNPs (including a dfrB variant).
In previous studies we have identified within-host adaptation of S. aureus genes associated with development of invasive disease from a colonising isolate -particularly involving the agr locus, and the cell wall proteins under regulatory control of agr and rsp [42]. Such variation was not associated with bacteraemia in this population-based study, perhaps because it provides only a short-term advantage to the bacteria, and in the longer term is detrimental, by adversely affecting transmission. In contrast, variation which confers AMR is likely to confer a bacterial survival advantage in carriage in the face of antibiotic selection pressure in the healthcare environment, as well as in the bloodstream, allowing these variants to survive in the population.

CONCLUSIONS
In a study of over 2000 isolates from S. aureus bacteraemia and nasal carriage in the UK, we found strong evidence that all S. aureus lineages are equally capable of causing bloodstream infection, and of being carried in the healthcare environment.
We found that genomic variation in prsA (encoding a foldase protein) was a novel genomic marker of healthcare adaptation in S. aureus. This predictor of healthcare-associated carriage was not associated with bacteraemia, while AMR determinants were associated with both bacteraemia and healthcareassociated carriage, raising the suggestion that in addition to enabling survival in healthcare environments, AMR functions as a virulence factor, promoting invasive disease.
Given studies demonstrating a direct effect of mecA on toxin expression [33] and reduced toxicity enhancing bloodstream survival [35], we hypothesise that lowered expression of toxic virulence factors seen in MRSA may be one method by which S. aureus gains a short-term survival advantage and causes bloodstream infection.