A phylogenomic framework for assessing the global emergence and evolution of clonal complex 398 methicillin-resistant Staphylococcus aureus

Distinct clones of methicillin-resistant Staphylococcus aureus (MRSA) have emerged as important causes of infection in individuals who have exposure to livestock (livestock-associated MRSA; LA-MRSA). Clonal complex 398 (CC398) is the most prevalent LA-MRSA clone, and has been reported from several geographical settings, including Europe, the Americas and Asia. To understand the factors contributing to the global dissemination of this clone, we analysed CC398 MRSA isolates from New Zealand (NZ), a geographically isolated country with an economy strongly dependent on livestock farming. We supplemented the NZ CC398 MRSA collection with global datasets of CC398 MRSA and CC398 methicillin-susceptible S. aureus. Here, we demonstrate multiple sporadic incursions of CC398 MRSA into NZ, as well as recent importation and spread of a swine-associated clade related to the European LA-MRSA lineage. Within a larger global phylogenomic framework, Bayesian modelling suggested that this NZ clade emerged in the late 2000s, with a probable origin in swine from Western Europe. Elucidating the factors responsible for the incursion and spread of LA-MRSA in geographically distant regions, such as NZ, provides important insights into global pathways of S. aureus transmission, and will inform strategies to control importation and spread.


Phylogenomic analysis
Whole-genome short-read sequence data from 145 S. aureus isolates was trimmed using Trimmomatic (Bolger et al., 2014) and mapped to the NZ15MR0322 chromosome using Snippy (Version 3.1; https://github.com/tseemann/snippy). This generated a consensus MRSA0322 chromosome containing variant sites and indels and the modified chromosomes (each with 2,839,203 bases) were concatenated into a single MultiFASTA file. Gubbins (Croucher et al., 2015) was used to identify recombination. Three major regions were identified: (1) a ~123kb region previously identified by Price et al., of about 120kb in length (Price et al., 2012); and (2) two additional regions encompassing insertion sites of the two phages ( Figure 1). These regions were removed from the alignment, resulting in an alignment of 2,138,149 coding bases, with 6,654 variable sites. A ML tree was produced using IQTREE (www.iqtree.org) with the following command: iqtree -m TEST -mset JC,HKY,GTR -pre model_test -s cc398_parts123_recomb_filtered_core.fasta -nt 16 This phylogeny was used to assess the correlation between branch length and year of isolation with TempEst (version 1.5.4) (Rambaut, 2016) 1949 -1952), using the two pass mode. This indicated to us that there was sufficient heterochronicity in the data to undertake subsequent analysis using BEAST2.
The above alignment was then used as input to BEASTIFY, a Python script that produces NEXUS files suitable for input in BEAUTI 2.4.3 to produce XML files for BEAST 2.4.3. BEASTIFY takes as input the MultiFASTA alignment, and a Genbank file containing the annotation for the chromosome. It then partitions each site in the alignment into first, second, and third codon positions, inter-genetic regions, and sites with overlapping CDS annotations (i.e., sites belonging to more than one codon position). BEASTIFY allows the user to control which partitions are outputted, and whether to output all bases or a random sample of bases from each partition. In our analysis, we only outputted first, second, and third codon positions.
Four models were trialled in BEAST 2.4.3: (1) Strict molecular clock (SMC) with a constant population coalescent prior on the genealogy; (2) SMC with an exponential population growth coalescent prior on the genealogy; (3) Uncorrelated Log-Normal Clock (ULNC) with a constant coalescent prior on the genealogy; and (4) ULNC with an exponential population growth coalescent prior on the genealogy.
We set a log-normal prior on the clock rate with mean of -10 and standard deviation of 2.0, which established that 95% of the prior density on the clock rate was between 9.01e -7 and 2.29e -3 . We assumed a GTR+G for each of the partitions, with the G distribution approximated by four categories. The a parameter of the G distribution was estimated from the data, as well as the frequency of each of the six possible substitutions. All other priors were kept as default.
Each model was run for 50 million iterations, and the traces were examined for convergence. The first 25 million steps were discarded as burn-in for all analyses. RWTY (https://github.com/danlwarren/ RWTY) was used to examine the posterior distribution of trees, and posterior distribution of split frequency. AICm was used to evaluate the model choice, as described by Baele et al.(Baele et al., 2013). The AICm suggested that the ULNC with an exponential population growth coalescent prior on the genealogy fitted the data better than any other model.
The final results from this model all had effective sample sizes (ESS) of over 200 and had convergence across all runs. . The tree files for each run were combined and sub-sampled using logCombiner in order to produce 25,000 posterior genealogies, and to produce the final most recent common ancestor (MRCA) estimates. The maximum clade credibility tree (with nodes at median heights) was produced using TreeAnnotator.