Comparison of genomic diversity between single and pooled Staphylococcus aureus colonies isolated from human colonization cultures

Graphical abstract


InTRoDuCTIon
Large-scale whole-genome sequencing (WGS) of bacterial pathogen species offers hope for more accurate infectious disease surveillance and a better understanding of within-host evolution during human disease and asymptomatic colonization [2][3][4].Typically, bacterial genomics-based surveillance studies sequence DNA isolated from single bacterial colonies cultured on selective media from each clinical sample tested [2,[5][6][7].The single-colony bacterial culture can be further tested in the laboratory for phenotypes such as antibiotic resistance and toxicity.
However, individual colonies do not provide insight into the genetic diversity of the population of the species in the sample as there could be multiple strains present in the sample [8][9][10].Even if only one strain is present, there will be accumulated microdiversity between individual isolates that is roughly proportional to the duration between initial colonization and time of sampling, assuming the absence of bottlenecks [6,11].
A few studies have undertaken sequencing of multiple single colonies from clinical samples [10,12,13].This strategy can allow the comparison of phenotypes associated with intra-sample genetic variation and the construction of phylogenetic trees to trace the relationship between samples.However, costs rise linearly with each additional colony sequenced per sample, necessitating a cost-benefit analysis of how many colonies to sequence.Sequencing colonies from isolation plates of a single sample is an alternative approach [13][14][15][16].This is sometimes called 'sweep sequencing' , 'population sequencing' or 'pool-seq' , and the latter term will be used here.In pool-seq, multiple colonies from the same species can be sampled genetically at the same economic cost as for sequencing an individual isolate.Pool-seq has generally been found to be reliable in measuring sequence variation and allele frequency [12,17].However, the disadvantages of this method are the perceived complexity of the bioinformatic analysis and the complications of assessing the phenotypic characteristics of the population of bacterial clones, such as antibiotic resistance, when these assays typically require clonally purified single colonies.
Single-isolate sequencing is a convenient sampling strategy based on the assumption that strain mixtures are rare and capturing within-strain microdiversity is not worth the additional expense of sequencing.While the cost of raw sequence production has steadily declined, the costs of labour and infrastructure, such as DNA extraction, library preparation, physical sample storage and bioinformatic analysis, have not scaled down at the same rate [18].There has also been little analysis of what the increased sequencing and storage costs from sampling multiple colonies or pools yield over single colonies.However, pool-seq can still provide insights into the natural history of even well-studied pathogens, and inform us about the fate of adaptations that enhance virulence and antibiotic resistance [2,[19][20][21][22].Therefore, optimizing sampling strategies and genomic workflow design is essential to minimize the number of samples processed while maximizing the information obtained from each clinical sample.
In this work, we use samples from an ongoing study of Staphylococcus aureus colonization on humans to compare the three strategies outlined above: single-isolate sequencing, sequencing collections of multiple single colonies and pool-seq.S. aureus is a ubiquitous nosocomial pathogen prevalent worldwide, causing invasive disease syndromes such as bacteraemia, endocarditis

Impact Statement
While pooled population sequencing has been employed to study within-host diversity, the differences in attainable information between single and pooled sequences are not clear.A direct comparison between single-isolate and pooled population sequences can help devise optimal sampling strategies for clinical and within-host diversity studies.In this study, we attempt to answer the question of how many colonies obtained from a single patient are sufficient to obtain a representation of the total population diversity within the patient while keeping in mind time and labour costs.These findings have implications for using whole-genome sequencing (WGS) in the clinical microbiology laboratory to identify and speciate pathogens and to determine their antimicrobial susceptibilities.
and osteomyelitis [23,24].As with other prominent pathogens, WGS has significantly improved S. aureus epidemiological studies, and our ability to track the spread of antibiotic resistance and virulence across populations [3,[25][26][27][28][29].Here, we used samples from human participants who had an index methicillin-resistant S. aureus (MRSA) skin and soft tissue infection (SSTI) as part of an ongoing study, SEMAPHORE (Study of the Evolution of MRSA, Antibiotics and Persistence Having the Outcome of Recurrence).The study was designed to examine the clinical and demographic characteristics of the participants, and the genomes of the colonizing S. aureus to identify factors associated with recurrent skin infections.However, for this paper, we focused on the relationship between the pool-seq and collections of single-isolate genome sequencing.We first quantified the amount of variation within the collections of single genome and pool-seq and then investigated three specific questions.(1) Can the pool-seq data identify clonal S. aureus populations (comprising a single sequence type or ST) from mixtures of diverse lineages?(2) Can pool-seq data be used to estimate the number of sites within mono-ST populations undergoing polymorphisms?(3) Was pool-seq more sensitive in detecting antimicrobial resistance (AMR) genes than sequencing single clones?

Strain sampling
Participants were enrolled into the SEMAPHORE study after presenting with a culture-positive MRSA SSTI.Up to four time points (every 3 months for 1 year) and up to three body sites (anterior nares, oropharynx and/or inguinal skin) were sampled for each participant.Each swab was streaked out onto BBL CHROMAgar S. aureus (SACA).Then eight individual colonies (singles) were subcultured onto blood agar and sequenced.The remaining colonies, ranging between 3-200, were pooled, sub-cultured onto blood agar and then sequenced.In cases where there was no growth from directly plating the swabs, each swab was enriched for growth in tryptic soy broth (TSB) overnight.These overnight cultures were then plated and colonies were picked as mentioned above.One hundred and eighty-one pools and 1448 singles were directly plated on SACA (direct cultures).The remaining 73 pools and 584 singles did not show growth upon direct plating and therefore were enriched in TSB and then plated (enrichment cultures).
In this study, we only analysed swabs from which eight singles and a pool were obtained.Swabs from which fewer than eight singles were obtained were not considered.In total, we obtained 254 pools from 85 participants and 8 singles corresponding to each pool, giving us a total of 2286 genome sequences.

Library preparation and sequencing
Genomic DNA extractions were performed using Qiagen kits.Library preparation and WGS were performed by the Children's Hospital of Pennsylvania Microbiome Center using Illumina MiSeq or Hiseq platforms.

Pairwise SnP distance calculation, dereplication and phylogeny
For each group of eight singles in a collection, we used Parsnp v1.5.3 to align the single-colony isolated genomes and used snp-dists v0.7.0 to calculate pairwise SNP distances [41].To dereplicate singles, isolates with SNP distance <10 were collapsed into clusters and a random isolate was chosen as the cluster representative using Assembly-Dereplicator [42].The final set comprised 296 singles, where each collection is represented at least once.This resulting set of singles was aligned again using Parsnp v1.5.3 and recombinant regions were detected and masked using ClonalFrameML v1.12 and maskrc-svg v0.5, respectively.A recombination-masked core genome phylogeny was then constructed using IQ-TREE v1.6.12 with the GTR+FO model and 1000 ultrafast bootstraps [43,44].Phylogeny was visualized using the R package ggtree [45].

number of variants, segregating sites and allele frequency calculation
We calculated allele frequencies from bam files generated by Bactopia using bcftools mpileup [46].The reference for each pool was auto-chosen by Bactopia based on the closest complete S. aureus genome in terms of Mash distance [38].The median Mash distance between all the pools and their corresponding reference sequence was 0.0122.A Mash distance of 0.01 approximates to 200 SNPs [47].All singles were then aligned to the same reference as their corresponding pool.We used gubbins v2.4.1 to detect within-collection recombinant regions in the singles and found that the average number of recombination events was <1.Therefore, we proceeded without masking the recombinant regions [48].Only variants with a QUAL score >50 and with at least a read depth of 25 were considered for the analysis.For each collection, we calculated the allele frequencies for every position across the genome where at least one read piled up with a base call differing from the reference allele.Variants with frequencies <0.05 were considered '0' (absence) and those with frequencies >0.95 were considered '1' (fixed).The allele frequencies for the expected and downsampled pools were calculated based on the number of singles the variant was fixed in.For example, if a variant was present in one out of the eight singles at a frequency >0.95 (fixed), its allele frequency in the expected pool would be ⅛ or 0.125.Two or four random singles out of the eight were selected to measure the allele frequencies in the downsampled pools.Variants with intermediate frequencies in the singles (>0.05 and <0.95) were not considered.
To calculate the number of segregating sites across the true and expected pools, we wanted to exclude variants that occurred simply because of alignment against a specific reference.If a given variant was fixed in the expected pool (present in eight out of eight singles at an AF >0.95) and in the true pool (AF >0.95), we considered these variants to be ancestral and did not count them as segregating sites.All remaining sites with AF >0.05 were counted.
We calculated nucleotide diversity (π) with InStrain [49] using the auto-chosen reference and the alignment bam file from Bactopia.Expected pools (eight colonies) and downsampled pools (two and four colonies) were generated by combining equal proportions of reads from all eight, two or four colonies.For each collection, we used reformat.sh from the bbtools suite [31] to sample 12.5 % of reads from all eight colonies for the expected pool, 50 % of reads from two randomly selected colonies for the two-colony downsampled pool and 25 % of reads from four randomly selected colonies for the four-colony downsampled pool.All artificial pools (expected and downsampled) contained 1 million reads.To measure π for known strain mixtures, we first artificially introduced SNPs at the rate of 0.01 (1 SNP every 100 bases) to S. aureus JE2 reference genome (accession GCF_000013465.1) using Mutation-Simulator v2.0.3 [50].Then, we used InSilicoSeq v1.5.4 [51] (command: iss generate --seed 1000 --model MiSeq) to mix the wild-type (WT) JE2 and the artificially mutated JE2 in the ratios 99 : 01, 90 : 10, 80 : 20 and 60 : 40 (WT : mutant), respectively.We then ran InStrain for all the artificial mixtures to calculate π.

Logistic regression
Logistic regression was performed in R using the glm function [52].Seventy per cent of 254 pool-seq samples were used as the training set and the remaining 30 % were used as the test set.A pool was considered multi-ST if the MLST alleles in the pool and the corresponding eight singles were not identical.A table containing all raw values for the variables used in the logistic regression can be found in Supplementary Data Sheet 1. Continuous probabilities from the logistic regression model were converted to binary using a cutoff of 0.89 (if probability >0.89, the prediction was considered to be multi-ST).This cutoff was estimated using the optimalCutoff function from the R package InformationValue [53].McFadden R 2 was calculated using the pR2 function from the R package pscl [54].Variance inflation factor was calculated using the vif function from the R package car [55].

Statistical analyses and data visualization
All statistics and PCA were performed in R using packages stats and rstatix [52,56].All plots were visualized using R package ggplot2 [57].Other graphics were created using bioicons and draw.io [58,59].

RESuLTS
Samples from the SEMAPHORE study were plated on CHROMAgar S. aureus and a 'collection' of eight individual S. aureus colonies ('singles') was obtained (Fig. 1b).The remaining S. aureus colonies (3-200) on the plate were pooled and sequenced, hereafter referred to as 'pools' or 'pool-seq' .The collective sequencing data obtained from all eight singles for each pool were referred to as 'expected pools' .Similarly, sequencing data sampled from two random singles and four random singles, respectively, were combined to generate 'downsampled pools' .The SEMAPHORE data used in this work was from 85 participants with 254 samples (254 pools and 254 collections of 8 singles -2032 singles in total) (Fig. 1a).All FASTQ files for the singles were capped to 100× genome coverage.

most collections have only one multilocus sequence type (mLST)
Upon analysing the singles alone, we found a wide range of SNP distances between singles from the same collection, with a minimum of 0 and a maximum of 15 315.For 241 out of 254 collections (~95 %), the maximum SNP distance between any 2 pairs of isolates was <100 (Fig. 2a), suggesting that most collections of 8 singles comprised only closely related isolates.However, 12 collections of singles (5 %) showed clear signs of mixed strains, with a maximum SNP distance of >4000.When we compared the MLSTs amongst the singles for each collection, all 12 of these collections had at least 1 isolate that was a different ST from the remaining ones.This showed that comparing STs and pairwise SNP distances between multiple singles within collections could identify potential mixed infections, as mono-ST collections had lower maximum pairwise SNP distances.Although 59 STs were identified in total, 51 % of singles (1051/2032) belonged to ST8 and ST5 (Fig. 2b, d).This was expected, as ST8 and ST5 are the most common hospital-associated and community-associated STs in the USA.In addition, we also found 36 unique MLSTs spread across 271 isolates that we grouped into the category 'unknown' .Eight of these 36 STs have at least 1 undetected MLST allele and therefore cannot be assigned an ST.Another eight STs have multiple alleles detected for the same gene and therefore also cannot be assigned an ST.The remaining 20 STs, while all alleles have been detected, do not have an assigned ST in pubmlst.org.
When we included both pools and singles in the ST analysis, we found that for 209/254 collections (~82 %) the STs for the 8 singles and the pool were identical, suggesting that they were mono-ST collections.In the remaining 45 collections (~18 %) either at least 1 single or the pool had a different ST (Fig. 2b, c).This includes the 12 aforementioned collections, where we detected multi-ST collections based on singles alone.This means that we found more potential multi-ST collections when we included the pools in our analysis.For 15 of these 45 collections, the ST of the pools were untypable due to the presence of multiple alleles for the same gene, confirming that these were multi-ST collections based on observing the pool alone.This means that for 30 collections, we needed to refer to the STs from both the pool-seq and from the corresponding 8 singles to infer whether the sample was multi-ST.
We observed no significant differences in the occurrence of multi-ST pools across the different time points, body sites and culturing methods (chi-squared test, P>0.01).These data suggested that a given collection usually had a low level of S. aureus diversity and that we can find ST mixtures by comparing SNP distances and STs within collections of single colonies.

Pool-seq samples with elevated average minor allele frequency, elevated number of contigs, higher nucleotide diversity and untypable mLST were associated with strain mixtures
As mentioned previously, for 30 collections we were unable to confirm the presence of a multi-ST population based on the ST assignment in the pool-seq alone.Therefore, we examined what features of the pool-seq data could be used to assess whether there was 1 ST present in the pool, or a mixture of STs, by comparing the 254 pool-seq samples to their cognate collections of singles.We focused on four measures: assembly quality, nucleotide diversity, gene number and minor allele frequency (MAF).
Sequencing reads from both singles and pool-seq were processed identically using the Bactopia pipeline with the same quality control parameters [30].Both single and pool-seq reads had a final average quality score of 36.3 (Welch's t-test P>0.01).We expected the genome assemblies (generated using the SKESA assembler [32]) from single colonies to be higher quality than poolseq, as the latter may contain multiple S. aureus strains and possibly contaminating species from the culture plate.We evaluated assembly quality using CheckM and QUAST [33,34], observing that while most pools and singles had comparable coverage (Fig. 3a, Wilcoxon P>0.01, effect size=0.052), pools had a higher number of contigs (Fig. 3b, Wilcoxon P<0.01, effect size=0.20), higher heterogeneity (Fig. 3c, Wilcoxon P<0.01, effect size=0.347) and higher contamination scores (Fig. 3c, Wilcoxon P<0.01, effect size=0.239).Thirty-two out of 224 pools (14 %) had more than 200 contigs in contrast to only 5 out of 1792 singles (0.2 %).The CheckM heterogeneity score indicated the source of the contamination -a heterogeneity score <50 % indicates that the source of contamination is phylogenetically distant and vice versa [32].While all singles had contamination and heterogeneity scores of 0, the pools ranged from low heterogeneity contamination to high heterogeneity contamination (Fig. 3c).Seven pools (3 %) had a heterogeneity score >50 with a contamination score >10, suggesting that they were contaminated by phylogenetically similar sources.Fifteen pools (6 %) had a heterogeneity score <50 with a contamination score >10, suggesting that they are contaminated by phylogenetically distant sources.We ran Kraken [40] on our pools and found Staphylococcus haemolyticus to be the most common contaminant, occurring in 44 pools at a read fraction >1 %.Overall, these results suggested that genome assembly quality can be useful for assessing the population heterogeneity of pool-seq.
Allele frequency (AF), i.e. the fraction of reads piled up over a particular variant position, is a useful metric for genomic heterogeneity.We mapped the pooled sequences against the closest complete S. aureus reference genome (see Methods) to obtain the variant sites.The eight singles were also aligned to the same reference as the corresponding pool.For each sample in a collection, we calculated the average minor allele frequency, i.e. the sum of all minor allele frequencies divided by the total number of variant sites (MAF).If all reads mapped to only the reference or only the alternate alleles, the average MAF would always be zero, as would be expected from ideal single pure cultures.We plotted the average MAF against the total number of variant sites for 254 pools (Fig. 4a).We split the plot into four quadrants based on 2 parameters -a number of variants cutoff (2800 sites or 0.1 % of the S. aureus genome), suggesting only few variant sites, and average MAF cutoff of 0.05, below which we deemed the sample as having no minor alleles.
Two hundred and twenty-three pools (~88 %) had a total number of variant sites <0.1 % of the S. aureus genome (Fig. 4a left quadrant).Fifty-five of these 223 had an average MAF <0.05 (Fig. 4a bottom left quadrant), suggesting highly homogeneous samples.In contrast, there were 14 pool-seqs with more than 2800 variants (~0.1 %) of MAF >0.05 (Fig. 4a top right quadrant).Pool-seqs in the bottom right quadrant (average MAF <0.05 with >2800 variants) have a large number of variants because they are distant from the reference sequence used for variant calling; however, their low average MAF (< 0.05) indicates that they are still pure samples (see Methods for how reference sequences were chosen).To assign a diversity score based on MAFs, we used the product of the total number of variants and the average MAF, which we termed the 'MAF index' .We wanted to use the MAF index to separate intraclonal variation (Fig. 4a top left quadrant) and multi-strain variation (Fig. 4a top right quadrant).The MAF index was higher for samples with both a large number of variants compared to reference as well as a high average MAF (Fig. 4a top right quadrant).The MAF index of mono-ST pools was not significantly greater than the MAF index of singles (Welch's t-test, P>0.01).However, within the pools, the multi-ST pools had a significantly greater MAF index than the mono-ST pools (Welch's t-test, P<0.01).
AFs were measured based on whether or not a given position in a read mapped to the reference, but our calculations did not take into account the possibility of multiple alternate alleles (which we assumed to be very rare given that the number of variants was only a small percentage of the chromosome).Therefore, we also calculated intra-sample nucleotide diversity between the true pooled sequences and our expected pools as an analogous method for measuring genomic heterogeneity.Using the InStrain software [49], we estimated nucleotide diversity (π), which is for each position, 1 minus the sum of the frequency of each base squared: 1 − [(frequency of A) 2 + (frequency of C) 2 + (frequency of G) 2 + (frequency of T) 2 ].This value was then averaged across the whole genome.We used InStrain to measure the average π across our singles, downsampled pools (four and two colony pools), expected pools and true pools.The π value was significantly greater in true pools compared to singles (Welch's test, P<0.01).However, in 216 pools (96 %), the diversity observed in the true pools was less than the expected diversity observed from a simulated mixture of two S. aureus isolates 30 000 SNPs apart in a 99 : 1 ratio (Fig. 4b, left black horizontal line; see Methods).This analysis provided evidence that most pools comprised only single strains.Moreover, similar to average MAF, the π of multi-ST pools were significantly greater than the π of mono-ST pools (Welch's test, P<0.01).In cases where there was an increased diversity value for our expected pools or downsampled pools compared to our true pools, we may have overestimated the diversity of our true pool by assuming that the single colonies were present in equal abundance.Alternatively, since we only pooled the remaining colonies on the plate after picking singles, we may have reduced some diversity from the pool by removing certain single colonies.In the 12 cases where the true pools had a diversity value greater than the corresponding expected pools (5%), the 8 colonies sampled did not capture the entire diversity of the pool.We have shown that the number of contigs, contamination, minor allele frequencies, and nucleotide diversity are significantly different between pools and singles (Figs 3 and 4).Next, we wanted to measure the magnitude of these parameters' contribution to the variability between pools and singles.
We performed principal component analysis (PCA) for five different parameters we measured (CheckM contamination, CheckM heterogeneity, MAF Index, nucleotide diversity and the number of contigs).PC1 and PC2 explained ~78 % of the total variance (Fig. S1a, available in the online version of this article).All five parameters had positive loadings in PC1 (>0.4) and the CheckM contamination score had positive loadings in PC2 (>0.6) (Table S1).This result suggested that the deviation of some pools from the singles was mainly due to contamination (PC1) and allelic variation (PC2).
We also performed an all-vs-all Pearson's correlation across the five different parameters mentioned above (Fig. S1b).We found that CheckM contamination, number of contigs and CheckM heterogeneity were positively correlated with each other, suggesting that contamination reduced the assembly quality (larger number of contigs).However, these three parameters did not have a high positive correlation with the MAF index or with nucleotide diversity.On the other hand, MAF index and nucleotide diversity were positively correlated with each other (Pearson r=0.77,Fig. S1b).This showed that contamination and allelic diversity can independently drive heterogeneity in the pool, and that pooling multiple colonies may impact on sequencing and assembly quality regardless of intra-species diversity.
From our analysis thus far, we showed that there are pools that provide information similar to pure singles, and there are true strain mixtures.As we mentioned earlier (Fig. 2), detecting multiple MLSTs in the pool or measuring pairwise SNP distances between singles from within a collection was a reliable way to ascertain true mixtures.However, when the MLST calls are unreliable (unassigned types/undetectable alleles) or hypothetically if we did not have single colonies, alternative methods would be required.Therefore, we wanted to test whether the above-mentioned parameters (number of contigs, CheckM contamination, CheckM heterogeneity, MAF index, nucleotide diversity) could serve as predictors for mixed pools and homogeneous pools.
We performed a logistic regression using the number of contigs, CheckM contamination and heterogeneity, the MAF index and nucleotide diversity as predictor variables to calculate the probability that a given pool is mixed (see Methods).Here, we defined a mixed pool or multi-ST pool as a pool with multiple ST calls, or a pool corresponding to a collection of singles where the maximum pairwise SNP distance is >2800 (0.1 % of the S. aureus genome).Our logistic regression model showed strong predictive ability with a McFadden R 2 of 0.59, a sensitivity of 1, a specificity of 0.94 and a receiver operating characteristic (ROC) curve with an area of 0.86 (Fig. S1c).The maximum variance inflation factor (VIF) across our predictor variables was <2.3, indicating low multicollinearity.Overall, these results showed that by using information from multiple statistics, the pool-seq data alone were sufficient to predict the presence of multi-strain populations with high accuracy.

numbers of variants in pool-seq and eight singles from the same sample are correlated but pool-seq had greater numbers
One of the advantages of pool-seq over groups of singles is the potential to discover mutant subpopulations that may be missing in samples of individual clones.We measured the number of variant positions that were shared between the pool and at least one of the eight singles.For each collection, we calculated the number of variant positions seen both in the pool and in at least one of the corresponding eight singles as a fraction of the total number of variant positions observed.For analysing variants in the singles, and the expected and downsampled pools that were built from singles, we only considered sites with an AF >0.95.We found that 152 collections out of 254 (~60 %) had a shared variant fraction >0.5, meaning that more than half the variants found in each pool and the corresponding singles were identical for 60 % of our samples (Fig. S2a).Curiously, we observed 30 collections (~12 %) with a shared fraction <0.05.This would be expected if these singles and pool-seq were not from the same sample (Fig. S2b).These 30 collections were not specific to a single ST, time point, or body site.Enrichment cultures (see Methods) were more abundant within these 30 collections compared to the rest of the samples (chi-squared, P<0.01).These collections may have been mis-sampled, and we opted not to use them for further comparisons of pools and singles within the same sample.This brought down our total number of collections from 254 to 224.Out of the 224 pool-seq samples, 204 were cases where the pool-seq and all 8 singles had the same sequence type.
We found that the number of variants found in the pools was greater than the combined number of variants from the 8 singles in 178 out of 224 samples (~79 %).This was as expected as the pools should more often contain more individual isolates than the collections.
To illustrate this point further we compared the number of variants detected in the pools against eight singles combined (expected pool), four random singles and two random singles combined (downsampled pools) and one random single.This was done to answer the question -how many variants would we have seen if we had sampled only eight colonies/only four/only two/only one?We considered a variant present in the expected or downsampled pools if it was present in at least one of the sampled singles at an AF >0.95.
We found that 198 pools (~88 %) captured more than 75 % of all the variants in the corresponding collection (Fig. 5a).This was significantly greater than the number of expected pools (129 pools or 56 %) that captured a fraction of variants>0.75 (Kolmogorov-Smirnov P<0.01).If we had only sampled one single colony for each collection, only 39 % of the singles would have captured a fraction of variants >0.75 (Fig. 5a -'one colony').
Although the number of variants observed in the pools was usually greater than in the singles, we found that the more singles a variant was present in, the more likely we were to detect the same variant in the pool.We counted the number of singles each variant was present in, plotted it against the AF of the same variant in the pool and found a strong positive correlation (Pearson r=0.83) (Fig. 5b).

numbers of segregating sites in pools and singles from the same sample are positively correlated
As a bacterial population expands from an introduction event, mutations accumulate as a function of time [60,61].Subpopulations can segregate from the parent population by accumulating nucleotide variants at different sites across the genome and the total diversity across different subpopulations can be altered by bottlenecks and selective sweeps [9].The number of segregating sites (or within-population polymorphic sites) can therefore be an important indicator of the demographic history of the population, and it would be useful to know how well the pool-seq data could be used to estimate this value.Because we compared pools and singles to a common reference, a certain number of variants were likely fixed in the ancestor of the population.We expected these to have an AF of 1 or close to 1 (0.95 or greater) in both the pools and singles and filtered them out.We also filtered out samples where the ST of the pools and collection did not match the ST of the auto-chosen reference, as this would lead to an elevated number of variants.This left us only with samples undergoing intraclonal variation.We found a moderate positive correlation between the number of segregation sites in the true pools and the expected pools of the 198 samples that had matched ST across Fig. 6.Allelic variation in pools and singles from the same sample were positively correlated.(a) The number of segregating sites in the true pools was proportional to the number of segregating sites in the expected pool.For mono-ST collections (collections where all eight singles, the pool and the auto-chosen reference were called the same ST), the number of sites with allelic variation was comparable between the true pools (y-axis) and the expected pool (x-axis) (eight singles combined).If the same site was fixed in all eight singles and in the pool, it was not included.Blue regression line depicts a linear relationship with a Pearson's r of 0.352.(b) AFs of variants in the expected pool did not reliably predict the AFs of the same variants in the true pool.Frequency distribution plot showing Pearson's r for all 198 mono-ST collections.x-axis depicts Pearson's r and y-axis depicts number of collections.
the pools, singles and the reference (Fig. 6a; Pearson r=0.352).The number of segregating sites in the collection ranged from 8 to 1658.While the number of segregating sites was comparable between the true and expected pools, we found that in general there were more segregating sites in the expected pools compared to true pools.The expected number of segregating sites may be overestimated because sequencing individual singles is more likely to detect rare variants than pool-seq.This is because the reads from each single are more likely to be unique, which increases the chances of detecting variants that are only present in that individual.Additionally, the expected pools are made under the assumption that the clonal genotypes are represented by only eight singles (see Methods), which is a simplification of the genetic complexity of the pool.
We also wanted to measure whether the proportion of the variants in the singles could reliably predict the proportion of the same variants in the pools.For each collection of sequences, we plotted the AF of the segregating sites observed in both the expected pool sequences and the true pooled sequences and calculated Pearson's coefficient (r).If the AF of a variant present in an expected pool was equal to the AF of the same variant present in the true pool, we inferred that the proportion of the variant in the two populations was comparable.In other words, the variant frequencies in the eight singles combined (for example, if the variant was present in seven out of eight singles, AF=0.875, if the variant was present in six out of eight singles, AF=0.75, and so on) was equivalent to the variant frequencies in the pool.In contrast, if the AF of variants between the expected and true pools was not comparable, the pool-seq was significantly different from the expected pools.The distribution of r values indicated only 82 collections (41 %) with r>0.5 (Fig. 6b).This result showed that the proportion of the variants in the singles was a positive predictor of the proportion of the same variant in the pool in slightly less than 50% of cases.
A median of one more AmR genes was detected in the pools compared to singles Finally, we wanted to know if the pools could harbour subpopulations with clinically relevant genes that may be missing in singles.We annotated AMR genes using AMRFinderPlus and counted the number of antimicrobial drug classes for which resistance determinants were found in our pools, individual singles and in the pangenome of our expected (all genes eight singles combined) and downsampled (all genes from two or four random singles combined) pools [36].In 177 collections (79 %), the number of AMR classes was identical in pools and the expected pool.This group of 177 collections (79 %) represented the bulk of the low-diversity samples in the study.However, overall, we observed a median of one additional AMR class in our true pools compared to the expected/downsampled pools and singles (Fig. 7, black vertical line).This showed that additional genes could be detected in the pool that are absent in the pangenome of the singles and that these genes can be of clinical relevance.Notably, all cases where we found mecA in the pools (134 out of 226 pools), we found mecA in at least 1 of the 8 corresponding singles.A summary file with all detected resistance determinants for all collections is reported in Dataset S2 (Supplementary Data Sheet 2) .
We initially used the total number of genes and the number of AMR classes in the pools as predictor variables in our logistic regression model in Fig. S1.We found that the number of genes was highly multicollinear with CheckM contamination (VIF >5) and the number of AMR classes had no predictive power (near identical AUC, accuracy, sensitivity and specificity with or without the AMR class parameter), and therefore did not include them in our final analysis.
Next, we wanted to measure the abundance of AMR genes present in both pools and the pangenome of the singles compared to AMR genes present in the pools alone.We hypothesized that in cases where an AMR gene was found in the pools but absent in the singles, the AMR gene was present at low abundances.To test this, we used Salmon [62] to estimate the abundance of AMR genes found in both pools and singles, and compared the abundances to when it was found in the pools alone (Fig. S3).We found that the mean copy number of genes belonging to a particular class of antibiotic was significantly lower when found only in the pools for eight out of nine AMR classes (Welch's t-test with Bonferroni correction).

DISCuSSIon
The primary purpose of this study was to compare sampling strategies for measuring intra-species diversity at different levels of resolution.We wanted to understand the cost-benefit trade-offs of sampling one or multiple single colonies per sample, and pooled populations.To do this, we compared pure single colonies and in silico single-colony mixtures (expected pools and downsampled pools), as well as true pools (Fig. 1).In 83 % of our samples there was only one S. aureus strain present (Fig. 2).However, the 17 % of samples with strain mixtures is a significantly high figure and shows that much S. aureus diversity would be missed by only sampling a single colony.Adding additional single colony genomes may be useful for certain studies focused on within-host variation and increases the chances of detecting strain mixtures, but there are diminishing returns for sampling new genetic diversity within homogeneous populations (Fig. 5a).
We derived insights into strategies for sampling the genomic diversity of S. aureus from asymptomatically colonized human skin and mucosal surfaces.Interestingly, there were no significant differences in the incidence of multi-ST populations across the three anatomic sites sampled [anterior nares, oropharynx (throat), inguinal skin], four different time points (at participant enrolment, or at 3 months, 6 months, 9 months and 12 months after enrolment), nor across different culturing methods (direct vs enrichment, see Methods).
The study had some limitations.First, the sampling methodology may in some cases have resulted in reduced diversity in the total pool by picking singles prior to pooling the remaining colonies.Our assumption in the design was that the population in a single colony may be present multiple times on a plate with hundreds of colonies, but this is less likely to hold true in scenarios where there were <10-15 colonies left after picking singles.This may lead to cases where the pooled population and the collection of singles are markedly different, as described in Fig. S2a.A 'minimum pool size' must be incorporated into the study design to mitigate this limitation.There are other limitations that are inherent with incorporating culture steps.Ideally, the number of colonies in a pool should be relatively constant across all samples, but in some cases very few colonies appeared on the selective agar.In other cases, the density of colonies was too great to measure colony counts accurately.Moreover, laboratory culture media could also cause biases in population growth.
Another limitation is that our sampling space is narrow -all our samples are from one geographical location, comprising only nares, throat and skin swab-acquired S. aureus.Therefore, the diversity we measured across singles and pools from a given sample may not apply to cultures from other clinical contexts of S. aureus or other S. aureus STs typically colonizing people worldwide.Despite these drawbacks, the data in this study allowed us to compare three strategies for sampling: individual single colonies, collections of up to eight colonies and pool-seq.
Sampling single colonies in pure culture is the traditional approach for assessing the genotypic and phenotypic characteristics of bacterial pathogens.Only one sequencing library is required and the bioinformatic analysis methods are straightforward.However, sampling only one colony will result in missing multi-strain infections (17 % of the time in this study) and therefore provide an overly simplistic view of the population structure of the pathogen.Strain-level characterization of these multi-strain populations would be an interesting and clinically relevant endeavour.However, it would require substantial additional testing with in silico and in vitro controls.The clustering thresholds and reference databases used can also affect the results, as we saw in our preliminary testing (Fig. S4), and as was also noted elsewhere [63].
The more single colonies sequenced, the better the estimation of true population diversity (assuming there are no systematic sampling biases in how colonies are picked from the culture plate).Having collections of single colonies also allows the construction of within-host phylogenies, observing gene gain/loss events and inferring demographic changes over longitudinal sampling.However, there is still no guarantee that the total diversity in the population is represented in the sample subset.Moreover, the cost of processing and physical and data storage scales linearly with the number of independent colonies sampled.Deciding the number of colonies to sample will be a complex calculus of budget and a priori estimation of the population diversity of the pathogen aligned with the goals of the study.
Ideally, pool-seq would provide the best estimation diversity in the population, and many more single colonies can be aggregated than sequenced individually.After pooling colonies, the stock can be treated as a single sample for storage, sequencing and analysis and therefore the cost is equal to that for one single colony, making pool-seq the best value for identifying variation.Here, we have shown that pool-seq can be used to accurately estimate the presence of multi-ST infections, can measure segregating sites within mono-ST populations (i.e intraclonal variation), can identify potential laboratory processing errors, and is most sensitive at finding AMR genes.The analysis of pool-seq data, which are effectively single-species metagenomes, is more complex than single-colony sequencing due to variation, especially in multi-ST samples, and the possibility of contamination.This heterogeneity also leads to unreliable phenotypic ascertainment.However, predicted phenotypes could be validated by replating the pool to pick singles.
Based on our analysis, we recommend that many studies may benefit from using a 'pool plus one' design, that is, sequencing one single colony plus pooling and sequencing all remaining colonies.Using the simple diversity measurements we show in this study, many of which are obtained from the default outputs of Bactopia, a streamlined beginner-friendly analysis pipeline, we can ascertain whether significant differences exist between the single colony and the pool (Figs 3 and 4).This can aid in deciding whether sequencing additional colonies from the pool is required.We note that we analysed the pools using conventional genome analysis methods typically used for single isolates.Despite this, we were able to identify metrics that reliably aid in differentiating between pools and singles.Pool plus one provides more information about the colonizing population than a single colony while demanding only one more unit of sequencing per sample.
The computation time to process one set of paired-end reads using Bactopia was 4 min on a 72 core server with 768 GB RAM.One sequencing run was approximately USD $70 per sample including laboratory supplies and storage.Assuming sequencing one pool incurs one unit cost (i.e.cost of time, labour and resources for sample preparation, storage, sequencing and analysis, which in our case was USD $70+4 min), every single colony added on top of the pool would incur an additional unit cost.For our dataset, this additional unit cost over the pool yielded a median of 19 new variants and 0 new AMR gene classes (Fig. 8).Moreover, we also showed that the pool alone is sufficient to predict the presence of a multi-ST population (Figs 3, 4 and S1).In this study, if we were only interested in identifying multi-ST had adopted a pool+one strategy instead of pool+eight, we would

Fig. 1 .
Fig. 1.Schematic representation of colony collection strategy, names and descriptions of isolate groups analysed in this study.(a) Diagram depicting the number of samples in this study.From 85 study participants, we collected a total of 254 culture samples.For each collection, we obtained eight single colonies, and pooled the remaining colonies on the plate.(b) Diagram describing the terminology for specific isolate groups used in this study.

Fig. 2 .
Fig. 2. Pairwise SNP distance between and within collections.(a) Box plots showing per-collection SNP distance distributions.For each collection shown on the y-axis, the x-axis shows the corresponding distribution of core genome SNP distances in log scale.Black vertical lines show the median SNP distances and boxes show the interquartile range.Whiskers represent values up to 1.5 times the first or third quartile.Black dots represent outliers beyond the whiskers range.(b) Bar plot showing number of genomes per ST.Multilocus sequence typing was performed by the software tool mlst (see Methods).The x-axis shows the number of isolates assigned to the corresponding ST shown on the y-axis.(c) Bar plot showing the number of STs detected per participant.Multilocus sequence typing was performed for all eight singles from a participant and the number of unique STs detected per participant was plotted.(d) Maximum-likelihood phylogeny of 296 isolates, representing at least 1 isolate from all collections.All non-identical genomes from each collection were aligned by snippy and a core genome phylogeny was constructed using IQ-TREE (see Methods).Tree tips are coloured by ST; only the top 10 most frequent STs are shown, and the remaining ones are grouped into 'other'.

Fig. 3 .
Fig. 3. Assembly quality can be used to assess population heterogeneity.(a) There was no significant difference in the assembly coverage between pools and singles.Violin plot showing distribution of assembly coverage between pools and singles.Assembly coverage for each pool and single was calculated by Bactopia against an auto-chosen reference (see Methods).Circles indicate mono-ST collections and triangles indicate multi-ST collections.Dark red points indicate pools and white points indicate singles.Multiple points stacked on top of each other may appear darker than isolated points.(b) Pool assemblies were more likely to have a higher number of contigs than single assemblies.Violin plot showing distribution of number of assembly contigs in pools and singles.Pooled samples were processed identically to singles with Bactopia using SPAdes.Circles indicate mono-ST collections and triangles indicate multi-ST collections.Dark red points indicate pools and white points indicate singles.Multiple points stacked on top of each other may appear darker than isolated points.(c) Pooled samples have varying sources of contamination while singles are pure.CheckM contamination and heterogeneity scores showed that all single colonies have no contamination, while 6 % of pools are contaminated by phylogenetically distant sources and 3 % of pools are contaminated by phylogenetically similar sources.The black vertical line marks a heterogeneity score of 50, below which the source of contamination is considered phylogenetically distant and vice versa.Circles indicate mono-ST collections and triangles indicate multi-ST collections.Dark red points indicate pools.Multiple points stacked on top of each other may appear darker than isolated points.

Fig. 4 .
Fig. 4. (a) The MAF index could be used to assess multi-ST pools.Dot plot depicting the number of variant positions and the average MAF for mono-ST (circles) and multi-ST (triangle) pools.The x-axis indicates the number of variant positions compared to the corresponding reference.The y-axis indicates the average minor allele frequency (MAF).The average MAF was calculated by summing the MAFs of all intermediate alleles and dividing by the total number of variant positions.Red dots correspond to mono-ST pools and triangles correspond to multi-ST pools.The black horizontal line indicates an average MAF of 0.1.The black vertical line indicates 0.1% of the S. aureus genome (2800 sites).The frequency of the dots at their corresponding x and y positions is indicated by the histogram above the x-and to the right of the y-axis respectively.(b) Average nucleotide diversity suggested most pools comprise single strains.94% of pools had nucleotide diversity less than a theoretical 99:01 mixture of two strains.LEFT: Dots and colours indicate average nucleotide diversity value for each pool, expected pool (reads from eight singles combined in equal proportions), downsampled pools (reads from four and two random singles combined in equal proportions) and all 2032 singles.Grey dashed lines connect corresponding samples.Black solid horizontal lines indicate the average nucleotide diversity value for in-silico mixtures of two S. aureus genomes 30,000 SNPs apart.The ratio of each mixture is indicated over each solid black line.The frequency of the dots at their corresponding x positions are indicated by the histogram to the right.

Fig. 5 .
Fig. 5. (a) Pools captured more variants than eight single colonies combined.Each bar indicates a collection, and the height of the bar indicates the fraction of variants found in the corresponding sample group (pools, expected pools, four colony pools, two colony pools, single colony) to the total number of variants found in all samples in the collection (pool plus all eight singles).For example, a bar with height 0.25 in the fifth row (one colony) shows that if one random single colony was examined from the specific collection corresponding to the bar, we would find 50% of the total number of variants found in the collection (pool plus all eight singles).Bars for each sample group are ordered by lowest to highest.A value of one indicates 100% of the variants found in both the pools and all eight singles combined are represented in the sample group.(b) Allele frequencies in the pool were proportional to the number of singles the variant was detected in.Boxplots showing allele frequencies of variants detected in zero singles up to eight singles.Allele frequency of each variant found in the pool increased as the variant was found in more colonies in the corresponding singles.Boxes show the interquartile range and whiskers represent values up to 1.5 times the first or third quartile.White dots represent outliers beyond the whiskers range.Black horizontal line in each boxplot indicates the mean.

Fig. 7 .
Fig. 7.A median of one additional AMR class can be observed in the pools compared to singles.Ridgeline plot showing number of AMR gene classes detected in pools, the pangenome of expected and downsampled pools (pangenome of eight, four and two singles combined), and a random single colony.The x-axis shows the number of AMR classes detected in the sample by AMRFinder and the y-axis shows the corresponding sample.Black vertical line shows the median number of AMR classes detected for each sample group.White circles under each ridgeline represent individual collections and the number of AMR classes detected.

Fig. 8 .
Fig. 8. Diminishing returns in the number of new variants or new AMR genes observed with the addition of more sequencing runs.Dot plot depicting the number of new variants (a) or new AMR genes (b) observed for additional sequencing runs.Red dots depict the first sequencing run being the pool, and the additional runs being single colonies (1, pool; 2, pool+one single; 3, pool+ two singles…).White dots depict only singles (1, one single; 2, two singles,…).