Benchmarking bacterial genome-wide association study (GWAS) methods using simulated genomes and phenotypes

Genome Wide Association Studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, making it challenging to separate true “hits” (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically evaluated. Here we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination, and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely-used univariate GWAS approaches (cluster-based, dimensionality-reduction, and linear mixed models, implemented in PLINK, pySEER, and GEMMA) and one relatively new whole-genome elastic net model implemented in pySEER, across a range of simulated sample sizes, recombination rates, and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of clustering and dimensionality reduction approaches to correct for population structure were considerably variable according to the choice of parameters. Notably, the elastic net whole-genome model was consistently amongst the highest-performing methods and had the highest power in detecting causal variants with both low and high effect sizes. Most methods reached good performance (Recall > 0.75) to identify causal mutations of strong effect size (log Odds Ratio >= 2) with a sample size of 2000 genomes. However, only elastic nets reached reasonable performance (Recall = 0.35) for detecting markers with weaker effects (log OR ∼1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to univariate models. However, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for whole-genome models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods. Author summary Microbial populations contain measurable phenotypic differences with important clinical and environmental consequences, such as antibiotic resistance, virulence, host preference and transmissibility. A major challenge is to discover the genes and mutations in bacterial genomes that control these phenotypes. Bacterial Genome-Wide Association Studies (GWASs) are family of methods to statistically associate phenotypes with genotypes, such as point mutations and other variants across the genome. However, compared to sexual organisms such as humans, bacteria reproduce clonally meaning that causal mutations tend to be strongly linked to other mutations on the same chromosome. This genome-wide linkage makes it challenging to statistically separate causal mutations from non-causal false-positive associations. Several GWAS methods are currently available, but it is not clear which is the most powerful and accurate for bacteria. To systematically evaluate these methods, we developed BacGWASim, a computational pipeline to simulate the evolution of bacterial genomes and phenotypes. Using simulated genomes, we found that GWAS methods varied widely in their performance. In general, causal mutations of strong effect (e.g. those under strong selection for antibiotic resistance) could be easily identified with relatively small samples sizes of around 1000 genomes, but more complex phenotypes controlled by mutations of weaker effect required 3000 genomes or more. We found that a recently-developed GWAS method called elastic net was particularly good at identifying causal mutations in highly clonal populations, with strong linkage between mutations – but there is still room for improvement. The BacGWASim computer code is publicly available to enable further comparisons and benchmarking of new methods.

available, but it is not clear which is the most powerful and accurate for bacteria. To systematically 50 evaluate these methods, we developed BacGWASim, a computational pipeline to simulate the 51 evolution of bacterial genomes and phenotypes. Using simulated genomes, we found that GWAS 52 methods varied widely in their performance. In general, causal mutations of strong effect (e.g. those 53 under strong selection for antibiotic resistance) could be easily identified with relatively small 54 samples sizes of around 1000 genomes, but more complex phenotypes controlled by mutations of 55 weaker effect required 3000 genomes or more. We found that a recently-developed GWAS method 56 called elastic net was particularly good at identifying causal mutations in highly clonal populations, 57 with strong linkage between mutations -but there is still room for improvement. applied to Single Nucleotide Polymorphisms (SNPs) and K-mers (i.e. DNA words of length K) in 78 microbial genomes has identified mutations and genes associated with antibiotic resistance (4-10), 79 cancer (11), virulence (2,12,13) and host preference (14). In contrast to human GWAS, however, 80 bacterial association mapping is technically challenging due to the unique characteristics of bacterial 81 populations, and optimal approaches for bacterial GWASs have yet to be established.

82
The objective of a GWAS method is to maximize statistical precision and power, in order to identify 83 true causal genomic elements while ignoring spurious associations. To do so, GWAS methods must

176
We next investigated how GWAS power varies across the four methods to correct for population 177 structure: univariate models including cluster-based approaches, dimensionality reduction, and linear 178 mixed models, and the whole-genome model using elastic nets. In univariate models, the correlation 179 between each of the markers (SNPs or K-mers) and the desired phenotype is investigated separately. 180 Therefore, the covariance between the markers due to population structure needs to be included     (Figure 5a). In FaST-LMM, the phylogeny-based correction for population structure 255 outperforms the genotype matrix, mainly due to a boost in precision (Figure 5a). However, in this case 256 the 'true' phylogeny (known from the simulation) was used, but must be estimated in real 257 applications. Therefore, the accuracy of a phylogenetic correction might be lower depending on the 258 choice of methods to construct the phylogenetic tree. FaST-LMM has slightly higher recall (power) 259 than GEMMA, and its advantage was most pronounced for variants with low effect sizes (Figure 5b).

299
More encouragingly, causal variants can be detected with relatively high power and precision in 300 higher-recombining populations, akin to S. pneumoniae. These results highlight the importance of 301 assessing the LD landscape of the target organisms before deciding on a sample size and GWAS 13 design. It also suggests significant room for improvement in GWAS method development, 303 particularly for highly clonal bacteria. Promising methods include phyC (9) and treeWAS (15) which 304 detect homoplastic (convergent) mutations along a clonal phylogeny, providing strong control over 305 population structure given accurate phylogenetic tree. Due to the high computational burden of these 306 tests, especially with large sample sizes (35), we did not evaluate these methods here. Moreover, 307 although homoplasies do occur in our simulations (Figure 1b), their rate is not explicitly controlled 308 and thus their impact is hard to assess. In the future, BacGWASim could be extended to explicitly 309 model homoplastic mutations and to assess the performance of homoplasy-based methods.

310
Of the GWAS methods evaluated here, the elastic net whole-genome model generally performed best, 311 followed by the mixed model approach implemented in GEMMA. The clustering approach 312 implemented in PLINK also performed well, but varied significantly depending on the clustering 313 threshold, which can be challenging to optimize. Elastic nets implemented in pyseer also provide the 314 possibility to perform a GWAS on k-mers or unitigs, whereas this is not as easily done in GEMMA or 315 PLINK. Although elastic nets had the highest precision of the methods tested, there is significant 316 room for improvement, as mentioned above.

317
Our results also help explain the success of early bacterial GWASs with low sample sizes. We found 318 that, in a high-recombining population, a sample size ~1,000 is sufficient to reliably detect causal 319 variants of strong effect (log OR >= 2) with high power (>0.80). Such strong effect sizes may be 320 common for antibiotic resistance mutations, and other variants under strong positive selection.

321
However, samples sizes >3,000 will likely be needed to detect variants of lower effect (log OR ~1), 322 which may be more common for more 'complex' phenotypes with lower heritability. Of course, the 323 sample size required to achieve a desired power will vary depending on the recombination rate and 324 population structure of the organism of interest. BacGWASim thus provides a tool for study-specific 325 power calculations.

326
In an age of a rapidly growing array of options for performing GWASs, we hope that our results are 327 instructive in quantifying general trends and sources of bias, and that our simulation platform can 328 continue to be used to benchmark novel methods as they appear. For example, machine learning has 14 recently been used in bacterial GWAS, with successful application in highly clonal species such as 330 Mycobacterium tuberculosis (25-27,38). These methods have the advantages of being highly versatile 331 with high computational efficiency. Although not all artificial intelligence methods are currently 332 packaged for the purpose of bacterial GWAS, our findings suggest great potential for whole-genome 333 models to further fine-tune bacterial GWAS. Our simulation platform also provides a means to 334 evaluate and benchmark their performance against the traditional methods as they emerge.

394
To measure the range of LD in real bacterial species, genome data were retrieved from the CARD 395 database (48), SNPs were called using Snippy (49) and linkage levels were measured in 1,000 396 markers using Haploview (50) (Figure 2).

398
In every set of simulations, 100,000 randomly selected markers with minor allele frequency >0.01 399 were retained for GWAS analysis. An equal number of markers was selected in each simulation to 400 make them comparable without any confounding effect of multiple testing across replicates. For each 401 genome simulation, ten sets of randomly chosen markers were then used to generate ten replicate 402 phenotype simulations. Using the called variants and phenotype labels as benchmark datasets, we then 403 used the following methods to perform GWAS: