Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions

Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.

overrepresentation of samples of European ancestry in genotype databases. This has led to polygenic risk scores, which were originally thought to be highly accurate predictors of disease liability, to have significantly lower accuracy in non-European ancestry samples, which make up most of the global population (24). Additionally, these methods are unable to deal with missing input data. Variant calling, either by read pileup to generate single nucleotide polymorphisms (SNPs) or from a graph, when conducted separately for each population, is likely to produce disparate input sets, and with very different minor allele frequencies. Without using a method which produces consistent variant calls in test data sets, the accuracy of predictive models is likely to be heavily overstated. As well as methodological approaches, more representative sampling of the pathogen population which does not oversample clonal lineages may be advisable.
Here, we set out to develop a method which combines the desirable attributes of both of these classes of approaches when analyzing the genetics underlying bacterial traits. We wished to retain the simplicity and interpretability of traditional GWAS approaches and combine this with the flexibility and accuracy of machine learning methods which can be fitted to the entire data set at once. These models were previously applied to human GWAS data sets for inference (25)(26)(27) and prediction (28).
This pangenome-wide approach reflects the polygenic nature of complex traits better than older fixed effect methods which must select only some population structure covariates to include (29). Unlike marginal tests (the standard single predictor test in GWAS), a genome-wide regression approach gives rise to an increase of resolution when sample size increases, as was previously noted in human GWAS (30)(31)(32). Additionally, simultaneously analyzing predictors together in a regression model means that interactions and correlation between the predictors (e.g., population structure) may be included implicitly (33).
Using large genomic data sets from four different species and sixteen varied phenotypes, we find that an elastic net model (33) selects similar variants to a GWAS and does not sacrifice its major advantage of quantitative model interpretability. Using simulated data, we demonstrate improved power, but an increase in false-positive rate, compared to that of linear mixed models. We illustrate this use in practice on antibiotic resistance phenotypes in two species and show further results which find similar accuracy between new machine learning and simpler approaches, consistent with previous studies (4,5,34).
Our approach models the entire pangenome of the population to include the large proportion of variation which resides in the accessory genome. It explicitly addresses issues of population structure and consistent performance between trained and new (test) data sets. The method is broadly usable, not requiring programming knowledge or manual adjustments for new data sets, and allows for the sharing of models between researchers. We have implemented the elastic net model in the pyseer microbial package as a new "prediction" module and consistent pangenome variant calling in two further packages. An extensive tutorial for all of these methods is available online (https://pyseer.readthedocs.io/en/master/predict.html).

RESULTS
Method overview. The elastic net uses a linear prediction model as with a standard linear regression run between a phenotype and all genetic variants and tries to find slopes (effect sizes) for all the variants which best predict the phenotype. A shrinkage term () is used to prevent overfitting, adding a cost to each fitted slope proportional to its value. This has the effect of making many of the genetic variants have a slope of zero, and so they can be removed from the model entirely. To predict the phenotype in new samples, these fitted slopes form a simple linear model that can be applied to a new set of input variants. Furthermore, this also allows the calculation of R 2 , the variance in phenotype due to the variance of all genetic effects-the heritability (total effect of genetics) of the phenotype.
We use an alignment-free representation variation as input. Alignment-free approaches have proven particularly popular in bacterial populations, removing the need for selection of a particular reference and simultaneously modeling both gene content and sequence variation (11,12,35). Many methods have previously used k-mers, which are short sequence words of length k. The DBGWAS method proposes connecting the overlaps of these k-mers in a compressed de Bruijn graph (DBG) so that k-mers are extended using adjacent sequence information in the population, forming unitigs present in exactly the same set of samples as their constituent k-mers (36). We followed the same approach here.
As the elastic net includes predictors that are correlated due to population structure in the same model fitting, it will typically downweight these, to some extent incorporating this evolutionary history into its model. However, as we note below, effect size alone does not have the same interpretation as a P value from GWAS and must be considered along with minor allele frequency (MAF) and other confounders. To include a population structure explicitly in the model, we first divide the population into strains (or lineages). Each sample's contribution to the fitting is then downweighted by the prevalence of the strain in the elastic net so that repeated observations of the same genotype count for less, known as "sequence reweighting." We also use this to select the value shrinkage term . We designed a "leave-one-strain-out" (LOSO) crossvalidation rather than randomly leaving samples out. This aims to avoid correctly predicting the strain (which is frequently correlated with phenotype) rather than the phenotype itself, since such an approach is less robust when common strains dominate the data or when the fitted population model is not representative of a population in which predictions will be made. When applied together, we refer to this as a "weighted" model, as opposed to "unweighted" models which use neither of these adjustments.
Prediction within and between cohorts without sacrificing model interpretability. Whole-genome models can be used to construct a linear model to predict phenotypes in new data. In this section, we evaluate these predictions compared to those of other models and variant calling methods using a variety of data sets and phenotypes.
(i) Mycobacterium tuberculosis resistance is equally well predicted by the elastic net. We first evaluated the predictive performance of our models, with and without population structure control, compared to that of a more complex deep learning model. We used an M. tuberculosis data set with antibiotic resistance to four first-line antibiotics (rifampin, isoniazid, ethambutol, and pyrazinamide). As M. tuberculosis has no accessory genome and minimal core gene variation (37), comparison with more complex models and a SNP alignment is possible. Previous work has evaluated the use of a multitask deep neural network and, when comparing this to lasso regression, found comparable accuracy (5). Using the same input of ϳ6,500 SNPs and short insertion/ deletions across the allele-frequency spectrum for 3,566 samples (split into training and test data sets) led to average false-negative rates of 2% Ϯ 3% in the unweighted model and 3% Ϯ 4% in the weighted model and false-positive rates of 11% Ϯ 8% in the unweighted model and 12% Ϯ 10% in the weighted mode. The elastic net therefore gives similar performance to the lasso as well as the more complex neural network (see Table S1 in the supplemental material), as was also shown by the original study authors (5). It is, however, much easier and faster to run on standard hardware (run time of Ͻ1s on a central processing unit [CPU] versus ϳ3 min on a high-end graphics processing unit [GPU]) and gives results which are far more readily interpretable.
In this case, the weighted model generally performs slightly worse than an unweighted model. The population structure of this sample is relatively simple, with four distinct lineages. This is likely well captured implicitly by the unweighted model, and so the categorical weighting is of lower resolution. Sequence reweighting is instead expected to be more effective in data sets with more complex structures or when adding in samples which are genetically distant from the training set (38,39), which we explore further in the next section. Applying these weights allowed us to easily see that the majority of errors occurred in lineage I, which has deep branches forming genetically separated subclades, with generally perfect prediction in the other three lineages.
(ii) Prediction of pneumococcal resistance using different variant types. We also investigated the advantages of the use of unitigs over other variant calling methods.
Using the same Streptococcus pneumoniae antimicrobial resistance in children (SPARC) data set described above for ␤-lactam and erythromycin resistance, we compared computational resources and prediction accuracy using SNPs, k-mers, and unitigs (Table 1).
We found that for ␤-lactam resistance, all three variant types gave similar predictive accuracy, with the elastic net able to select a small proportion of the total input variants in each case and apparently fairly insensitive to the far greater noise present in the higher dimensional variant types. As this resistance is due to allelic variation in core genes, we expect all three types to tag the causal variation equally well. For erythromycin, where causal variants are not all found in core genes, we observed a reduction in the false-negative rate when using unitigs. Computational usage increased roughly as NM (N, number of samples; M, number of variants). For common variants, M reaches an asymptote for a given population: the main requirement is therefore based on N. For all methods, the CPU time was modest, but memory usage may pose a problem. SNPs are tractable on a laptop, but unitig analysis likely requires a computing cluster for the model fitting (using a fitted model on test data requires negligible resources). k-mers require an enormous amount of memory, which would not scale to larger data sets. Though the unitig analysis was easy to schedule on our cluster, future improvements to reduce memory use could include accessing the variants as they are needed from a disk or fitting the elastic net in chunks, with resampling (40,41).
(iii) Reduced intercohort accuracy is ameliorated with consistent genetic calls and population structure control. Random splits of single data sets in test and training data, while convenient for analysis, may mask inter-data set differences such as class imbalance (different resistance rates), unobserved lineages, and technical errors (variant calling) (10). To test a more realistic example, where a previously fitted model is used to predict resistance status in new unobserved data, we set up a prediction experiment using genomic data from three large very different pneumococcal cohorts with ␤-lactam resistance: SPARC (603 U.S. children covering introduction of vaccine); Maela (3,162 unvaccinated infants and mothers); global pneumococcal sequence (GPS) (5,820 globally distributed samples, mostly vaccinated). We counted unitigs for each population and used these to train a predictive model. These models were evaluated on the data they were trained on and on the other two cohorts by using consistently named unitigs from unitig-caller ( Table 2). The resources used were as follows: SPARC, 5 Gb random access memory (RAM), 0.6 h; Maela, 30 Gb RAM, 2.5 h; GPS, 3.1 h. The majority (ϳ80%) of the CPU time used was for reading variants from text files, making subsequent fitting faster. The distributions of unitig sizes are shown in Fig. S1.
Between-cohort predictive accuracy was considerably lower than within-cohort accuracy but still outperformed an intercept-only model in all cases. The use of unitigs proved successful: repeating the SPARC-Maela comparisons with SNPs led to extremely poor predictions for every sample, as the selected SNPs were called as missing in the other cohort, leading to a mean value prediction for every sample (true negatives, 1,661; false positives, 0; false negatives, 1,282; true positives, 0). To fix this issue with SNPs would likely require a labor-intensive mapping and joint recalling of variation, whereas with sequence elements, the simple search implemented in unitig-caller can Depending on the specific model and data set combination, errors can much more commonly be type I or type II, possibly reflecting class imbalance, despite overall resistance rates in the pneumococcus being stable (42). The GPS cohort gave the worst performing model, despite it being the largest collection. This is a very genetically diverse sample, which introduces more potential for confounding lineage effects to enter the model. Furthermore, this cohort is a mix of sequences isolated from cases of asymptomatic carriage and disease, whereas SPARC and Maela contain only asymptomatic carriage cases. The GPS cohort is enriched with more-virulent strains, which have more frequently faced treatment with antibiotics and have a higher rate of resistance (1,43).
We also note that the area under the curve (AUC) of the receiver operating characteristic (ROC) is misleadingly high (0.9185/0.9728 for the weighted/unweighted GPS model) and would encourage the reporting of error rates as more intuitive summaries of accuracy for bacterial traits such as resistance.
We found that sequence reweighting generally reduced prediction accuracy for this phenotype, although it is the LOSO strategy in particular which gave slightly more representative accuracy estimates for out-of-cohort prediction (when comparing within data set with between data set prediction, R 2 was 72% higher with sequence reweighting versus 82% higher without sequence reweighting), and more of the selected variants were in the causal loci.
(iv) Virulence phenotypes can be predicted with sequence reweighting, preventing overestimation of accuracy. Most work on prediction of bacterial phenotypes has focused on antibiotic resistance, but many more complex phenotypes relating to bacterial virulence are now available. For these phenotypes, which are under weaker or no selection, instead of a few strong effects, multiple smaller effects are expected in the genome (44)(45)(46). Therefore, a model which may include more of these effects, which would be missed with a P value threshold, may be expected to perform well.
We applied our method to predict the duration of asymptomatic carriage in a subset of the Maela cohort, which can easily be visualized in the manner of a linear regression (Fig. 1). We show the observed versus predicted values for the training and test sets, both with and without sequence reweighting. In the unweighted training set, R 2 (and heritability [h 2 ]) was 0.89 ( Fig. 1, top left), but the test R 2 was only 0.27 (Fig. 1, bottom left), showing clear overfitting. With sequence reweighting and LOSO, the training and test estimates were much closer (0.37 and 0.28, respectively) ( Fig. 1, right). In this case, sequence reweighting gave a more realistic heritability estimate. h 2 was previously estimated to be 0.634 using phylogenetic pairs, and 0.445 using restricted maximum likelihood (REML)-these may be overestimates, especially as the revised estimates introduced here used more information from the genome. We also tested virulence prediction in two streptococcal species, which would be a useful application for routine pathogen surveillance but has not been as thoroughly explored as resistance prediction. Using Streptococcus pneumoniae isolated from Dutch adults, we fitted a model which selected 9,701 unitigs. This model was able to predict meningitis versus carriage genomes (test FPR, 0.059; FNR, 0.12) and gave a similar h 2 estimate to that originally reported (0.65 versus 0.70). Comparing tissue infection with carriage of Streptococcus pyogenes gave a model with 5,817 unitigs, which had a higher error rate than the S. pneumoniae model (test FPR, 0.24; FNR, 0.25), but the phenotype also had a correspondingly lower h 2 estimate of 0.343. Detailed performance of these models is given in Table S2.
Power and false-discovery rate compared to GWAS using simulated phenotypes. To test the characteristics of the elastic net compared to GWAS approaches, we simulated phenotype data from the Maela population (3,162 S. pneumoniae genomes) ( Table 3) using previously defined SNP variation (47). We chose either 5, 25, 100, or 300 true causal variants with an effect size of 4 (similar to a penetrant antimicrobial resistance variant) either • chosen uniformly at random across the genome, after linkage disequilibrium (LD) pruning (no variants with R 2 Ͼ 0.9).
• chosen uniformly at random from 1 to 3 prespecified genes (pbpX-pbp2x, penA-pbp2b, and pbp1a).  Each data set has a name by which it is referred to in the text. Most data sets have multiple phenotypes available, especially where multiple different antibiotic resistances are routinely phenotyped. Data sets without a training/test split were not evaluated for internal prediction ability as they were instead used with more stringent external validation data sets or were used for GWAS only, and all available samples were used to fit the model. b NU, not used.

Lees et al.
We chose the first setting to emulate a polygenic trait, with many variants of roughly equal effects associated across the genome. LD pruning was only used to select causal variants, and all variants were used as input to the model. The second setting more closely resembles antibiotic resistance, where multiple alleles in either one or a small number of genes contribute to the effect, with multiple occurrences independent of genetic background. We ran the elastic net (␣ ϭ 0.01) and lasso regression (␣ ϭ 1) as well as both GWAS models (fixed effects and linear mixed model) previously implemented in pyseer. Variants were output by the genome-wide model if they had a nonzero coefficient and by the GWAS models if their P value exceeded a significance threshold of 0.05 after Bonferroni correction. For each simulated data set and method, counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), we calculated the power, the proportion of true causal variants in the output: TP/ (TP ϩ FN). This allowed us to analyze the overlap of selected variables which gave good prediction with those found in a GWAS, which are individually associated with a phenotype of interest and may therefore provide mechanistic insight into the trait being analyzed. We also calculated the proportion of false positives in the output, the number of variants selected in the output which are not true causal variants divided by the total number of variants being tested: FP/(TP ϩ FP ϩ FN ϩ TN). This is a measure of the "noise" in the selected predictor set. First, our simulations were able to show that using the correlation filtering step (reducing input size by 25%) reduced power on average by 4% and 8% in the worst case (see Table S3), where many small-effect variants are spread across the genome and with no appreciable power loss with smaller or more concentrated causal variants. The sample correlation values are positively skewed due to population structure, and so filtering all variants with a sample correlation below the mean value rather than a quantile leads to an unacceptably high loss of power, as many causal variants would be removed. This quantile filter can therefore be used on large data sets to reduce CPU and memory usage with little effect on the variants selected, but if possible, the full set of variants should generally be modeled in its entirety.
Over all of our simulations, we found that either the elastic net or a fixed-effect GWAS had the highest power depending on the setting, and both always had higher power than the linear mixed model ( Fig. 2 and Fig. S2). This is consistent with expectations from prior literature in multicellular organisms (25,26,40). The elastic net performed better in situations where the heritability was low or causal variants were spread out across the genome. This is expected for less penetrant traits such as carriage duration (44) and transmissibility (48). There was slightly lower power for all methods with binary phenotypes, and this decrease was more pronounced in the linear mixed model, possibly due to being the only model that used a Gaussian error structure in both settings.
However, in exchange for reduced power, the linear mixed model consistently showed the best control of false positives in all settings, always Ͻ5%. In contrast, the fixed-effect model had a much greater false-positive rate than any other method, which grew both with sample size and heritability. The elastic net's false-positive rate was typically Ͻ5% and was robust across the ranges of heritability tested, though it increased slightly with larger sample sizes as more variants were included in the fitted model. With such a large number of variants, even a small false-positive rate can be problematic, and so combining selected variants with a ranking by P value from a GWAS is important. It is possible to do this with least angle regression (49) in lasso regression, but due to the large number of variants, a P value from GWAS is most convenient.
We also tested lasso regression on the same footing, which is the other extreme of the ␣ setting in the elastic net. For smaller numbers of causal variants, performance was similar to that of the linear mixed model both in terms of power and false-positive rate, in some cases having slightly higher power. However, when the number of causal variants was higher, the amount of sparsity introduced was too high, reducing power below that of other methods (though false-positive rate was low in all settings for the same reason). As the number of causal variants is generally not known a priori, we would therefore always recommend the elastic net with a small ␣ over the lasso.
These results show that the variants selected by the elastic net are causal at similar rates to those with GWAS methods. The elastic net's selected variants can be used as an effective trade-off between the regimes of the two commonly used GWAS models, having higher power than the linear mixed model and a lower false-positive rate than fixed-effect models. As many bacterial GWAS results must be followed up with lab work, these results suggest a dual approach of variable selection with the elastic net followed by ranking results with the linear mixed model may be useful when the first set of variables selected is large. This is possible in a single step in pyseer.
Whole-genome model of bacterial phenotypes enables heritability estimation and combination with association mapping. (i) Variant selection for pneumococcal antimicrobial resistances. Next, we tested whether our method selects causal variants for some phenotypes where these are known and compared our results to those from GWAS approaches. First, we analyzed a well-studied phenotype and data set: sensitivity/resistance to ␤-lactams in the SPARC cohort of 603 S. pneumoniae  genomes. Resistance is mainly conferred by allelic variation of three genes (pbp1a, pbp2b, and pbp2x), which are easily detected by most GWAS and machine learning methods with SNP calls as input (47,50) Though the regions are always correctly identified, the specific variants detected are not identical between methods (51). Figure 3 shows the results of this analysis. With both the elastic net and linear mixed model plus cutoff, penA-pbp2b and pbpX-pbp2x are clearly the strongest hits. pbp1a is also selected by both methods, and while it can be seen on both Manhattan plots, it is slightly clearer in the gene summary plot for the elastic net, due to the larger number of SNPs selected in the gene. Taking hits with a P value above a threshold results in a very clean result for the linear mixed model (LMM) with this strongly selected phenotype; only a few noncausal genes are included, usually with only a few SNPs and much lower ranking than the causal genes. The elastic net selects many more noncausal SNPs across the MAF spectrum, though combining with P values and number of hits allows these to be effectively filtered. It should be noted that effect size does not appear to be an effective filtering criterion without taking into account minor allele frequency, which may have implications for other machine learning methods where a P value cannot easily be integrated. Both methods calculate comparable heritability estimates for this trait, for the LMM, h 2 ϭ 0.89, and for the elastic net, h 2 ϭ 0.81.
We then undertook a more challenging analysis, using unitigs to investigate antimicrobial resistances in the larger Maela cohort (3,162 S. pneumoniae genomes). We previously attempted GWAS on this data set using a fixed-effect GWAS model in the original description of our SEER software (12). We did not analyze tetracycline resistance or chloramphenicol resistance, as these are driven by single elements which were easily detected with the previous method. We instead used trimethoprim and erythromycin resistances. Trimethoprim resistance is expected to have two causal loci (folA-dyr and folP) due to being administered jointly with sulfamethoxazole as co-trimoxazole (52). Indeed, using our method on trimethoprim, the two causal loci are clearly identified and are the most highly ranked on a Manhattan plot (Fig. S3); applying sequence reweighting makes little difference to the result. Erythromycin resistance has multiple causal mechanisms (ermB, mel, and mef) which were not easily found in our previous attempt. Again, the erythromycin results contained many peaks in their Manhattan plots (see Fig. S4). When we mapped the unitigs directly to resistance genes, we found significant results in ermB (9 hits, minimum [min] P ϭ 10 Ϫ47 ), mel (12 hits, min P ϭ 5 ϫ 10 Ϫ42 ), and mef (6 hits, min P ϭ 5 ϫ 10 Ϫ42 ). While this was clearly more successful than our previous analysis, when considering the noise when mapping to a single reference, these causal mechanisms would not stand out (see Fig. S5). So, while our method reduced the computational burden of the analysis through the use of unitigs, it was not able to easily resolve the causal mechanisms in this challenging example. This suggests that both single variant tests and whole-genome models would struggle to arrive at true causal predictions under such circumstances. More flexible black box machine learning type approaches may help to improve prediction accuracy in these cases but were outside the scope of this study due to difficulty in interpreting the models in terms of causal variants.
(ii) Heritability and mapping of gonococcal resistance. We also applied our method to a combined cohort of 1,595 Neisseria gonorrhoeae genomes where resistance to five different antibiotics has been measured. These data were previously used to do GWAS using an LMM, with selected loci then entering a reduced dimension epistasis analysis (53).
The mapping of resistance genes for these antibiotics using our approach was similar to this GWAS. The original analysis looked at ϳ8,700 SNPs with a MAF of Ͼ0.5%; we used 5.3 ϫ 10 5 unitigs with a MAF of Ͼ1%. Azithromycin (AZI) had 4,612 unitigs selected, with the top hits mapping to the four 23S rRNA sequences in the genome. The original analysis only identified a SNP in one of these repeated rRNA sequences, likely due to the impossibility of mapping variation in these repeats at a single base level: this is an advantage of being able to report multiple mappings of sequences at the final stage. Cefixime (CFX) identified the penA region, as in the original analysis, and also  suggested an association in the promoter of opaD. Ciprofloxacin (CIPRO) had hits throughout the genome, as in the original analysis and similar to the analysis of erythromycin described above-combining the LMM with the elastic net may reduce candidate regions in these cases. Penicillin (PEN) had a hit in the porB region, as in the original analysis, along with hits in lgtE, mexB (and efflux pump), and a prophage. Tetracycline (TET) similarly had a replicated hit in the porB region, along with the cysN promoter, an alternative pilE allele, and rsmE (a ribosomal methyltransferase). These Manhattan plots can be seen in Fig. S6. The WHO N strain (54) contains plasmids with blaTEM and tetM, causal for PEN and TET, respectively, to which we can also map unitigs rather than needing to recall variation with respect to this reference panel. This confirms further hits to these genes. Our method therefore broadly replicated the results from the LMM and added new candidate hits due to testing unitigs rather than just SNPs, as expected.
We also calculated the narrow sense heritability h 2 using our elastic net and unitig method and compared these to those calculated with previous methods (Fig. 4). Our estimates were very similar to those of h SNP 2 from the original paper, though consistently slightly higher (3% Ϯ 7%), which may be a result of including more of the population variation through unitigs. Using a simple estimate of shared sequence content as the kinship matrix led to a likely overestimate of heritability. For these antibiotics, we expect high h 2 , approaching 1, as we discover further causal mechanisms and include them in the model (55). These estimates are consistent with this expectation but are difficult to evaluate quantitatively. It is challenging to evaluate the accuracy of heritability estimates, as the true biological value cannot easily be measured in bacteria and FIG 4 Heritability estimates for antibiotic resistance in the combined N. gonorrhoeae data set, using different methods. For each of the five antibiotic resistances measured in this data set, we report the narrow-sense heritability (h 2 ) from our elastic net method and unitigs (gray), the limix method implemented in pyseer, using sequence distance (gold) or phylogeny (blue), and the restricted maximum likelihood (REML) approach used in the original publication (green). For limix estimates, 95% confidence intervals (CIs) were calculated with FIESTA (82). These are not shown for the phylogeny method as they span a range wider than the plot (0.11 to 1).

Pangenome-Wide Associations and Prediction in Bacteria
® measurement via simulation is often circular, where methods used to generate the simulations necessarily perform best.

DISCUSSION
In this paper, we developed a microbe-specific implementation of a machine learning tool and showed how this can be used to better understand the link between bacterial genetic variation and phenotypic variation. We argue that the sophistication of the prediction model itself is generally less important than three other factors: the data set itself, creating a method with careful genomic data management, and incorporating knowledge specific to bacterial populations. We addressed these issues as follows. Pangenomic variation was covered using a unitig definition of population variation, which we showed to be scalable, unlike k-mers, and better suited to analyzing accessory genome variation and inter-data set consistency, unlike SNPs. Population structure was accounted for explicitly using sequence reweighting and leave-onestrain-out cross-validation. We maintained a clear link between our resulting models and underlying genetics by combining linear models with a suite of tools to interpret the variants selected in the model. This had the further advantage that it significantly reduced the number of sequence elements to be processed after association. Using selected unitigs allowed for a much smoother use of the interactive plotting software phandango (56), which is one of the fastest ways to interpret bacterial GWAS results.
Our method is interpretable, and selected variants contain a high proportion of causal variants in simulations. We compared power and false-positive rate to selection with a P value threshold using simulations, and using real data from two species showed how this method can be combined with GWAS approaches to understand resistance and epistasis. LMMs are the best GWAS approach but mostly excel in cases where a trait is polygenic and has evolved multiple times, as is the case with many causal antibiotic resistance markers (11). If causal variants are more closely correlated with lineage, as may be the case for less strongly selected phenotypes, the LMM has reduced power, whereas the elastic net includes these groups of variants more effectively than previous fixed-effect models. This has recently been shown in an independent set of simulations (57). It is also worth noting that in highly clonal settings, the LMM can suffer from a very high false-positive rate unless the kinship matrix is carefully chosen (57,58). Depending on the genetic architecture of the phenotype, which is rarely known a priori, one of these two methods may have a more desirable power/ false-positive trade-off. The option to combine selection with the elastic net and ranking from the LMM appears to be useful in some challenging intermediate cases. An alternative approach is to associate LMM variants with lineages themselves marginally associated with the phenotype, as proposed in the bugwas package (11). The elastic net would be preferable where individual lineage associations are weak but causal, and the entire lineage block would be rejected.
We also obtain useful estimates of trait heritability, some of which show evidence that previous approaches may have overestimated this quantity. For the purposes of prediction, on a simple data set, we replicated the result that regularized linear models perform similarly to more complex deep learning methods. We also applied our method to a range of data sets from different species and phenotypes, including resistance, carriage duration, and virulence. Though our models generally performed well when measured on error rates, an experiment with models on three separate cohorts showed how accuracy falls outside the target data set. External data sets may have different strain compositions due to different biases toward more-or less-virulent strains, geographical separation, vaccine use, or antibiotic consumption in the population. We would reiterate the caveat that while these models can be useful, high accuracy on test data should not be taken as a general measure of confidence (9). Batch differences such as genotyping methods between cohorts exaggerate this problem, and so a consistent approach (such as the one we provide here) should be used. Unsurprisingly, curated resistance sets-the result of decades of research-still generally perform better, although even this in silico method loses accuracy between data sets (34). Less-well-understood and potentially polygenic phenotypes such as virulence offer an attractive target for our model, as we demonstrated on two streptococcal pathogens.
Along with these theoretical advances, our package has a number of practical advantages. All of the elements of our method are freely available, well documented, and part of a continuous unit-testing framework. Users can construct and evaluate models easily, without the need for programming experience, with options which retain the flexibility to modify the model parameters. There is no need for specialist hardware such as the graphics cards needed to fit large deep learning models. The models themselves are saved in a human-readable format, are easy to share and reuse, and have minimal hardware-specific requirements.
Our method does remain limited due to its reliance on the elastic net. Higher order interactions are difficult to include, as they make the size of the input space increase greatly, whereas in other machine learning approaches such as random forests, these are included naturally in the model structure. Simultaneous selection of both hyperparameters ␣ and is more challenging due to a greatly increased search space in cross-validation, and so we rely on a heuristic selection of ␣. For larger data sets, the requirement to store the entire variant matrix in memory to fit the elastic net can easily exceed available memory. This could be solved by storing this matrix as a file on a disk and memory mapping this file, or reading rows only as needed (40). Our implementation cannot currently incorporate prior information on input variants, such as known association with antimicrobial resistances, though extension to either an ensemble model or Bayesian regression would be possible. As we have introduced a general framework for variant input/output in pyseer, we hope to include further machine learning approaches which allow trade-off of these advantages and disadvantages in the future.
The effect sizes from the elastic net do not necessarily indicate the significance of individual variants, as they are optimized for prediction; even after reweighting sequences, nonzero effects form a much larger set than true causal variants and are spread across the genome rather than specifically mapping to a causal region. The combination with P values from a well-performing GWAS method, such as the linear mixed model, helps if the user requires this interpretation. We would expect the same to be true for other machine learning methods and would generally caution against making a "GWAS-type" inference based on predictor importance or similar measures. We also did not perform a thorough comparison with other available machine learning methods; there are many choices, with well-known characteristics, which have previously been shown to perform similarly well at this task (10).
More broadly, we have considered techniques routinely used in the analysis of modern data sets, which are generated frequently with high-throughput methods. These can be adapted to perform fundamental tasks in bacterial genomics in ways which are useful and that scale with our ambitions to discover causal drivers and predict phenotypes from genome variation. Collections of high-quality whole-genome sequences are now available at a scale that would have been unfathomable just a few years ago. Many of these data sets are publicly available already, and many more are being generated from new larger projects and routine surveillance by public health agencies. Care must be taken to ensure the unique properties of bacterial populations are properly modeled and that we use appropriate measures of success. Complex models should be compared to simple models (59,60), not just in terms of accuracy but also for their ability to look at underlying biology. In many cases, the limiting factor is unlikely to be model flexibility. With our pangenome-spanning penalized regression models, we hope to have made useful and usable contributions that respect these principles.

MATERIALS AND METHODS
Preparation of data sets. Table 3 shows a summary of the data sets used in this paper. Sequence assemblies were available from the original publications, with the exception of one N. gonorrhoeae study Pangenome-Wide Associations and Prediction in Bacteria ® (61). For this study, we downloaded the read data, removed adapter sequences with trimmomatic (62) v0.36, and assembled them with SPAdes v3.11 (63) using the --only-assembler and --careful options. For all data sets, we then called unitigs from each sample's sequence assembly using a k-mer length of 31, and low-frequency unitigs (AF Ͻ 1%) were discarded. For the Massachusetts and tuberculosis (TB) data sets, additional genetic data were available. For the TB data set, we used the variant call matrix provided by the study's authors (5). Not all phenotypes were available for all samples from this study. For the Massachusetts and Maela data sets, we used SNP calls from an earlier GWAS in this population (47). Where a split into training and test data was needed, this was performed at random in the ratio 2:1. When including a cluster assignment to account for population structure, we used the previous assignments from PopPUNK where available (64). For TB, we used the major lineage as the cluster. When MICs indicated antibiotic susceptibility, this phenotype was first log transformed before any downstream analysis. Other phenotypes were used as originally reported.
To generate simulated data used for testing power and false-discovery rate against a ground truth, we simulated phenotypes but used observed SNP genotype data from the Maela data set to ensure a realistic genetic model for bacterial population structure. Phenotypes were simulated using GCTA (16), either as continuous or as binary using a liability threshold model. Then, to assess the power of the methods with respect to the sample sizes, we randomly choose subsamples with 500, 1,000, 2,000, and 3,000 samples (with h 2 fixed at 0.5). Separately, using 3,000 samples, h 2 was varied at 0.1, 0.3, 0.6, and 0.9. We did not use fewer than 500 samples, as we expect a significant decrease in discovery power (12). LMM and fixed-effects models were run using default settings in pyseer (10 multidimensional scaling [MDS] components and kinship matrix estimated from shared distance to the root of a phylogeny). We used a fixed-effect size of 4, a relatively high effect size typical of that found for antimicrobial resistance SNPs, which have been found to range between roughly 2.5 to 7 for this data set (45), and higher than virulence effects which have been found to range between 0.1 and 1 in the Netherlands data set described above (43).
Elastic net model. We use a high-dimensional regression model which includes all genetic variants and covariates of interest x ជ. Typically, this is not possible using classical inference methods such as the least-squares estimator, as the number of genetic predictors m exceeds the number of samples N, leading to an underconstrained system. The elastic net defines such a function (33,65), mixing L1 (lasso) and L2 (ridge) penalties, which is minimized with respect to the values of the intercept b 0 and slopes b ជ : where N is the number of samples, w i are positive weights (with a sum equal to N), l() is the link function (linear or logit for continuous or binary phenotypes y i , respectively), Ͼ 0 is the magnitude of the penalty, and 0 Ͻ ␣ Ͻ 1 is the amount of mixing between L1 and L2 penalties. Minimizing this function reduces the squared distance between predicted and observed values when ϭ 0; as increases, predictors are shrunk toward zero to trade-off prediction accuracy with overfitting. Given the strong linkage disequilibrium present in bacterial populations (11,66), many genetic variants are strongly correlated across long distances. Randomly selecting a single representative variant from such a group will likely lead to greatly varied biological conclusions, and so including the entire block with suitable lineage annotation is preferred (11). For this reason, the elastic net has been shown to be especially useful when the variables are dependent (33). We compare this selection with the lasso in our simulations. The value of ␣ can be changed by the user, should they wish to opt for a sparser model.
Two important parameters which are not directly set using the data are and ␣. The amount of penalty is set by cross-validation, using a default of 10-fold to pick the value of , which maximizes the cross-validated R 2 value of the model over the test data. The user can change the number of folds. The amount of mixing (␣) between L1 penalties, which lead to sparse predictive models with mostly zero valued slopes, and L2 penalties, which lead to models with shrunken but nonzero predictors, could also be chosen by cross-validation to maximize prediction accuracy. However, we propose users choose a value of around 0.01 throughout (all experiments reported here use ␣ ϭ 0.01). A sparse model is preferred for prediction for speed and easier consistency between populations, but this leads to a loss of power in the context of GWAS, as potentially causally related predictors may be removed from the model ( Fig. 2 and S2 in the supplemental material). Introducing even a small L1 penalty in this way removes a large fraction of variants unconnected with the phenotype.
By virtue of the fact that all pangenomic variation enters this model, if a linear additive model of heritability is assumed, which it typically is in bacterial GWAS (44,45,67), the value of R 2 calculated from the fit of the elastic net, also serves as an estimate of the narrow-sense heritability h 2 . As R 2 measures the variance explained by the model's predictors, in this case all genetic features, this is equivalent to the proportion of phenotypic variance p 2 explained by genetic variation g 2 , the definition of h 2 . This provides a lower bound on h 2 , because the Lasso-type estimator is biased (68), and it tends to shrink some coefficients with weak effects toward zero, though these weak effects may have a significant effect on the trait variability.
Efficiently modeling the entire pangenome. Bacterial populations vary greatly in their sequence content, and mapping short variation within their core genes (coding sequences shared by all members of the population) is generally insufficient to capture all of the variation within the samples. In particular, accessory gene content has been shown to be associated with core variation (64), be associated with clinically relevant phenotypes (12,69), and be useful for predicting the evolution of the population (70,71).
Early bacterial GWAS methods used k-mers, sequence words of fixed or variable length, to assay variation throughout the population independent of gene annotation or variant calling method (11,12,35). The set of common k-mers (1% to 99% frequency) is vast, particularly in the large and genetically diverse populations which are most amenable to GWAS. Efforts to model all of these sequence elements simultaneously are potentially computationally intractable, as these words will not fit in main memory and model fitting takes an extremely long time. In the S. pneumoniae data sets, we observed a 13-to 18-fold reduction in the number of unitigs compared to that of k-mers.
We used two techniques to circumvent this issue while still including as much pangenomic variation as possible. Following the idea of screening methods in ultrahigh-dimensional data (72,73), we used the absolute value of the sample correlation as a screening criterion for each variant: where y and x are the standardizations of y and x (defined as phenotype and variant as above) such that This is the "correlation filter." Using a single threshold on the mean value for this correlation would lead to a large number of variants being removed before modeling, which is appealing computationally but, in our simulations, led to a loss of power. We instead removed the lowest quartile, which maintained power but did not reduce model size as much. The size of the quantile to remove can be set by the user.
We also followed the method used in DBGWAS (36), which after counting fixed-length k-mers, constructs a compressed de Bruijn graph (DBG) of the population. Nodes in this compressed graph are extensions of adjacent k-mers in the raw graph with the same population frequency vector and whose sequences are referred to as unitigs. These unitigs greatly reduce the redundancy present in raw k-mer counts by combining those with the same patterns and are generally easier to functionally interpret due to their longer length. We followed the same method as step 1 of the DBGWAS package, which uses the GATB library to construct a compressed DBG (74), and then reported frequency vectors of each unitig/node and unique pattern in a format readable by pyseer. We used a k-mer length of 31 throughout to count unitigs, as this was previously shown to maximize association power (36). This length can also be set by the user. We reimplemented this approach as a standalone package (https://github.com/ johnlees/unitig-counter), also including tools to extend unitigs by traversing neighboring nodes in the graph, and calculate distances between unitigs based on the graph using Dijkstra's algorithm.
Incorporating population structure. Population structure causes correlations between the genetic variants (unitigs) that make up x ជ. As all of these enter our model together, this effect may be implicitly controlled for without the need for further correction. We also wished to compare this to the use of an explicit correction term to test which approach is more effective. This can be included in the modeling step by a combination of three approaches: use of extra predictors in x ជ which account for population structure, modifying the per-sample weights w i , or by changing the folds used in cross-validation. We discuss each of these in turn.
Fixed-effect models typically use a principal components-type analysis to include the main axes of variation in the population as covariates. In a new data set, projection of variation onto these existing axes could be used but would require large overlap between variant calls in each data set to be accurate. Random-effect models use a kinship matrix to include the sample covariance matrix in each association. For a new data set, this would require calculation of covariance against the original data set, which reduces model portability.
We therefore opted to use a definition of population structure which does not introduce extra predictors. This makes application of the model more straightforward in new data sets. The use of a definition of clusters which naturally extends to new populations, which may have very different strain frequencies and/or large numbers of novel clusters, further increases robustness in the face of betweendata set variation. Any method which produces discrete cluster membership definitions independent of cluster frequency is suitable for this purpose, such as sequence type, clonal complex, or percentage identity cutoff. We opted to use the "strain" definition provided by the PopPUNK software throughout our analysis due to its speed and biological basis (64) However, our implementation allows any preferred definition of cluster membership to be used.
Cluster membership for each sample x is defined as C i (x), which is 1 if x is in cluster i or 0 otherwise. The model is fit to all the data except the first cluster, and then its accuracy is tested on the cluster not used in the fitting. This is repeated for every cluster to find the value of which maximizes fit accuracy over all these fits. This may be referred to as leave-one-strain-out (LOSO) cross-validation or leave-onecluster-out (LOCO) cross-validation. This is more realistic than random selection of folds, as random samples would maintain relative strain frequencies between training and test data, whereas new populations usually vary greatly in their genetic background (1,10).
Furthermore, we added the sequence-reweighting option in pyseer, which defines the sample weights as being inversely proportional to the cluster/strain size: This sequence reweighting is a commonly used definition in epistasis methods such as direct coupling analysis and correlation-based approaches, which have recently been successfully applied to genomewide variation in bacterial populations (38,39,75).

Pangenome-Wide Associations and Prediction in Bacteria
to which the appropriate link function l() is then applied to convert into a probability (which can be converted into a binary outcome using a threshold cutoff).
Predictors which are missing can either be ignored (set x i ϭ 0) or imputed (set x i ϭ x , where x is the allele frequency in the original data set). For the unitig caller approach described below, a missing call means a genuine absence in the data, and so we used the former approach. For variant calling methods where missing calls may be artifactual (such as SNP calls), we use the mean value imputation approach. The option -ignore-missing can be used to control this behavior manually.
To apply fitted models to new populations, consistency in the construction of the matrix of genomic variants x ជ between data sets is important to maximize prediction accuracy. This can be highly challenging for core SNPs due to the many possible methods and individual options to extract these from sequence reads. Sequence element presence or absence is simple to define, and so it is easier to enforce consistency of input between data sets. The main issue with using unitigs here is that DBGs of different sample sets will have different sequences at their nodes and, therefore, nonoverlapping unitig calls. To solve this issue, we defined unitig sequences once in the training population. Rather than building a new DBG in test data sets, we instead checked for the presence of unitigs previously defined in the training population. To do this efficiently, we created and saved an Ferragina and Manzini (FM) index for each input sequence (76), which is a substring index capable of supporting fast text searches, which we performed in parallel for each unitig query. We implemented this as a separate package unitig-caller (https://github.com/johnlees/unitig-caller) in Cϩϩ using the SeqAn library (77). An alternative implementation of both "counting" and "calling" unitigs is possible by constructing and querying sequences using a single population DBG. We implemented this in the same package by creating an interface to the Bifrost method (78).
For continuous phenotypes, we report R 2 to describe prediction accuracy. This is more difficult to interpret with binary phenotypes, especially in the presence of class imbalance and when the imbalance deviates from the population-wide prevalence. For binary phenotypes, we found that reporting positive statistics such as sensitivity and specificity, or especially area under the curve (AUC), led to reports in the top decile for almost all data sets and methods and were harder to intuitively compare between different models and data sets. We therefore report the false-negative rate and false-positive rate along with the totals selected.
Implementation. We implemented the association model in version 1.3.2 of the pyseer bacterial GWAS package, which is written in python (79). This takes care of reading variants in many common formats, including the output from unitig-counter, as well as providing tools to help interpret associated sequences. We used python bindings to the fortran glmnet package to actually fit the model, as the use of warm-starts more efficiently solves the above equation at an array of values of (65). Cross-validation, parallelized if requested by the user, is used to select the value of with the greatest R 2 value, as defined above.
Variant matrices are potentially very large, and so to optimize speed and memory use, we read these into a sparse matrix structure. Variants with allele frequency of Ͼ50% have their genotypes flipped to increase sparsity-these sites are flipped back during prediction. This sparse structure can be saved to disk to avoid repeated parsing of variant input files. Only haploid variant calls (0/1), allele frequency, and sample order are saved in this file. After extracting the nonzero coefficients from the fitted elastic net, the input variant file is reread with minimal parsing to output information about the selected variants. A SHA256 hash of the input file is calculated to ensure consistency with the original input file.
The fitted models are saved as an associative array, with variant names as keys (either sequence or alleles combined with chromosome and position) and allele frequencies and fitted slopes as values. New variant call files are read with minimal parsing to extract just those sites which appear in the model, and at the end-of-file, the appropriate imputation procedure is applied to model terms which were not found. This allows both rapid prediction in large new data sets and an easy and portable way to share predictive models.
For fitted models, slopes, P values (adjusted by any of pyseer's other models), and allele frequencies are included in the output. Where the true phenotype is known, prediction accuracy is reported using R 2 and a confusion matrix if the phenotype is binary. If clusters were provided, accuracy statistics for within each cluster are also included in the output.
We used the tools within pyseer to interpret unitigs. Specifically, we used bwa-mem with the shortest possible seed to map unitig sequences to a single reference genome (Streptococcus pneumoniae ATCC 700669 [80]; Neisseria gonorrhoeae FA1090 [81], and WHO_N [54]). Bonferroni corrections were calculated by using the number of unique unitig patterns as the number of tests.
The new code in pyseer includes automated tests and unit tests we wrote using test data distributed with the package. Documentation and a tutorial are available online (https://pyseer.readthedocs.io/en/ master/predict.html).

ACKNOWLEDGMENTS
We thank the developers of DBGWAS who wrote the majority of the code to count unitigs, which we modified for this application. We are particularly grateful to Leandro Ishi Soares de Lima and Rayan Chiki for helping test and debug code from DBGWAS and GATB used in unitig-counter. We also wish to thank the developers of SeqAn 3.0 for their help implementing the FM index in the unitig-caller package. We thank Benjamin Schubert and Deborah Marks for providing the phylogeny for the N. gonorrhoeae data set and Kevin Ma for advice on appropriate mapping of significant unitigs in this species. Daniel Falush provided useful feedback on an earlier presentation of this work. We thank Laurent Jacob for providing detailed critical feedback at multiple stages of the manuscript.