Reduced ambiguity and improved interpretability of bacterial genome-wide associations using gene-cluster-centric k-mers

The wide adoption of bacterial genome sequencing and encoding both core and accessory genome variation using k-mers has allowed bacterial genome-wide association studies (GWAS) to identify genetic variants associated with relevant phenotypes such as those linked to infection. Significant limitations still remain because of k-mers being duplicated across gene clusters and as far as the interpretation of association results is concerned, which affects the wider adoption of GWAS methods on microbial data sets. We have developed a simple computational method (panfeed) that explicitly links each k-mer to their gene cluster at base-resolution level, which allows us to avoid biases introduced by a global de Bruijn graph as well as more easily map and annotate associated variants. We tested panfeed on two independent data sets, correctly identifying previously characterized causal variants, which demonstrates the precision of the method, as well as its scalable performance. panfeed is a command line tool written in the python programming language and is available at https://github.com/microbial-pangenomes-lab/panfeed.


INTRODUCTION
Recent years have shown an unprecedented surge in bacterial genome sequencing.The availability of these large amounts of data has given rise to new opportunities in the field of microbial genomics, namely genomic epidemiology, population genetics and genome-wide association studies (GWAS).For the latter the combination of ever larger sample sizes (e.g.>200) and dedicated bioinformatic software has allowed the association of genetic variants to a number of bacterial phenotypes, some of which having high relevance to infection biology.These include virulence and pathogenicity [1], antimicrobial resistance [2,3], host range [4] and carriage duration [5].In many cases, statistically associated variants were verified to be causal for the phenotype through forward genetic techniques [1,6], proving the power of the approach and its usefulness to elucidate crucial aspects of bacterial infections [7,8].
While for human GWAS there are already a wide variety of computational approaches being used to run large-scale analyses and for the interpretation of hits [9][10][11], the field of bacterial GWAS cannot directly make use of these advancements because of several specificities in the genomic structure of microbes [7,8].The clonal reproduction of bacteria leads to a stronger population structure in closely related strains.This in turn increases the linkage disequilibrium between genetic variants that contribute to a phenotype and those that do not, thus significantly increasing the risk of false positives.Horizontal gene transfer (HGT), on the other hand, increases the genomic plasticity of bacteria, leading to large differences in the genetic content across isolates.This makes the use of reference genomes, as is typically done in human-based studies, less appealing, since the large genomic heterogeneity cannot be captured by using a single reference genome; this is especially true for the extreme case of accessory genes not encoded in the chosen reference [12].An approach that mitigates this reference bias is the use of pangenome graphs, which are often constructed using de Bruijn graphs (DBG) via k-mers [13], unitigs [14] or gene clusters [15] as the smallest construction units.Pangenomes divide a species' entire genomic information into two categories; the core genome, which contains sequences present within all or most (e.g.>95 %) strains of a species and an accessory genome that includes sequences occurring in a limited number of strains [16][17][18].Thus far, k-mers and unitigs have proven to deliver a comprehensive set of causal variants, while providing a means to move away from single-strain reference genomes.The drawback of such an approach is a lower interpretability of the resulting genetic variants, given that the local context of each variant is lost in the creation of the pangenome graph.Additionally, the creation of a 'global' (i.e.derived from the full pangenome) DBG means that k-mers sharing the same sequence across multiple gene clusters might lead to false-positive associations [13].Some computational tools have been developed to partially overcome limitations in interpretability, namely DBGWAS [14] and CALDERA [19].DBGWAS (de Bruijn graph genome-wide association study), for instance, constructs a compacted DBG (cDBG) and analyses the resulting unitigs for association with the phenotype.As the original cDBG is quite large, DBGWAS outputs sub-graphs that focus on identified associations and their surroundings.CALDERA offers a similar approach to DBGWAS.It produces a cDBG and selects subgraphs to be tested for their associations to the phenotype, which reduces the computational workload.Both methods rely on the construction of a DBG, which scales poorly with the number of samples and is prone to artefacts due to the presence of repetitive regions in the input sequences.
Even though available bacterial GWAS methods have enabled discovery of variants related to many relevant phenotypes, interpretation and visualization of results tends to be cumbersome, as post-GWAS processing of variants has often been implemented as an afterthought.Furthermore, the often used Manhattan plot only allows for visualization of variants that map to a specific reference, limiting interpretability of variants in the accessory genome of a species.Although different variants may be mapped to different reference genomes, displaying all variants at the same time remains impractical.While DBGs may give a good overview of the complexity of genomic diversity, they tend to still be difficult to interpret and scale poorly with increasing sample sizes.
To address these limitations we have developed a computational method (panfeed) to generate genetic variants from a collection of isolates to be used in a bacterial GWAS, with the main objective to circumvent the inclusion of confounding factors in the presence/absence pattern of variants, due to recurring k-mer sequences, that is k-mer sequences duplicated across gene clusters.Additionally, panfeed allows GWAS targeting promoter sequences to analyse promoters in the context of their downstream gene and an improved interpretability of the resulting associated variants.We leverage three key characteristics of bacterial genomes, namely the high density of coding regions and the empirical evidence that genes are often the main unit of molecular function and are transferred horizontally as a whole [20,21].Encoding genetic variants using k-mers in the context of each gene cluster of the pangenome would therefore allow one to account for both reference biases, since encoding variants using k-mers does not require the use of a reference, and for improved interpretability, since panfeed can retain the absolute and relative position of each k-mer in each input genome, thus allowing a fine-grained mapping of associated variants.panfeed uses as input a genes' presence/absence matrix (as the one given by Roary [17], panaroo [22] or ggCaller [15]) to create cluster specific k-mer sets and k-mer presence/absence patterns, which can be fed directly into bacterial GWAS software such as pyseer [23].The resulting association table can easily be correlated to the gene clusters, which enables fast interpretation and visualization of variants occurring in both core and accessory genomes.This is made possible by the gene-cluster-centric approach of panfeed, which reduces the biases due to repetitive regions introduced by a 'global' DBG.Global in this case meaning that sequences from the entire genome contributes to that one pattern, making only one presence/absence pattern possible for any one k-mer or unitig,

Impact Statement
In this study we present panfeed, a novel computational method designed to improve the interpretability of bacterial GWAS by generating genetic variants directly from gene clusters.This approach circumvents the limitations of current de Bruijn graphbased methods, namely the fact that some genetic variants are duplicated across multiple gene clusters and that the positional information of those variants has to be derived after the association, thus providing less ambiguous associations and better interpretability and visualization of associated variants.Our evaluations on two independent data sets demonstrate panfeed's comparable precision to other state-of-the-art methods and scalable performance, making it a valuable tool for investigating the genetic basis of bacterial phenotypes.
whereas cluster-centric k-mers only contribute to the pattern in their cluster of origin, meaning that the same k-mer sequence may have a different presence/absence pattern, and therefore a different association result depending on the gene cluster.To investigate the recurrence of k-mers in multiple clusters, we analysed k-mer diversity in 22 pangenome data sets across different species.Furthermore, we show that resulting associations, on a data set of 912 Escherichia coli strains, are comparable to those produced by using unitigs derived from a cDBG [14,24], at a similar speed and memory consumption.Lastly, we show how a two-pass approach allows for faster processing with limited memory consumption and low disc space usage, which suggests that the approach can be scaled up to very large sample sizes.

METHODS panfeed algorithm
panfeed iterates over the gene presence/absence matrix produced by software, such as Roary [17], panaroo [22] or ggCaller [15], specifically the ' gene_ presence_ absence.csv' file.For each gene cluster, the full nucleotide sequence of the gene is retrieved from the input GFF3 file of each of the input samples, optionally including sequences upstream and downstream of the start and stop codon, respectively.The canonical k-mers with a user-specified value of k (31 by default) are then generated, and a presence/ absence vector across samples for each k-mer within the gene cluster is created, as well as the gene presence/absence vector.The algorithm also tracks the position of each k-mer in a user-provided list of strains, in absolute genomic coordinates and relative to the start codon of the gene.The data is then written into three separate plain-text tab separated files (TSV).The first file is ' kmers.tsv' , containing the absolute and relative position of each generated k-mer for the samples indicated by the user.The second file is ' kmers_ to_ hashes.tsv' , which contains the pairing between all generated k-mers and the identifier of their presence/absence pattern across all samples; we note that k-mers can share the same presence/absence pattern and that is therefore more computationally efficient to test each unique pattern rather than all k-mers.The last file is ' hashes_ to_ patterns.tsv' , which contains a rectangular presence/absence binary matrix for each unique pattern; this file can directly be used by tools such as pyseer (option '--rtab') to run a GWAS analysis.The first two files can be then used to directly map a set of patterns such as those passing the significance threshold to their precise location in the input samples.A visual representation of panfeed's algorithm flow is shown in Fig. 1.
panfeed can be sped up by taking advantage of the presence of multiple CPU cores; if at least three cores are provided by the user, then the algorithm is divided in three components that run in parallel, using N cores.The first one is the reader process, which cycles through each gene cluster and pushes the resulting sequences and gene coordinates in a queue, which is consumed by the second component, with N-2 separate workers, which extract the k-mers from each gene cluster and their coordinates and pushes them in a second queue.The last component is the writer process, which writes the three output files described above.panfeed is written using the python programming languages and has the following dependencies, with the listed versions used in the benchmarks presented here: numpy v1.23.3 [25], pandas v1.5.0 [26], pyfaidx v0.7.1 [27], matplotlib v3.5.2 [28] and seaborn v0.11.2 [29].The code is available on github (https://github.com/microbial-pangenomes-lab/panfeed)and as a package on pypi (https://pypi.org/project/panfeed/) and on conda (https://anaconda.org/bioconda/panfeed).

k-mer diversity and duplication statistics
The extent of recurring (i.e.duplicated across gene clusters) k-mers in different gene clusters within data sets was determined using 21 pangenomes from 21 different bacterial species (Table S2, available in the online version of this article).The selection process for the 21 data sets included utilizing up to 400 genome sequences within selected species available in the RefSeq and GenBank NCBI databases.If less than 400 assemblies were available they were all downloaded, otherwise a random sample of 300 assemblies was collected.Annotated assemblies were downloaded using ncbi-genome-download.For the analysis, pangenomes of the data sets were produced using panaroo (with the arguments '--clean-mode strict' and '--remove-invalid-genes'), of which the gene presence/absence matrix could then be input into panfeed.panfeed was used to count cluster-specific k-mers with the '--no-filter' option allowing k-mers with the same presence/absence pattern as their cluster of origin to be recorded, which would otherwise be filtered out.

BSI data set
To evaluate the results of a GWAS using panfeed, both unitig-counter, a DBG-based method to produce joint core and accessory genome variation [14,23], and panfeed were used to generate k-mer-based genetic variants, which were fed to pyseer to find significant associations with the entry of the pathogen through the urinary tract for 912 E. coli strains causing bloodstream infections [30] (BSI study).The data set contains among other things, information on the portal of entry of the pathogen into the host, as well as other metadata for the patient the pathogen was extracted from.This metadata, which includes for instance other comorbidities of the patient, was used as covariates in the GWAS.Resulting k-mers/unitigs were filtered using a Bonferroni correction threshold (alpha/family wise error rate 0.05) on the number of uniquely tested patterns.The k-mers of panfeed included those having the same presence/absence pattern as their gene cluster of origin, as the '--no-filter' option was used.The overlap between gene cluster k-mers generated by panfeed and unitigs generated by unitig-counter was assessed with bedtools v2.30.0 [31] intersect, looking for overlaps of at least one base pair, disregarding the strandedness of sequences.The same command as in the original GWAS analysis by Denamur et al. [30] was used to find associations with pyseer with both k-mers and unitigs.

Neisseria data set
panfeed was also tested for its ability to interpret bacterial GWAS results on a data set of 1556 N. meningitidis strains [6].Unlike the original study, we ran an analysis pooling the discovery and the replication data sets.Similarly to previous runs of panfeed, the '--no-filter' option was used.Figs 2b and 3 were created using the convenience script provided with panfeed.Sequences for Fig. 1.Structure of panfeed's algorithm.Blue ovals represent sections of the algorithm that can be parallelized if the user provides at least three cores; the number of worker processes will be N-2, with N being the total number of provided cores.
gene cluster 'group_319' (fHbp) had to be shifted in a number of strains due to misannotation of start codon positions in the GFF files (Fig. S2).

Benchmarks
The benchmarks for fsm-lite v1.0, unitig-counter v1.1.0 14,24and panfeed were conducted using the BSI (blood stream infection) study data set (see above).Subsets of 250, 450, 650 and the full 912 strains were created by randomly picking strains three times, meaning that every repetition contained a different strain composition, to minimize random effects.panfeed was set to record the k-mer positions for 237 strains, one for each of the lineages in the data set as determined by poppunk v2.4.0 [32].All benchmarks were performed on the same system, a workstation running Ubuntu 20.04.6 LTS with 36 Intel i9-10980XE cores, 128 GB RAM and a solid state disc (SSD 970 EVO Plus 2TB).The benchmarks for panfeed and unitig-counter were repeated with one core and four cores, while fsm-lite only supports the use of one core.The code used to benchmark panfeed, unitig-counter and fsm-lite can be found on github (https://github.com/microbial-pangenomes-lab/panfeed_benchmark_scripts).
We also tested an approach to optimize panfeed's running time and memory consumption, which we term 'two-pass approach' .The approach is implemented by giving panfeed no target strains to log k-mer positions for.This results in only the ' hashes_ to_ patterns.tsv' and ' kmers_ to_ hashes.tsv' files being populated, the former of which may be used as input for pyseer.Using panfeed's convenience scripts, clusters containing significantly associated k-mers can afterwards be determined and used in the '--genes' option for a second pass of panfeed on the data set.This second run of panfeed subsequently logs the k-mer positions for the clusters given in the '--genes' option and the strains that were given as targets.A detailed scheme of the algorithm's flow including the steps followed by the two-pass approach is available in Fig. S1.

Gene-cluster-specific k-mers for fine-grained mapping of significant associations
panfeed uses the gene-cluster-centric pangenome creation of tools such as panaroo to reduce the complexity of the input strains' pangenome, which removes the reference bias while allowing an easier interpretation of association results.Compared to DBGbased approaches, panfeed avoids the creation of a global graph.Therefore, cluster-centric k-mers produced by panfeed may have multiple associated presence/absence patterns when they occur in multiple clusters.In contrast to this, global k-mers or unitigs may only occur once in a DBG and can therefore only have one presence/absence pattern, to which all occurrences of the k-mer or unitig contribute to, regardless of the gene it occurs in.Because of this, cluster-centric k-mers circumvent the introduction of artefacts in their presence/absence patterns due to repetitive sequences or truncated gene variants, if they are sufficiently differing in terms of length and nucleotide sequence, as they are filtered out in the pangenomic clustering step.Additionally, panfeed extracts k-mers and their respective presence/absence patterns within each gene cluster, as well as the absolute coordinates and relative position.The absolute coordinate is the position the k-mer in relation to the DNA molecule in which it is encoded, while the relative position gives the distance of the k-mer to the gene start, for the specific strain.The resulting unique presence/ absence patterns can be used for efficiently testing their association with a particular phenotype.Interpretability of associated patterns is aided by the fact that k-mers are already mapped to gene clusters, at base-resolution level.Some example visualizations of the result of GWAS analyses enabled by panfeed are shown in Figs 3 and 4.

k-mer duplication levels vary across bacterial pangenomes
To assess the extent of k-mers occurring in multiple clusters, we analysed 21 data sets, comprising 100+ strains each.We observed a relatively large variation in the proportion of duplicated k-mers across gene clusters, with a median value of 2.74 % and a 25 and 75 percentile value of 1.39 and 4.39 %, respectively (Fig. 2 and Table S2).The lowest value belonged to the pangenome of Streptococcus sanguinis (0.48 %) and the highest (11.67 %) to Streptococcus mutans.Depending on the bacterial pangenomes being considered one would then expect that roughly 3 % of k-mers will be duplicated across gene clusters.We then computed the proportion of gene clusters that have at least one duplicated k-mer, with a median value of 12.67 %, a 25 and 75 percentile value of 8.31 and 16.01 %, respectively.Mycoplasma pneumoniae had the lowest proportion of gene clusters with at least one duplicated k-mer (1.72 %), while Streptococcus mutans had the highest (38.15 %).Taken together this analysis indicates that a non-trivial proportion of k-mers is duplicated across a large proportion of gene clusters, meaning that if they were generated in a global DBG to then run statistical associations they could result in unrelated gene clusters being implicated for a bacterial phenotype.This in turn suggests that assigning a gene-cluster-specific presence/absence pattern to k-mers, as panfeed does, could ameliorate this confounding factor.

Associations with gene-cluster-centric k-mers produce similar results as using global unitigs
To assess whether panfeed unitig-counter come to similar association results, we tested the overlap in significantly associated variants between unitigs output by unitig-counter and panfeed's k-mers, using a previously published bacterial data set on E. coli bloodstream infections [30] (BSI, see Methods).The data set contains 912 E. coli strains, for which the mode of entry into the host is available as well as confounding factors, such as patient comorbidities.In total, we found 1 865 486 unique presence/absence patterns using panfeed and 2 132 621 using unitig-counter.Out of those we found a significant association with the target phenotype for 22 patterns, correlating to 498 unique panfeed-generated k-mers, which occur 158 692 times in the data set's genomes and 88 unitigs occurring 29 728 times across the strains, respectively.While the total numbers of significantly associated k-mers and unitigs differ, k-mers overlap with 79.9 % of the unitig positions and unitigs overlap with 89.6 % of k-mer positions (Table 1), indicating a large agreement between the two variant sets, meaning that they overwhelmingly map to the same regions in the input genomes.Overall we found these sequences to map to several gene clusters, 21 with global unitigs and three for panfeed's k-mers (Table S1).We found that the three gene clusters associated through panfeed were also found using global unitigs; the most prominent gene cluster being papGII (named 'group_5515' by panaroo's automatic gene annotation), which encodes for an adhesin that is present at the tip of type P pili [33].This adhesin has previously been linked to pyelonephritis and the urinary tract as portal of entry.The other two shared gene clusters were also genes encoding for fimbrial proteins such as papF (gene cluster 'group_7980') and papD2/papD3 (gene cluster 'papD_2~~~papD_3'), indicating how both approaches correctly identified the putative causal variant for entry through the urinary tract.
We next leveraged panfeed's ability to easily generate visualizations of the associations' result to compare its performance with that obtained using global unitigs (Fig. 3).Additionally, when using panfeed's convenience scripts to connect k-mers and their association P-value, the presence of the entirety of gene cluster 'group_5515' was flagged as significant, meaning that the k-mers derived from the papGII gene have associated with the phenotype with a low P-value.While this is not indicated in the visualization, it can be assessed from the table connecting k-mers, gene clusters and their respective association P-values.The association of papGII to the phenotype is in agreement with the hypothesis that the presence of the whole papGII cluster is causal for determining the entry through the urinary tract, rather than short variants in the gene [30,34].The same association analysis using global unitigs resulted in similar areas of the gene being highlighted as highly significant as well as other regions being highlighted as less significant, while disregarding the possible association of the gene's presence or absence with the phenotype of interest.In  total, 67 significantly associated unitigs were present in gene cluster 'group_5515' .We note that these significantly associated unitigs from cluster 'group_5515' also in the clusters 'papG' , 'papG_1~~~papG' , and 'papG~~~papG_2' , all annotated as papGII, albeit with different frequencies in the pangenome, which, in the case of cluster 'papG' , include truncated gene fragments.
As indicated previously, this results in a difference in presence/absence patterns between k-mers and global unitigs, as panfeed's k-mers are cluster-specific, meaning that the same k-mers in different clusters are only being considered as present in the cluster they occur in.This is in contrast to global unitigs, of which the presence or absence is being considered over the entire genome, allowing the same unitig occurring in different gene clusters to contribute to the overall presence/absence pattern, thus potentially confounding the analysis by implicating unrelated gene clusters in the association results.

Simpler pangenome fine mapping of associated variants
The second data set we used to test panfeed consists of 1556 N. meningitidis strains, for which phenotypic data on the induction of invasive meningococcal disease (IMD) in their hosts was measured [6].Earle et al. found a SNP in the 5′ untranslated region of fHbp (fHbp -7 T) that was significantly associated with IMD-positive strains in a discovery and replication data set consisting of a total of 1556 strains.The GWAS analysis of Earle et al. found k-mers surrounding the SNP to have significant P-values, suggesting a causative link to the specific phenotype, which was then validated experimentally to influence the expression of fHbp, which in turn promotes immune evasion.As panfeed includes an option to incorporate promoter regions of genes in the association study, we also found significant k-mers spanning the same SNP, although with P-values just below the bonferroni-correction threshold.We then used panfeed's convenience scripts to display the results of the association for this fHbp gene cluster and verify that it could correctly identify the experimentally verified SNP (Fig. 4).The visualization allowed us to easily identify the genetic variant encoded by the k-mers as an SNP, given that the gene cluster is present in all strains and that the local context has a high level of homology across the isolates, which is not as straightforward when using global unitigs as the input genetic variants.

panfeed has similar time and resource efficiency as DBG-based variant callers
We compared panfeed's speed and memory consumption to fsm-lite and unitig-counter, which are programmes commonly used to generate k-mers and unitigs for microbial GWAS.The benchmarks were performed using the 912 strain E. coli BSI data set [30].To compare the programmes' performances at differently sized data sets, we generated four subsets with 250, 450, 650 and 912 strains.
While both panfeed and unitig-counter have the option to be run with multiple cores, fsm-lite could only be run using one core.When running on one core, panfeed's memory consumption is comparable to that of unitig-counter at 3.8 GB for 912 strains, while the running time for larger data sets exceeds that of fsm-lite at 3 h 46 min.Employing four cores enables panfeed to drastically reduce the running time to 2 h 5 min, while the memory consumption increases to a temporary maximum of 16 GB as a trade-off (Fig. 5).
Users running panfeed on resource-constrained systems should therefore use a single core or no more than three.An alternative two-pass approach can be used, as described in Methods, to reduce both panfeed's running time and memory consumption further.In this case, the first run of panfeed no k-mer positional information to file and is therefore used to generate the presence/absence matrix used as input for the associations.This first step on four cores takes 1 h 18 min to finish and uses 14.7 GB of peak memory.The second run, in which k-mers are logged only for those gene clusters that have at least one significant association, is completed in ~1 min with a peak memory consumption of 2.1 GB using one core.This two-pass approach additionally decreases the disc space that is needed to store the k-mer positional data from 500+GB to 75.7 MB uncompressed, by only including k-mers from relevant gene clusters.The two-pass approach is therefore comparable in running time to the generation of unitigs, and likely able to scale up as data-set sizes increase.

DISCUSSION
We have developed panfeed to address common difficulties in bacterial GWAS, using the principle that in bacteria a gene is the fundamental unit for many relevant phenotypes.This removes the necessity to map k-mers or unitigs back onto a reference genome, circumventing the drawbacks of using reference genomes and most of all reducing the impact of recurring nucleotide sequences in multiple gene clusters.The latter problem affects a relatively large proportion of all k-mers (around 3 %), which in turn affects an even larger proportion of gene clusters (around 13 %), based on our sample of 21 diverse bacterial pangenomes.
When comparing GWAS results of workflows using panfeed's gene-centric k-mers versus DGB-derived unitigs, we were able to find most significant associations that were also found using global unitigs.While significantly associated unitigs were present in many more gene clusters, these often represent groups containing integrases or transposases that are usually present on mobile genetic elements (MGE) and facilitate horizontal gene transfer through the excision and integration of MGEs into the bacterial chromosome.Although these genes contribute to the spread of MGE, they do not necessarily contribute to the phenotype in question, but rather occur in linkage with genes in the MGE that are causing changes in the phenotype.The approach we developed does not suffer from biases that may be introduced from the construction of a DBG, but is obviously dependent on the approach used to annotate the input genomes and especially to cluster the resulting genes.An ideal data set would then have harmonized gene annotations across isolates.Recent advances in pangenome-aware gene calling and pangenome estimation might then be more appropriate; the recently described ggCaller [15] algorithm's output is in fact compatible with panfeed.We have shown however that a substantial amount of k-mers in various data sets are reoccurring in multiple gene clusters, which is addressed by the gene clustering of programmes such as panaroo.
Mapping, annotation and visualization of k-mer-based associations is a current limiting factor for wider adoption of bacterial GWAS.While encoding genetic variants as k-mers allows both core and accessory genome variation to be encoded, on the other hand, the information on the nature and position of the genetic variant underlying a k-mer is not directly available.With panfeed however we have introduced ways to ameliorate these problems; each k-mer is explicitly mapped to gene clusters at single-base resolution, and both the nucleotide sequence and association statistics can be easily visualised via a heatmap.We have also shown how our approach allows us to distinguish core and accessory genome variation by using the BSI [30] and Earle et al. [6] data sets.For the latter, in particular, we were able to find a localized association signal in the fHbp gene around the experimentally validated SNP fHbp -7 , which was easily identified when displaying the nucleotide sequences at the positions with lowest association P-values.Although the capability of panfeed to include non-coding regions in GWAS is limited to upstream and downstream regions of genes specified in the GFF files, the incorporation of promoter regions, in the context of their genes, in association studies, will likely help to illuminate a thus far underexplored field.We have therefore shown how panfeed can improve the current limitation on k-mer's association interpretability, which will be important as data sets increase in size and complexity with the wider adoption of genome sequencing for applications in microbiology.

Fig. 2 .
Fig. 2. Proportion of multicluster k-mers and clusters containing them.(a) The proportion of k-mers occurring in multiple gene clusters (multicluster k-mer) per number of unique k-mers for the respective pangenome.(b) The proportion of gene clusters containing at least one multicluster k-mer per total number of gene clusters for each pangenome.

Fig. 3 .
Fig. 3. Association results comparison at single nucleotide resolution.Significantly associated unitigs (A, C, D, and E) / k-mers (b) are coloured accordingto their -log10 P-value, ranging from the significance threshold up to 1E −10 .Unitigs and k-mers whose association P-value is below the significance threshold are displayed in light grey.The X-axis represents the position relative to the gene start codon for each isolate's genome, while each row represents a different isolate.The black vertical line displays the annotated gene start.The black horizontal line splits the graph between strains associated with urinary tract entry (top) and strains that are not (bottom).If a strain does not encode for the gene, the sequence will be displayed as dark grey.Panels (a) and (b) refer to gene cluster 'group_5515', panels (c), (d) and (e) display the significantly associated unitigs in clusters 'papG', 'papG_1~~~papG', and 'papG~~~papG_2', respectively.

Fig. 4 .
Fig. 4. Fine mapping of association results.(a) Significantly associated k-mers and their relative position with respect to the fHbp gene.Each row represents a different isolate.Some of the nucleotide sequences have been shifted due to a misannotated start codon.The black horizontal line separates IMD-associated strains (top) and carriage-associated strains (bottom).Dark grey areas are displayed, if a gene has a length below the longest gene within the specific cluster or the k-mers were filtered out due to having the same presence/absence pattern as the gene cluster.(b) Actual nucleotide sequences (encoded with different colours) across all isolates surrounding the fHbp -7 T SNP, which is marked with a red box.

Fig. 5 .
Fig. 5. Time and memory consumption benchmarks of panfeed compared to fsm-lite and unitig-counter.Panels (a) and (c) were run using one core, while the benchmarks in panels (b) and (d) used four cores.Solid line represents the average over three replicates, shaded region the 95 % confidence interval.

Table 1 .
Numbers of resulting unitigs and k-mers associated with entry through the urinary tract in the BSI data set and their overlap