Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines

Abstract Background Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. Results We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. Conclusions The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.


Biotechnology and Biological Sciences
Research Council (GB) (BB/P013740/1) Not applicable Abstract: Background Accurately identifying SNPs from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally-sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter , Escherichia and Klebsiella .

Results
We evaluated the performance of 209 SNP calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic, bacteria such as Escherichia coli , but less dominant for clonal species such as Mycobacterium tuberculosis .

Conclusions
The accuracy of SNP calling for a given species is compromised by increasing intraspecies diversity. When reads were aligned to the same genome from which they were sequenced, among the highest performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highestperforming pipelines often employed the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup or Strelka. Accurately identifying single nucleotide polymorphism (SNPs) from bacterial DNA is 53 essential for monitoring outbreaks (as in [1,2]) and predicting phenotypes, such as 54 antimicrobial resistance [3], although the pipeline selected for this task strongly impacts the 55 outcome [4]. Current bacterial sequencing technologies generate short fragments of DNA 56 sequence ('reads') from which the bacterial genome can be reconstructed. Reference-based 57 mapping approaches use a known reference genome to guide this process, using a 58 combination of an aligner, which identifies the location in the genome each read is likely to 59 have arisen from, and a variant caller, which summarises the available information at each 60 site to identify variants including SNPs and indels (see reviews for an overview of alignment 61 [5,6] and SNP calling [7] algorithms). This evaluation focuses only on SNP calling; we did 62 not evaluate indel calling as this can require different algorithms (see review [8]). 63

Resources
The output from different aligner/caller combinations is often poorly concordant. For 64 example, up to 5% of SNPs are uniquely called by one of five different pipelines [9] with 65 even lower agreement upon structural variants [10]. 66 67 Although a mature field, systematic evaluations of variant calling pipelines are often limited 68 to eukaryotic data, usually human [11][12][13][14][15] but also C. elegans [16] and dairy cattle [17] (see 69 also review [18]). This is because truth sets of known variants, such as the Illumina Platinum 70 Genomes [19], are relatively few in number and human-centred, being expensive to create 71 and biased toward the methods that produced them [20]. As such, to date, bacterial SNP 72 calling evaluations are comparatively limited in scope (for example, comparing 4 aligners 73 with 1 caller, mpileup [21], using Listeria monocytogenes [22]). 74 75 Relatively few truth sets exist for bacteria and so the choice of pipeline for bacterial SNP 76 calling is often informed by performance on human data. Many evaluations conclude in 77 favour of the publicly-available BWA-mem [23] or commercial Novoalign 78 (www.novocraft.com) as choices of aligner, and GATK [24,25] or mpileup as variant callers, 79 with recommendations for a default choice of pipeline, independent of specific analytic 80 requirements, including Novoalign followed by GATK [26], and BWA-mem followed by 81 either mpileup [14], GATK [12],or VarDict [11]. 82 83 This study evaluates a range of SNP calling pipelines across multiple bacterial species, both 84 when reads are sequenced from and aligned to the same genome, and when reads are aligned 85 to a representative genome of that species. 86 87 SNP calling pipelines are typically constructed around a read aligner (which takes FASTQ as 88 input and produces BAM as output) and a variant caller (which takes BAM as input and 89 produces VCF as output), often with several pre-and post-processing steps (for instance, 90 cleaning a raw FASTQ prior to alignment, or filtering a BAM prior to variant calling). For 91 the purpose of this study, when evaluating the two core components of aligner and caller, we 92 use 'pipeline' to mean 'an aligner/caller combination, with all other steps in common. '  While simulation studies can offer useful insight, they can be sensitive to the specific details 135 of the simulations. Therefore, we also evaluated performance on real data to verify our 136 conclusions. We used 16 environmentally-sourced and genomically diverse Gram-negative 137 species of the genera Citrobacter, Enterobacter, Escherichia and Klebsiella, along with two 138 reference strains, from which closed hybrid de novo assemblies were previously generated 139 using both Illumina (short) and ONT (long; Oxford Nanopore Technologies) reads [54]. For 140 this aspect of the study, we quintupled the scope of the evaluation from the initial set of 41 141 pipelines and also present results for a larger set of 209 pipelines. 142 143 All pipelines aim to call variants with high specificity (i.e. a high proportion of non-variant 144 sites in the truth set are correctly identified as the reference allele by the pipeline) and high 145 sensitivity (i.e. a high proportion of true SNPs are found by the pipeline). The optimal trade-146 off between these two properties may vary depending on the application. For example, in 147 transmission inference, minimising false positive SNP calls (i.e. high specificity), is likely to 148 be most important, whereas high sensitivity may be more important when identifying variants 149 associated with antibiotic resistance. We therefore report detailed performance metrics for all 150 pipelines, including recall (sensitivity), precision (positive predictive value, the proportion of 151 SNPs identified that are true SNPs), and the F-score, the harmonic mean of precision and 152 * 254 genomes * 41 pipelines). The number of reads simulated from each species and the 167 performance statistics for each pipelinethe number of true positives (TP), false positives 168 (FP) and false negatives (FN), precision, recall, F-score, and total number of errors (i.e. FP + 169 FN) per million sequenced basesare given in Supplementary Table 3, with the distribution  170 of F-scores illustrated in Figure 2A. 171   172 Median F-scores were over 0.99 for all but four aligner/callers with small interquartile ranges 173 (approx. 0.005), although outliers were nevertheless notable (Figure 2A), suggesting that 174 reference genome can affect performance of a given pipeline. 175 176 Table 1 shows the top ranked pipelines averaged across all species' genomes, based on 7 177 different performance measures and on the sum of their ranks (which constitutes an 'overall 178 performance' measure, lower values indicating higher overall performance). Supplementary 179 Table 4 shows the sum of ranks for each pipeline per species, with several variant callers 180 consistently found among the highest-performing (Freebayes and GATK) and lowest-181 performing pipelines (16GT and SNVSniffer), irrespective of aligner. 182

183
If considering performance across all species, Novoalign/GATK had the highest median F-184 score (0.994), lowest sum of ranks (10), the lowest number of errors per million sequenced 185 bases (0.944), and the largest absolute number of true positive calls (15,778) (Table 1). 186 However, in this initial simulation, as the reads are error-free and the reference genome is the 187 same as the source of the reads, many pipelines avoid false positive calls and report a perfect 188 precision of 1. 189 190 Evaluating SNP calling pipelines when the genome for alignment diverges from the source 191 of the reads 192 Due to the high genomic diversity of some bacterial species, the appropriate selection of 193 reference genomes is non-trivial. To assess how pipeline performance is affected by 194 divergence between the source and reference genomes, SNPs were re-called after mapping all 195 reads to a single representative genome for that species (illustrated in Figure 1 Supplementary Table 6, with an associated ranked summary in Supplementary  204   Table 7. 205 In general, aligning reads from one strain to a divergent reference leads to a decrease in 206 median F-score and increase in interquartile range of the F-score distribution, with pipeline 207 performance more negatively affected by choice of aligner than caller ( Figure 2B). 208

209
Although across the full range of genomes, many pipelines show comparable performance 210 ( Figure 2B), there was a strong negative correlation between the Mash distance and F-score 211 (Spearman's rho = -0.72, p < 10 -15 ; Figure 3). The negative correlation between F-score and 212 the total number of SNPs between the strain and representative genome, i.e. the set of strain-213 specific in silico SNPs plus inter-strain SNPs, was slightly weaker (rho = -0.58, p < 10 -15 ; 214 Supplementary Figure 1). This overall reduction in performance with increased divergence 215 was more strongly driven by reductions in recall (i.e., by an increased number of false 216 negative calls) rather than precision as there was a particularly strong correlation between 217 distance and recall (Spearman's rho = -0.94, p < 10 -15 ; Supplementary Figure 2). 218 219 Three commonly used pipelines -BWA-mem/Freebayes, BWA-mem/GATK and 220 Novoalign/GATKwere among the highest performers when the reference genome is also 221 the source of the reads (Table 1 and Supplementary Table 4). However, when the reference 222 diverges from the reads, then considering the two 'overall performance' measures across the 223 set of 10 species, Snippy instead has both the lowest sum of ranks (20) and the highest 224 median F-score (0.982), along with the lowest number of errors per million sequenced bases 225 (2.6) ( Table 1). 226 227 Performance per species is shown in Table 2, alongside both the overall sum and range of 228 these ranks per pipeline. Pipelines featuring Novoalign were, in general, consistently high-229 performing across the majority of species (that is, having a lower sum of ranks), although 230 were outperformed by Snippy, which had both strong and uniform performance across all 231 species (Table 2). By contrast, pipelines with a larger range of ranks had more inconsistent 232 performance, such as minimap2/SNVer, which for example performed relatively strongly for 233 N. gonorrhoeae but poorly for S. dysenteriae (Table 2). 234 235 While, in general, the accuracy of SNP calling declined with increasing genetic distances, 236 some pipelines were more stable than others. If considering the median difference in F-score 237 between SNP calls made using the same versus a representative genome, Snippy had smaller 238 differences as the distance between genomes increased ( Figure 4). 239

240
The highest ranked pipelines in Table 2 had small, but practically unimportant, differences in 241 median F-score and so are arguably equivalently strong candidates for a 'general purpose' 242 SNP calling solution. For instance, on the basis of F-score alone the performance of 243 Novoalign/mpileup is negligibly different from BWA-mem/mpileup ( Figure 5). However, 244 when directly comparing pipelines, similarity of F-score distributions (see Figure 2B) can 245 conceal larger differences in either precision or recall, categorised using the effect size 246 estimator Cliff's delta [59,60]. Thus, certain pipelines may be preferred if the aim is to 247 minimise false positive (e.g. for transmission analysis) or maximise true positive (e.g. to 248 identify antimicrobial resistance loci) calls. For instance, although Snippy (the top ranked 249 pipeline in Table 2) is negligibly different from Novoalign/mpileup (the third ranked 250 pipeline) in terms of F-score and precision, the former is more sensitive ( Figure 5) (Figure 3A), 268 there was a strong negative correlation between Mash distance and the median F-score across 269 all pipelines (Spearman's rho = -0.83, p = 3.36x10 -5 ; Figure 6A), after excluding one 270 prominent outlier (E. coli isolate RHB11-C04; see Supplementary Table 8).  271   272 Notably, the median precision of each pipeline, if calculated across the divergent set of 273 simulated genomes, strongly correlated with the median precision calculated across the set of 274 real genomes (Spearman's rho = 0.83, p = 2.81x10 -11 ; Figure 6B). While a weaker correlation 275 was seen between simulated and real datasets on the basis of recall (Spearman's rho = 0.41, p 276 = 0.007), this is consistent with the high diversity of Enterobacteriaceae, and the accordingly 277 greater number of false negative calls with increased divergence (Supplementary Figure 2). 278

279
Overall, this suggests that the accuracy of a given pipeline on simulated data is a reasonable 280 proxy for its performance on real data. While the poorer performing pipelines when using 281 simulated data are similarly poorer performing when using real data, the top ranked pipelines 282 differ, predominantly featuring BWA-mem, rather than Novoalign, as an aligner 283 (Supplementary Table 10). In both cases, however, among the consistently highest 284 performing pipelines is Snippy.

Reference genome selection strongly affects SNP calling performance 296
Here we initially evaluated 41 SNP calling pipelines, the combination of 4 aligners with 10 297 callers, plus one 'all-in-one' tool, Snippy, using reads simulated from 10 clinically relevant 298 species. These reads were first aligned back to their source genome and SNPs called. As 299 expected under these conditions, the majority of SNP calling pipelines showed high precision 300 and sensitivity, although between-species variation was prominent. 301 302 We next expanded the scope of the evaluation to 209 pipelines (representing the addition of 303 12 aligners, 4 callers, and 2 'all-in-one' pipelines, SpeedSeq and SPANDx) and introduced a 304 degree of divergence between the reference genome and the reads, analogous to having an 305 accurate species-level classification of the reads but no specific knowledge of the strain. For 306 the purposes of this study, we assumed that reference genome selection was essentially 307 arbitrary, equivalent to a community standard representative genome. Such a genome can 308 differ significantly from the sequenced strain, which complicates SNP calling by introducing 309 inter-specific variation between the sequenced reads and the reference. Importantly, all 310 pipelines in this study are expected to perform well if evaluated with human data, i.e. when 311 there is a negligible Mash distance between the reads and the reference. For example, the 312 Therefore, one major finding of this study is that, irrespective of the core components within 342 a SNP calling pipeline, the selection of reference genome has a critical effect on output, 343 particularly for more recombinogenic species. This can to some extent be mitigated by using 344 variant callers that are more robust to increased distances between the reads and the 345 reference, such as Freebayes (employed by Snippy and SpeedSeq). 346 347 A sub-optimal choice of reference genome has previously been shown to result in mapping 348 errors, leading to biases in allelic proportions [75]. Heterologous reference genomes are in 349 general sub-optimal for read mapping, even when there is strict correspondence between 350 orthologous regions, with short reads particularly vulnerable to false positive alignments [76]. 351 There is also an inverse relationship between true positive SNP calls and genetic distance, 352 with a greater number of false positives when the reads diverge from the reference genome 353 [22]. 354

Study limitations 356
The experimental design made several simplifying assumptions regarding pipeline usage. 357 Most notably, when evaluating SNP calling when the reference genome diverges from the 358 source of the reads, we needed to convert the coordinates of one genome to those of another, 359 doing so by whole genome alignment. We took a similar approach to that used to evaluate 360 Pilon, an all-in-one tool for correcting draft assemblies and variant calling [44], which made 361 whole genome alignments of the M. tuberculosis F11 and H37Rv genomes and used the 362 resulting set of inter-strain variants as a truth set for benchmarking (a method we also used 363 when evaluating each pipeline on real data). While this approach assumes a high degree of 364 contiguity for the whole genome alignment, there are nevertheless significant breaks in 365 synteny between F11 and H37Rv, with two regions deemed particularly hypervariable, in 366 which no variant could be confidently called [44]. For the strain-to-representative genome 367 alignments in this study, we considered SNP calls only within one-to-one alignment blocks 368 and cannot exclude the possibility that repetitive or highly mutable regions within these 369 blocks have been misaligned. However, we did not seek to identify and exclude SNPs from 370 these regions as, even if present, this would have a systematic negative effect on the 371 performance of each pipeline. To demonstrate this, we re-calculated each performance metric 372 for the 209 pipelines evaluated using real sequencing data after identifying, and masking, 373 repetitive regions of the reference genome with self-self BLASTn (as in [77]). As we already 374 required reference bases within each one-to-one alignment block to be supported by both 375 nucmer and Parsnp calls (that is, implicitly masking ambiguous bases), we found that repeat-376 masking the reference genome had negligible effect on overall F-score although marginally 377 improved precision (see Supplementary Text 1). However, it is important to note that the 378 parameters used for repeat-masking will determine which paralogues will be successfully 379 masked. For the purpose of this study, we used reasonably conservative parameters (detailed 380 in Supplementary Text 1) and so expect to have primarily masked more similar paralogues. 381 The likelihood of mis-mapping (and thereby false positive SNP calling) would increase 382 among more divergent paralogues, although optimising parameters to detect these is non-383 trivial. More lenient repeat-masking parameters, in masking more divergent positions, would 384 also reduce the number of true SNPs it is possible to call. 385 386 Furthermore, when aligning reads from one genome to a different genome, it is not possible 387 to recover all possible SNPs introduced with respect to the former, as some will be found 388 only within genes unique to the original genome (of which there can be many, as bacterial 389 species have considerable genomic diversity; see Supplementary Table 5). Nevertheless, 390 there is a strong relationship between the total number of SNPs introduced in silico into one 391 genome and the maximum number of SNPs it is possible to call should reads instead be 392 aligned to a divergent genome (Supplementary Figure 3). In any case, this does not affect the 393 evaluation metrics used for pipeline evaluation, such as F-score, as these are based on 394 proportional relationships of true positive, false positive and false negative calls at variant 395 sites. However, we did not count true negative calls (and thereby assess pipeline specificity) 396 as these can only be made at reference sites, a far greater number of which do not exist when 397 aligning between divergent genomes. 398 399 While the programs chosen for this study are in common use and the findings generalisable, it 400 is also important to note that they are a subset of the tools available (see Supplementary Text 401 1). It is also increasingly common to construct more complex pipelines that call SNPs with 402 one tool and structural variants with another (for example, in [78] We have also assumed that given the small genome sizes of bacteria, a consistently high 421 depth of coverage is expected in non-simulated datasets, and so have not evaluated pipeline 422 performance on this basis (discussed further in Supplementary Text 1). In any case, a 423 previous study found that with simulated NextSeq reads, variant calling sensitivity was 424 largely unaffected by increases in coverage [55]. It has also been reported that random 425 polymerase errors have minimal effect on variant calls for sequencing depths greater than 20-426 fold, and that these are primarily of concern only when calling minor variants [75]. 427 428 Finally, so as to approximate 'out of the box' use conditions, we made a minimal effort 429 application of each program with no attempt at species-specific optimisation. Had we 430 optimised the individual components of an analytic pipeline (which, although often structured 431 around, are not limited to one aligner and one caller), we could conceivably reduce the high 432 variance in F-score when SNP calling from real data which, in this study, was notably 433 divergent (see

Recommendations for bacterial SNP calling 460
Our results emphasise that one of the principal difficulties of alignment-based bacterial SNP 461 calling is not pipeline selection per se but optimal reference genome selection (or, 462 alternatively, its de novo creation, not discussed further  Table  484 2), correlating strongly with the total number of SNPs between the strain genome and the 485 representative genome (Spearman's rho = 0.97, p < 10 -15 ), and to a reasonable degree with 486 the proportion of bases unique to the strain genome (Spearman's rho = 0.48, p < 10 -15 ). More 487 closely related genomes would have lower Mash distances and so be more suitable as 488 reference genomes for SNP calling. This would be particularly appropriate if, for example, 489 studying transmission events as a closely-related reference would increase specificity, 490 irrespective of the aligner or caller used. For larger studies that require multiple samples to be 491 processed using a common reference, the choice of reference genome could be one which 492 'triangulates' between the set of samplesthat is, has on average a similar distance to each 493 sample, rather than being closer to some and more distant from others. 494 495 Using a highly divergent genome (such as the representative Enterobacter genomes in the 496 real dataset, each of which differs from the reads by a Mash distance > 0.1; Supplementary 497 If considering the overall performance of a pipeline as the sum of the 7 different ranks for the 535 different metrics considered, then averaged across the full set of species' genomes, the 536 highest performing pipelines are, with simulated data, Snippy and those utilising Novoalign 537 in conjunction with LoFreq or mpileup (Table 2), and with real (more divergent) data, those 538 utilising NextGenMap or SMALT in conjunction with LoFreq, mpileup or Strelka 539 (Supplementary Table 10). 540 541 Some of the higher-performing tools apply error-correction models that also appear suited to 542 bacterial datasets with high SNP density, despite their original primary use case being in 543 different circumstances. For instance, SNVer (which in conjunction with BWA-mem, ranks 544 second to Snippy for N. gonorrhoeae; see Table 2) implements a statistical model for calling 545 SNPs from pooled DNA samples, where variant allele frequencies are not expected to be 546 either 0, 0.5 or 1 [46]. SNP calling from heterogeneous bacterial populations with high 547 mutation rates, in which only a proportion of cells may contain a given mutation, is also 548 conceptually similar to somatic variant calling in human tumours, where considerable noise is 549 expected [75]. This is a recommended use case for Strelka, which performed highly on real 550 (and particularly divergent) data, being among the top-performing pipelines when paired with 551 many aligners (Figure 7). 552 553 Irrespective of pipeline employed, increasing Mash distances between the reads and the 554 reference increases the number of false negative calls (Supplementary Figure 2). 555 Nevertheless, Snippy, which employs Freebayes, is particularly robust to this, being among 556 the most sensitive pipelines when evaluated using simulated data ( Figure 5 and 557 Supplementary Figure 4). Notably, Freebayes is haplotype-based, calling variants based on 558 the literal sequence of reads aligned to a particular location, so avoiding the problem of one 559 read having multiple possible alignments (increasingly likely with increasing genomic 560 diversity) but only being assigned to one of them. However, as distance increases further, it is 561 likely that reads will cease being misaligned (which would otherwise increase the number of 562 false positive calls) but rather they will not be aligned at all, being too dissimilar to the 563 reference genome. 564

565
With an appropriate selection of reference genome, many of these higher-performing 566 pipelines could be optimised to converge on similar results by tuning parameters and post-567 processing VCFs with specific filtering criteria, another routine task for which there are many 568 different choices of application [101][102][103][104]. In this respect, the results of this study should be 569 interpreted as a range-finding exercise, drawing attention to those SNP calling pipelines 570 which, under default conditions, are generally higher-performing and which may be most 571 straightforwardly optimised to meet user requirements. 572

573
Conclusions 574  575 We have performed a comparison of SNP calling pipelines across both simulated and real 576 data in multiple bacterial species, allowing us to benchmark their performance for this 577 specific use. We find that all pipelines show extensive species-specific variation in 578 performance, which has not been apparent from the majority of existing, human-centred, 579 benchmarking studies. While aligning to a single representative genome is common practice 580 in eukaryotic SNP calling, in bacteria the sequence of this genome may diverge considerably 581 from the sequence of the reads. A critical factor affecting the accuracy of SNP calling is thus 582 the selection of a reference genome for alignment. This is complicated by ambiguity as to the 583 strain of origin for a given set of reads, which is perhaps inevitable for many recombinogenic 584 species, a consequence of the absence (or impossibility) of a universal species concept for 585 bacteria (but see [105] The coordinates of each SNP inserted into a given genome are, by definition, genome-(that 625 is, strain-) specific. As such, it is straightforward to evaluate pipeline performance when 626 reads from one genome are aligned to the same reference. However, in order to evaluate 627 pipeline performance when reads from one genome are aligned to the genome of a divergent 628 strain (that is, the representative genome of that species), the coordinates of each strain's 629 genome need to be converted to representative genome coordinates. To do so, we made 630 whole genome (core) alignments of the representative genome to both versions of the strain 631 genome (one with and one without SNPs introduced in silico) using nucmer and dnadiff, 632 components of MUMmer v4.0.0beta2 [56], with default parameters (illustrated in Figure 1). 633 For one-to-one alignment blocks, differences between each pair of genomes were identified 634 using MUMmer show-snps with parameters -Clr -x 1, with the tabular output of this program 635 converted to VCF by the script MUMmerSNPs2VCF.py 636 (https://github.com/liangjiaoxue/PythonNGSTools, accessed 16 th August 2018). The two 637 resulting VCFs contain the location of all SNPs relative to the representative genome (i.e. 638 inclusive of those introduced in silico), and all inter-strain variants, respectively. We 639 excluded from further analysis two strains with poor-quality strain-to-representative whole 640 genome alignments, both calling < 10% of the strain-specific in silico SNPs (Supplementary  641   Table 11). The proportion of in silico SNPs recovered by whole genome alignment is detailed 642 in Supplementary Table 11 and is, in general, high: of the 254 whole genome alignments of 643 non-representative to representative strains across the 10 species, 222 detect > 80% of the in 644 silico SNPs and 83 detect > 90%. For the purposes of evaluating SNP calling pipelines when 645 the reference genome differs from the reads, we are concerned only with calling the truth set 646 of in silico SNPs and so discard inter-strain variants (see below). More formally, when using 647 each pipeline to align reads to a divergent genome, we are assessing the concordance of its 648 set of SNP calls with the set of nucmer calls. However, it is possible that for a given call, one 649 or more of the pipelines are correct and nucmer is incorrect. To reduce this possibility, a 650 parallel set of whole genome alignments were made using Parsnp v1. between the adapter sequences), which in real data is often variable, being sensitive to the 670 concentration of DNA used [109]. For read length x, we assumed an insert size of 2.2x, i.e. 671 for 300bp reads, the insert size is 660bp (Illumina paired-end reads typically have an insert 672 longer than the combined length of both reads [110]). The number of reads simulated from 673 each genome is detailed in Supplementary Table 3 and is equivalent to a mean 50-fold base-674 level coverage, i.e. (50 x genome length)/read length. 675 676 Perfect (error-free) reads were simulated from each SNP-containing genome using wgsim 677 parameters -e 0 -r 0 -R 0 -X 0 -A 0 (respectively, the sequencing error rate, mutation rate, 678 fraction of indels, probability an indel is extended, and the fraction of ambiguous bases 679 allowed). 680 681 Each set of reads was then aligned both to the genome of the same strain and to the 682 representative genome of that species (from which the strain will diverge), with SNPs called 683 using 41 different SNP calling pipelines (10 callers each paired with 4 aligners, plus the self-684 contained Snippy). The programs used, including version numbers and sources, are detailed 685 in Supplementary Table 1, with associated command lines in Supplementary Text 1. All 686 pipelines were run using a high-performance cluster employing the Open Grid Scheduler 687 batch system on Scientific Linux 7. No formal assessment was made of pipeline run time or 688 memory usage. This was because given the number of simulations it was not tractable to 689 benchmark run time using, for instance, a single core. The majority of programs in this study 690 permit multithreading (all except the callers 16GT, GATK, Platypus, SNVer, and 691 SNVSniffer) and so are in principle capable of running very rapidly. We did not seek to 692 optimise each tool for any given species and so made only a minimum effort application of 693 each pipeline, using default parameters and minimal VCF filtering (see below). This is so that  For each pipeline, we calculated the absolute number of true positive (TP; the variant is in the 806 simulated genome and correctly called by the pipeline), false positive (FP; the pipeline calls a 807 variant which is not in the simulated genome) and false negative SNP calls (FN; the variant is 808 in the simulated genome but the pipeline does not call it). We did not calculate true negative 809 calls for two reasons. Firstly, to do so requires a VCF containing calls for all sites, a function 810 offered by some variant callers (such as mpileup) but not all. Secondly, when aligning reads 811 to a divergent genome, a disproportionately large number of reference sites will be excluded, 812 particularly in more diverse species (for example, gene numbers in N. gonorrhoeae differ by 813 up to a third; see Supplementary Table 5). 814

815
We then calculated the precision (positive predictive value) of each pipeline as TP/(TP+FP), 816 recall (sensitivity) as TP/(TP+FN), miss rate as FN/(TP+FN), and total number of errors 817 (FP+FN) per million sequenced bases. We did not calculate specificity as this depends on 818 true negative calls. We also calculated the F-score (as in [55]), which considers precision and 819 recall with equal weight: F = 2 * ((precision * recall) / (precision + recall)). The F-score 820 evaluates each pipeline as a single value bounded between 0 and 1 (perfect precision and 821 recall). We also ranked each pipeline based on each metric so thatfor examplethe 822 pipeline with the highest F-score, and the pipeline with the lowest number of false positives, 823 would be rank 1 in their respective distributions. As an additional 'overall performance' 824 measure, we calculated the sum of ranks for the 7 core evaluation metrics (the absolute 825 numbers of TP, FP and FN calls, and the proportion-based precision, recall, F-score, and total 826 error rate per million sequenced bases). Pipelines with a lower sum of ranks would, in 827 general, have higher overall performance. 828

829
We note that when SNPs are called after aligning reads from one strain to that of a divergent 830 strain, the SNP calling pipeline will call positions for both the truth set of strain-specific in 831 silico SNPs and any inter-strain variants. To allow a comparable evaluation of pipelines in 832 this circumstance, inter-strain calls (obtained using nucmer and Parsnp; see above) are 833 discarded and not explicitly considered either true positive, false positive or false negative. 834 While the set of true SNPs when aligning to a divergent strain will be smaller than that when 835 aligned to the same strain (because all SNPs are simulated in genic regions but not all genes 836 are shared between strains), this will not affect proportion-based evaluation metrics, such as 837  Table 1) were evaluated using two different genomes for read alignment: the 866 original genome from which the reads were simulated and a divergent genome, the species-867 representative NCBI 'reference genome'. In the latter case, it will not be possible to recover 868 all of the original in silico SNPs as some will be found only within genes unique to the 869 original genome. Accordingly, to evaluate SNP calls, the coordinates of the original genome 870 need to be converted to those of the representative genome. To do so, whole genome 871 alignments were made using both nucmer and Parsnp, with consensus calls identified within 872 one-to-one alignment blocks. Inter-strain SNPs (those not introduced in silico) are excluded. 873 The remaining subset of in silico calls comprise the truth set for evaluation. There is a strong The median F-score across the complete set of 41 pipelines, per strain, decreases as the 893 distance between the strain and the reference genome increases (assayed as the Mash 894 distance, which is based on the proportion of k-mers shared between genomes). Each point 895 indicates the median F-score, across all pipelines, for the genome of one strain per species (n 896 = 254 strains). Points are coloured by the species of each strain (n = 10 species). Summary 897 statistics for each pipeline are shown in Supplementary Quantitatively similar results are seen if assaying distance as the total number of SNPs 900 between the strain and representative genome, i.e. the set of strain-specific in silico SNPs 901 plus inter-strain SNPs (Supplementary Figure 1). 902 903 Figure 4. Stability of pipeline performance, in terms of F-score, with increasing genetic 904 distance between the reads and the reference genome. 905 The performance of a SNP calling pipeline decreases with increasing distance between the 906 genome from which reads are sequenced and the reference genome to which they are aligned. 907 Each point shows the median difference in F-score for a pipeline that calls SNPs when the 908 reference genome is the same as the source of the reads, and when it is instead a 909 representative genome for that species. Points are coloured according to the variant caller in 910 each pipeline, with those towards the top of the figure less affected by distance. Lines fitted 911 using LOESS smoothing. 912 Panel A shows that pipelines evaluated using real sequencing data show reduced performance 927 with increasing Mash distances between the reads and the reference genome, similar to that 928 observed with simulated data (see Figure 3A). Each point indicates the median F-score, 929 across all pipelines, for the genome of an environmentally-sourced/reference isolate (detailed 930 in Supplementary Table 8). Panel B shows that pipelines evaluated using real and simulated 931 sequencing data have comparable accuracy. Each point shows the median precision of each 932 of 41 pipelines, calculated across both a divergent set of 254 simulated genomes (2-36 strains 933 from ten clinically common species) and 18 real genomes (isolates of Citrobacter, 934 Enterobacter, Escherichia and Klebsiella). The outlier pipeline, with lowest precision on both 935 real and simulated data, is Stampy/Freebayes. Raw data for this figure are available in 936 Supplementary Tables 6 (simulated genomes) and 9 (real genomes).  F-score Table 1 Click here to access/download; Table;Table 1.xlsx  22  10  24  20  25  11  32  15  20  19  32  26  21  31  22  25  28  32  14  25  30  27  34  21  20  22  25  29  33  30  29  30  26  24  26  31  28  28  27  26  14  23  35  27  31  30  23  34  25  34  28  31  17  29  37  20  32  32  24  35  27  35  34  28  27  37  30  32  33  34  37  24  33  8  41  39  38  33  31  38  36  33  29  28  40  37  38  35  39  39  34  39  29  38  35  36  39  33  39  36  31  38  41  36  40  37  40  40  36  40  35  40  41  41  38  41  37  41 ameters, shows improved performance.