Evaluation of SNP calling methods for closely related bacterial isolates and a novel high-accuracy pipeline: BactSNP

Bacteria are highly diverse, even within a species; thus, there have been many studies which classify a single species into multiple types and analyze the genetic differences between them. Recently, the use of whole-genome sequencing (WGS) has been popular for these analyses, and the identification of single-nucleotide polymorphisms (SNPs) between isolates is the most basic analysis performed following WGS. The performance of SNP-calling methods therefore has a significant effect on the accuracy of downstream analyses, such as phylogenetic tree inference. In particular, when closely related isolates are analyzed, e.g. in outbreak investigations, some SNP callers tend to detect a high number of false-positive SNPs compared with the limited number of true SNPs among isolates. However, the performances of various SNP callers in such a situation have not been validated sufficiently. Here, we show the results of realistic benchmarks of commonly used SNP callers, revealing that some of them exhibit markedly low accuracy when target isolates are closely related. As an alternative, we developed a novel pipeline BactSNP, which utilizes both assembly and mapping information and is capable of highly accurate and sensitive SNP calling in a single step. BactSNP is also able to call SNPs among isolates when the reference genome is a draft one or even when the user does not input the reference genome. BactSNP is available at https://github.com/IEkAdN/BactSNP.

1. mean_branch_len=$(echo "scale=7; 10 / <root_genome_size> + 0.0000001" | bc -l) 2. evolveagene -f [root_genome.txt] -t ran -n 10 -b $mean_branch_len -i 0.1 -d 0.025 In order to simulate the situation where target isolates are closely related, EvolveAGene was executed so that about 10 substitutions were generated per branch. Since the target number of substitutions was exceedingly small compared to the genome size and the number of simulated substitutions depended on random numbers (the random seed could not be specified), the resulting average number of substitutions, s, often deviated from 10.
Therefore, we repeated EvolveAGene until the following inequality was satisfied.
9.5 ≤ s ≤ 10.5 Indels were also simulated by the option -i and -d with the default values, which represent the relative frequency of insertion and deletion against substitutions, respectively. EvolveAGene only accepts a root sequence whose length is a multiple of 3 considering the codon position, and thus, the last one or two bases were deleted before input when the root genome size was not a multiple of 3.
Though SNP callers detect SNPs on the position of reference genome, the true SNP positions are simulated on the root genome by EvolveAGene. In order to check whether the detected SNP positions in the reference sequence correspond to the correct SNP positions simulated in the root sequence, the correct SNP positions needed to be moved to the region where the root and reference sequences were aligned in a one-toone manner. Therefore, we moved the correct SNP positions to random positions in regions where nucmer generated one-to-one alignments ≥ 1 kbp in length between the reference-root sequences and edited the simulated genomes so that they had SNPs on the new positions by using our original program, move_snp (available at https://github.com/IEkAdN/BactSNP/tree/master/benchmark). Executed commands were as follows: Then, illumina paired-end reads for every simulated genome were simulated by art_illumina [4] using the following commands:
At last, SNP callers were executed using the simulated reads and the reference genome. Then, it was checked whether the detected SNPs matched with the simulated correct SNPs.

Common-region PPV
In order to compare PPVs among tools which mask the region where SNP calling is difficult in different ways, we introduced 'Common-region PPV', i.e., PPV calculated based on the SNPs detected in the region where all tools except Cortex [5] and CFSAN SNP Pipeline (CFSAN) [6] determined alleles for all isolates without masking. Because Cortex and CFSAN were not able to output information for non-variant regions, they were not considered when calculating Common-region PPV. The average ratio of the common region to the reference genome size is shown in Table S5.

Execution of SNP callers
Before executing each tool, low-quality regions were trimmed from reads by Platanus_trim (version 1.0.7) [7], and mapping was performed by BWA-MEM (version 0.7.15) [8] for tools that require mapping results as input. Duplicate reads generated by PCR duplication were removed by Picard MarkDuplicates (version 2.18.17) [9]. The commands were as follows:  [23] was executed as described in its own paper [9]. We used BWA-MEM (version 0.7.15) and GATK (version 3.8-0-ge9d806836) as the internal mapper and variant caller and set 3 and 0.9 as the values for 'minimum coverage threshold' and 'minimum acceptable proportion' (i.e., minimum allele frequency), respectively. The exact config file was generated by our original script, NASP_get_config.sh (available at https://github.com/IEkAdN/BactSNP/tree/master/benchmark). We used GATK version 3.8, because GATK version 4.0.11.0 failed in NASP. SNPs among isolates were extracted from the output of nasp using our original program, get_snp_nasp (available at https://github.com/IEkAdN/BactSNP/tree/master/benchmark). The commands were as follows: (i) PHEnix PHEnix (SHA-1 hash for the revision was 68d35c2) [24] was executed with the parameters recommended in its document. We used BWA-MEM (version 0.7.15) and GATK (version 3.8-0-ge9d806836) as the internal mapper and variant caller and set 10 and 0.9 as the values for 'min_depth' (i.e., minimum coverage depth) and 'ad_ratio' (i.e., minimum allele frequency), respectively. We used GATK version 3.8, because GATK version 4.0.11.0 failed in PHEnix. SNPs among isolates were extracted from the output of vcf2fasta using our original program, get_snp_phenix (available at https://github.com/IEkAdN/BactSNP/tree/master/benchmark). The commands were as follows: --sample-name <isolate_name> --filters "mq_score:30,min_depth:10,ad_ratio:0.9" based SNP callers sometimes call them as SNPs, even though they do not represent polymorphisms between the reference isolate and the sequenced isolate in the corresponding region. However, in a situation like that shown in Fig. S3, the allele frequency of such a false-positive SNPs should be low, because only reads from one of the repetitive segments support the alternative allele. Therefore, the allele frequency filter is considered to be effective against such false positives.

Benchmark using TreeToReads
In order to assess the performance of SNP callers under the situation where variants between the reference-root pair are also simulated, and structural variants (SVs) do not exist between the pair, we carried out another supplementary benchmark using TreeToReads (TTR) [25]. The input for TTR is a config file, and its main elements are a phylogenetic tree and a base genome which corresponds to one isolate in the tree. TTR simulates variants against the base genome and generates genome sequences of the other isolates in the tree.
By TTR, we simulated a situation similar to the first benchmark, i.e., target isolates of SNP calling were closely related and the reference isolate had about 99.9, 99, 98, or 97% identity against them. As the input base genome, we used the reference genome of the first reference-root pair in the 99.9% case of the original benchmark (Table S1), regardless of the target identities. The input tree was generated based on the tree simulated by EvoleveAGene against the first reference-root pair for each species and identity in the original benchmark. EvolveAGene outputs the simulated tree among the target isolates in newick format where branch lengths were represented as the number of substitutions. We converted the number of substitutions into the ratio of substitution to the root genome size and added a branch from the root to the reference isolate to the tree using our original script, TTR_get_newick.sh (available at https://github.com/IEkAdN/BactSNP/blob/master/benchmark). The branch length between the reference-root pair was calculated based on one-to-one alignments between the reference-root pair by using MUMmer and our original script TTR_run_paup.sh (available at https://github.com/IEkAdN/BactSNP/blob/master/benchmark), which executes PAUP (version 4.0a (build 164) for Unix/Linux) [26] internally. The commands were as follows: The other elements of the config file for TTR are the relative rates of substitutions from A to C, A to G, A to T, C to G, C to T, and G to T, the number of variable sites simulated on the base genome, and the parameters for mutation distribution and indel simulation. The relative rates of substitutions were calculated by TTR_run_paup.sh as described above. The number of variable sites was calculated using the following command so that the target isolates had about 99.9, 99, 98, or 97% identity against the reference genome: The parameters for mutation distribution and indel simulation were set as shown in the example config file of TTR (https://github.com/snacktavish/TreeToReads/blob/master/example_indels.config). The exact config file was generated by our original script, TTR_get_config.sh (available at https://github.com/IEkAdN/BactSNP/blob/master/benchmark), and TTR was executed as follows: TTR outputs aligned and unaligned fasta files that contain the simulated genome sequence of each isolate. The correct SNP data and illumina paired-end reads were generated from these fasta files using our original script TTR_fasta2snp (available at https://github.com/IEkAdN/BactSNP/tree/master/benchmark) and art_illumina [4], respectively. We used the same options for art_illumina as our original benchmark. The commands were as follows: PPV and sensitivity occasionally did not show a monotonic decline with increasing divergence, but the upticks were small in most cases (Table S6). PPV of Cortex showed a relatively large uptick (≥ 5%) at 97% case in E. coli, but this was because the total number of detected SNPs was small in both 97% and 98% cases. 1 out of 12 detected SNPs in the 97% case and 4 out of 16 detected SNPs in the 98% case were false-positive. PPV of GATK also showed a relatively large uptick (≥ 5%) at the 97% case in S. aureus, but this was because dense false positives in a region (25 false-positives in an 878 bp region) decreased the PPV at the 98% case.

Detailed description of BactSNP algorithm
First, BactSNP trims the low-quality regions and sequence adapters from sequence reads using Platanus_trim with the following command: Second, BactSNP de novo assembles the genome of each isolate using Platanus [27] with the following commands: The option -u is set to 0 to disable the bubble-crush function, designed for diploid organisms. Scaffolding and gap closing are not executed, because these procedures increase the number of mismatches between the true sequence and the assembled sequence, causing false-positive SNPs. Assembled contigs are aligned to the reference genome using nucmer with the following command: Using the nucmer alignment, the pseudogenome is generated for each isolate using our original program. Here, the pseudogenome means a genome sequence in which each site corresponds to a site of the reference genome in a one-to-one manner. Insertions against the reference are ignored, and deletions against it are reflected as '-' in the pseudogenome (Fig. S4). Basically, each allele of the pseudogenome is determined as an allele of the aligned contig. When multiple contigs are aligned to the same site of the reference genome and the aligned alleles are not unique, or when a site of the pseudogenome is near any indels, the allele is called as '-'. In other words, the allele is called as '-' if the site does not satisfy either of the following conditions: (1) n allele = 1 (2) d indel > 5 bp, where n allele is the number of alleles aligned at the site, and d indel is the distance from the nearest indel to the site.
In addition to assembling the reads, BactSNP maps reads to the reference genome by BWA-MEM and removes PCR-duplicated reads by MarkDuplicates in Picard with the following commands: (3) c allele ≥ 10 (4) c allele / c all ≥ 0.9 When BactSNP calculates c all , it counts all reads, even those supporting deletion against the reference genome. Lastly, SNPs among isolates are determined by using the pseudogenomes generated in one-to-one manner.
Values assigned to the option --max_snp were determined in the same way as in the simulation benchmarks.    [30]. Gray bars indicate the mapped reads; the region in which all nucleotides are colored (i.e. adenine, bright green; cytosine, blue; guanine, bright brown; thymine, red) indicates the unaligned part due to soft-clipping; the dotted line in the middle indicates the position where a false-positive SNP was called. We defined 'soft-clip region' as a region where five or more soft-clipped reads were mapped in at least one isolate, as shown by the blue rectangle; unaligned regions of reads were counted as if they were mapped adjacent to the aligned part without indels.