Genotyping of evolving prokaryotic populations

1 2 Genomic heterogeneity of bacterial species is observed and studied in experimental evolution 3 experiments, clinical diagnostics and occurs as micro-diversity of natural habitats. The challenge for 4 genome research is to accurately capture this heterogeneity with the currently used short sequencing 5 reads. Recent advances in NGS technologies improved the speed and coverage and thus allowed for 6 deep sequencing of bacterial populations. This facilitates the quantitative assessment of genomic 7 heterogeneity, including low frequent alleles or haplotypes. However, false positive variant predictions 8 due to sequencing errors and mapping artifacts of short reads need to be prevented. 9 We therefore created VarCap, a workflow for the reliable prediction of different types of variants even 10 at low frequencies. In order to predict SNPs, indels and structural variations, we evaluated the 11 sensitivity and accuracy of different software tools using synthetic read data. The results suggested 12 that the best sensitivity could be reached by a combination of different tools. We identified possible 13 reasons for false predictions and used this knowledge to improve the accuracy by post-filtering the 14 predicted variants according to properties such as frequency, coverage, genomic 15 environment/localization and co-localization with other variants. This resulted in the reliable prediction 16 of variants above a minimum relative abundance of 2%. 17 VarCap is designed for being routinely used within experimental evolution experiments or for clinical 18 diagnostics. The detected variants are reported as frequencies within a vcf file and as a graphical 19 overview of the distribution of the different variant/allele/haplotype frequencies. The source code of 20 VarCap is available at https://github.com/ma2o/VarCap. In order to provide this workflow to a broad 21 community, we implemeted VarCap on a Galaxy webserver (Afgan et al. 2016), which is accessible at 22 http://galaxy.csb.univie.ac.at. 23 PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2449v1 | CC BY 4.0 Open Access | rec: 14 Sep 2016, publ: 14 Sep 2016


INTRODUCTION 1 2
The genotyping of heterogeneous populations of one prokaryotic species is an increasingly important 3 method to address microbiological questions regarding population composition and dynamics under 4 prevalent selective pressures. This approach is e.g. used in experimental evolution experiments 5 (Barrick and Lenski 2013) and studies of host -pathogen systems (Gardy et al. 2011, Bos et al. 2011 McElroy, Thomas, and Luciani 2014). Recent developments in Next-Generation-Sequencing (NGS) 7 technologies allow for sequencing at high coverage within a short timeframe, however limited to short 8 read length.

9
The classical approach of assembling genomes out of short DNA reads preferably reconstructs the 10 most abundant genotype into genome contigs and scaffolds. In order to retrieve haplotype frequency 11 information, reads need to be mapped onto the assembly or a reference genome. Variant calling is 12 then performed on the alignment of the reads. The predicted variants can be phased into haplotypes 13 or alleles if a whole haplotype reconstruction is not possible due to insufficient linkage of the variant 14 sites. The variant prediction, however, can lead to false positives due to sequencing errors, such as 15 indels and substitutions. The reads may be misplaced during mapping due to their short length and 16 thus can lead to false positive variant calls (Li 2014). Sequencing errors can be partially reduced by 17 quality filtering and error correction (Yang, Chockalingam, and Aluru 2013). As a consequence, the 18 substitution error rate for Illumina could be decreased below one percent while indel homopolymer 19 errors showed to accumulate logarithmically with the length of the stretches (Minoche,Dohm,and 20 Himmelbauer 2011) and can thereby reliably identified.

21
Most of the genotyping studies of prokaryotes so far have been done by resequencing of clonal 22 bacterial cultures (Maharjan et al. 2013, Blount et al. 2012. Deep sequencing of non-clonal cultures 23 was mainly done for metagenomic profiling of communities (Qin et al. 2010) and only to a minor 24 extend for the characterization of allele frequencies (Eyre et al. 2013). The genotyping of non-clonal 25 variants in heterogeneous populations, however, remains challenging (DePristo et al. 2011, Nielsen et 26 al. 2011, Kofler and Schlötterer 2014, Pulido-Tamayo et al. 2015. 27 In order to get a most complete picture of the different haplotype or allele frequencies it is fundamental 28 to exploit high coverage sequencing data and to detect all types of variants, which are SNPs, Indels 29 and structural variations (SV). Therefore it is necessary to integrate several variant calling software 30 tools, which utilize different approaches for the detection of the different kinds of variants.

31
Commonly used tools to identify SNPs are Samtools/bcftools and GATK , McKenna et 32 al. 2010). These tools were developed with the assumption to detect variants within diploid organisms, 33 which limits their detection power for haploid prokaryotes. Therefore we also considered the more 34 generic tool VarScan2 (Koboldt et al. 2012), which can predict SNP frequencies in low and high coverage data, and LoFreq-Star (Wilm et al. 2012), which was specifically developed to predict SNPs 1 at low frequencies in high coverage datasets. They all work on read alignments or mpileup files and 2 use read and mapping quality scores as well as strand bias filters to reliably detect SNPs. In addition 3 Samtools/bcftools and VarScan2 can also be used to identify small indels. Pindel (Ye et al. 2009) uses 4 a pattern growth algorithm to detect small and large indels from 1bp up to 10kb. Large indels and 5 structural variations (SV), such as translocations, duplications and inversions, are detected by 6 Breakdancer and Delly (K. , Rausch et al. 2012), as they make use of insert size 7 deviations, paired end information and split read information to find variations larger than 300 bp. As 8 an alternative, Cortex_var (Iqbal et al. 2012) does not rely on mapped reads but uses de novo 9 assembled contigs, which are compared to each other or to a reference in order to identify most kinds 10 of variants. All those approaches have been designed for different degrees of heterozygosity, ranging 11 from diploid genomes to multiploid populations with low abundant genotypes.

12
The genotyping of prokaryote populations in experimental evolution experiments is typically based on 13 many NGS datasets with high coverage. There is therefore a demand for fully automated software for 17 ). Large insertions hereby also mimic the process of horizontal gene transfer. As 18 structural variations are reported to be crucial for bacterial genome evolution, we also added few 19 translocation, duplication and inversion sites to challenge the detection software.

20
The SNPs were positioned as single seeds, to which the remaining SNPs were randomly assigned 21 within a distance of 1 to 15 bases (max 4 allowed) with decreasing probability in order to create 22 mutation hotspots.  At the moment there is no software tool or method, that could detect all different types of variants 7 simultaneously, which are relevant for prokaryotic genomes. Therefore we separately evaluated 8 variant detection tools for SNPs, InDels and structural variants (SV). Representative methods for 9 these three targets were selected according to their underlying methodologies. In order to identify the 10 variant calling tools that most sensitively and reliably detect low abundant variant, we initially utilized 11 our most basic variation model (mixvar_1  (Yost et al. 2013), whereas LoFreq-Star, Varscan2 and Cortex_var should be 23 able to detect low frequency variants from high coverage sequencing data. Variants were simulated 24 and evaluated at minimum relative abundance (MRA) cutoffs of 10%, 5% and 2%. This means that 25 ideally all variants present at and above those frequencies should be detected. At a MRA of 10%, 26 variants were detected by all SNP calling software tools at a similar sensitivity (Fig. 1 A). According to 27 the expectations, the detection rate of GATK and Samtools/bcftools was worse compared to the other 28 programs when the MRA was reduced to 5%, 2% and 1% (Fig. 1 A). At a low MRA of 1% LoFreq-Star 29 shows less sensitivity than Varscan2. This is to be expected, as LoFreq-Star builds its own error 30 model and detection threshold to avoid FP and therefore detects no variants below that threshold Varscan2 and Pindel were used for the detection of small Indels, and Pindel, Breakdancer, Delly and 1 Cortex_var for the detection of larger Indels. For small Indels, the MSA approach used by Varscan2 2 should perform at a similar rate as the pattern growth algorithm used by Pindel. Pindel, however, is 3 designed to detect indels from 1-10000bp as it uses a mapping/pattern growth/split read approach. 4 Therefore it should be able to detect the positions of small and large indels with base pair precision.

5
Breakdancer and Delly are designed for the detection of Indels larger than 300bp. They use paired 6 end read information for Indel detection, therefore the position of the large Indels may not be reported 7 at bp resolution. Cortex_var is expected to be less sensitive because of the de-novo assembly 8 approach, however it can supply more information than the mapping approaches, including e.g. 9 position, length and sequence of an insertion.

10
The detection rate of Indels showed little effect to different MRA values (Fig. 1 B) (except 11 Samtools/bcftools, see discussion above). Instead, the sensitivity is related to the methodology 12 underlying the software. We observed that Varscan2 can only detect very short Indels (1bp)  For the detection of SV we used Pindel, Breakdancer and Delly, and we added Cortex_var specifically 25 for inversions. They differ slightly in their methodological approaches, therefore we expected Delly to 26 be superior to Breakdancer because of the additional split read alignment. Moreover we expect a 27 limitation of Pindel at larger rearrangements, because the pattern growth algorithm is used within 28 defined limits (up to 10kb). All tools should be able to detect inversions, however they are reported as 29 being harder to detect than other SVs. Breakdancer and Delly detected SV, like duplications and 30 transpositions, regardless of the MRA with high sensitivity (>90%). As expected, the detection rate of 31 Pindel is lagging behind (80%) according to of the suggested internal limits of 10kb. However, the 32 pattern growth method of Pindel was more precise in terms of position and length of the SV as it 33 always hit the exact starting position while Breakdancer and Delly can be off up to 70 bases (Fig. 1 C).

34
We additionally found, that large Indels were called at the sites of translocations events (Fig. 1 C).

35
This is not entirely unexpected, as a translocation consists of an excision and the consecutive 36 insertion of the excised genomic fragment. The excision can also be seen as a deletion of a fragment 1 and is therefore a partial detection of a more complex type of variant.
2 Inversions, however, could only be detected at a minor fraction as break positions by Pindel (70% as 3 break positions) and as inversion by Cortex_var (10%) (Fig. 1 C inv) For larger variants or SV we observed that a combination of pattern growth, split read and paired end 10 read information approaches, which are used by Pindel, results in high sensitivity. This method works 11 well within defined limits (1-10kb). By using only paired end information (Breakdancer), it is possible to 12 detect larger variants at the cost of a lower length limit (300bp) and a coarser resolution of the variant 13 position. Cortex_var, however, was inferior in sensitivity but revealed more information about the 14 detected variants by using a de-novo approach. This information can be used to correctly identify the 15 type, position, length or sequence of the variant. Therefore we use Pindel, Breakdancer and 16 Cortex_var for large Indels and Breakdancer, Delly, Pindel and Cortex_var for SV.

17
Due to the different variant calling abilities of the different tools at low frequencies, we combined 18 different tools to increase the sensitivity (Fig. 2 A). Beyond sensitivity we also monitored the precision 19 of the different tools for each type of variant in order to avoid methods that have excessive numbers of 20 FP (Fig. 3). As a consequence, Cortex_var was used to predict InDels and inversions but not for 21 SNPs as it accumulated many false positive SNPs in certain areas at low frequencies. Taking together 22 all selected software tools we were able to detect all variants, except inversions, at a MRA of down to 23 2% with high sensitivity (Fig.  4).

24
3.2 VarCap -a variant calling workflow with high sensitivity and specificity 1 2 3.2.1 False Positives due to sequencing errors 3 4 False positives occur due to sequencing errors, which are typically present at and below a rate of 1%, 5 therefore we expect them to cause FP calls at and below this relative abundance. In order to study the 6 influence of sequencing errors on different software detection tools, we analyzed 7 differentially 7 composed samples at MRAs of 2% and 1% (mono_02-07). At a MRA of 2% we observed a false 8 positive rate for SNPs, small Indels and Duplications of 0.5 to 1 FP per Megabase (Mb) (Fig. 5 B MRA   9 2). At a lower MRA of 1%, we observed an increase in FP ( Table 1). At a MRA of 1%, we could nearly 10 completely find all types of variants, except inversions, which we could identify at a rate of 95%.

11
However, the false positive rate for SNPs increased to 80 FP per Mb, while the FP rate for other 12 variants stayed below one FP per Mb (Fig. 5 A,B MRA 1). This clearly demonstrates that false 13 positive SNPs are caused by sequencing errors, while the other types of variants stayed at the low 14 rate (~1FP/Mb).

15
In order to get more insights about the other FP, we examined them in detail at both MRAs. We found 16 that FP of small Indels locate within repetitive regions of the genome. These regions are almost 17 identical areas of the genome at a size that is longer than the insert size of the reads and have an edit 18 distance of 3 or less. Due to their similarity, variant reads can be mapped to similar regions and cause 19 FP calls there.

20
In order to evaluate how MAA and coverage influence the FP rate, we simulated sequencing coverage 21 from 80 to 1600x (using mixvar_1) and used MAAs from 4 to 20 to filter FP ( Fig. 6 ). We detected, 22 that it is necessary to use an MAA cutoff in addition to an MRA cutoff to avoid FP calls at lower 23 coverages ( Fig. 6, see MRA2 at coverage 160x).

25
In order to identify and exclude false positives we apply the following filters: To avoid FP SNP calls 26 caused by sequencing errors we apply a MRA of 2%. To avoid FP due to reads mapped to repetitive 27 regions, we mask nearly identical regions according to the properties described above within the 28 reference genome and tag variants that are found within these regions. In order to resolve FP that are 29 caused by incomplete detection of the true variant type, we prioritize larger over smaller variants. Mismapped reads have been reported as the cause of FP (Li 2014). Thereby incomplete reference genomes lead to reads getting mapped to similar regions and cause FP calls there. To review this 1 finding at a MRA of 2%, we mapped reads without variants back onto an artificially shortened 2 reference genome. We observed ~180 FP SNPs/75 FP per Mb which were present at different 3 abundances (20%, 8%, 3%) and grouped into hotspots (Fig. 7 A). False positive variants were not 4 observed when mapping the reads to the correct reference (Fig. 7 B). This finding strongly supports 5 our assumption that wrongly mapped reads cause FP variant calls. A closer investigation of the 6 relevant regions revealed the presence of neighboring break positions, which may indicate both: either 7 a larger structural variation or mismapped reads due to an incomplete reference genome.

8
To identify possible false positives due to mismapped reads, we implemented the following filtering 9 steps: As suggested in prior discussion of this topic (Li, 2014) we used the coverage information at the 10 variant sites to tag possible false positives. However, coverage information alone is too coarse for the 11 resolution of low frequent FP. Therefore we additionally monitor break positions that flank or reside at 12 the variant positions to identify regions with mismapped reads. As all FP were present as small 13 clusters or hotspots, we tagged regions that hosted more than 4 SNPs within a sliding window at the 14 double length of the insert size and were accompanied by a break position (BP) as possible FP 15 causing regions. With the application of these filters we could identify and exclude the FP calls ( Fig. 7 16 C).

17
A closer look at inversions revealed, that they were mostly not identified as inversions but the start and 18 the end point of the inversion were marked as break positions (Supp. Table 1 We observed that a gain in variant calling sensitivity decreased the accuracy. Therefore we added a 27 post-filtering step to the workflow in order to eliminate possible FP. We incorporated a post-processing 28 step for each variant that aims to eliminate FP due to sequencing errors, repetitive regions, partially 29 detected variants and mismapped reads due to reference incompleteness. As a consequence of the 30 dissimilar variant detection rates of some methods, we decided to use more than one tool for each 31 type of variant. In order to gain accuracy and robustness, for high confidence variants, a variant call 32 had to be supported by at least two different tools. This step further contributed to an improved 33 accuracy at low MRA cutoffs (1%), while the detection rate was only slightly diminished ( Table 1).
3.3 Genotyping of diverse synthetic prokaryotic populations 1 2 3.3.1 Detection rates in different genomes 3 4 Genomes exhibit different properties, such as G+C content and size, which could potentially affect the 5 sensitivity and accuracy of variant calling. Therefore we evaluated our variant calling workflow on six 6 different genomes. These organisms consisted of five bacteria and one archaeon, with differing G+C 7 content ranging from 26 to 72 percent as well as a differing genome size ranging from 0.68 to 8.66 8 Mb. The workflow was used with a MRA of 2% as well as at a MAA of 8 reads supporting a variation.

9
In concordance to our previous results we could detect most of the (simulated) variants (>90%).
10 However, at a MRA of 2% we could not observe any dependency on G+C content or genome size 11 while the MAA of 8 reads resulted in fewer variant detections at high G+C content and genome size 12 (Fig. 8). Therefore we decided to use the more flexible MRA as a minimum boundary for variant 13 detection as it showed less influence to different genome properties. 14 15 3.3.2 Detection rates in a distantly evolved population 16 17 More distantly evolved populations carry higher variant frequencies as tested before, which could 18 affect the sensitivity of variant calling. Therefore ALFSim (Dalquen et al. 2012) was used to simulate a 19 more distantly evolved population by integrating evolutionary changes (SNPs, Indels and duplications) 20 into the P. amoebophila genome. The evolved genome showed a similarity to the reference around 21 99%, as it contained around 21000 SNPs, 100 Indels and three gene duplications.

22
We evaluated the sensitivity of the variant calling by VarCap at a low abundant subpopulation of 4%.

23
We used a MRA of 3%, 2% and 1% as well as a MAA of eight reads (equals a MRA of 2% in a 400x 24 covered genome). Depending on the minimum abundance requirements, we were able to detect 25 between 90% and 99% of all SNPs, between 74% and 94% of all indels and 2 out of 3 duplications.

26
The true positive detection rate of SNPs increased to 98%, while the false positive rate remained 27 below 0.3% when lowering the MRA from 3 to 2%. However, if we lowered MRA further to 1%, we 28 increased the TP rate to 99% while augmenting the FP rate close to 400 FP/Mb (Fig. 9A). At a MRA 29 of 2% we could locate most FP within repetitive regions and recent duplications (Fig. 9B), while at a 30 MRA of 1% we detected mainly FP caused by the sequencing error rate (Fig. 9B). At a MRA of 2%, 31 we were able to detect over 90% of all Indels including all small Indels (size=1), without experiencing 32 false positives (Fig. 9A). With regard to duplications we were able to find two of them at most MRAs, 33 while missing out the shortest one constantly (Fig. 9C SV(DUP)). These findings confirm that we are 34 able to achieve a high accuracy even if the evolved genomes are rather dissimilar. However, a novel finding was that also recent duplications can lead to wrongly placed reads as they are similar to 1 repetitive regions. Therefore we also included tagging of duplicated regions as possible regions for FP 2 calls into our workflow. 3 4 3.4 Detecting variants in a real bacterial population after long term cultivation 5 6 In order to predict variant frequencies within an evolving population, the variant calling workflow was 7 applied to a long-term cultivation experiment of P. amoebophila. Different MRA cutoffs from 20% to 8 2% were used and revealed that variants were present at frequencies down to 2% (Fig. 10A, outer 9 rings). Variants within repetitive regions (Fig. 10A, inner connective lines) were tagged for further 10 inspection. At a MRA of 2% we observed a total number of 71 variants, which comprised of 34 SNPs, 11 20 Indels and 17 structural variants. The SNPs and small Indels were annotated using SNPEff 12 (Cingolani et al. 2012). This revealed, that around 83% of them were situated within coding regions 13 (Supp . Table 3). At a MRA of 2% we could find three Indels present at a subpopulation of 2%. They 14 are located within homopolymeric regions of length 10 (data not shown) and thus were tagged as 15 probable FP for further manual inspection.

DISCUSSION 1 2
Population genomics of microbes is most powerful if we meet the challenge to detect all types of 3 genomic variations even at low frequency. We therefore developed, evaluated and validated VarCap, 4 a workflow that allowed us to reliably identify variants even within low abundant alleles. We tested the 5 capabilities of the relevant variant calling tools and observed substantial sensitivity differences 6 between the different methods. In order to improve the overall sensitivity we decided to integrate 7 different tools for variant detection into a combined workflow, in which every variant can be detected 8 by more than one caller. As more tools are likely to introduce more errors, we also optimized the 9 overall accuracy. Detecting sequencing errors and mis-mapped reads was key to control the rate of 10 false positives. We observed that for SNPs a MRA of 2% was sufficient (MAA of 8 reads) to keep a 11 safety margin to false positives appearing at a MRA of 1%. This implies, that for detecting a 12 subpopulation present at 2% we need a minimum sequencing coverage of 400x. Sequencing 13 experiments should therefore aim for at least 500x to account for reads removed by quality filtering 14 and fluctuations in coverage along the genome. We could not detect any FP Indels within our 15 simulated data but detected several spurious Indels in homopolymer regions of the re-sequencing 16 experiment. These are probably sequencing/PCR artifacts that are not introduced by read simulators.

17
Based on our findings Indels below a MRA of 10% should be tagged as potentially false positive if they 18 are located within a homopolymeric region. Mismapped reads can occur within repetitive regions, 19 undetected duplications, or incomplete reference genomes. Therefore we flag repetitive regions 20 greater than the insert size in order to mark variants appearing within these regions for further 21 inspection. Unnoticed duplications or incomplete references cause reads to get mapped to similar 22 regions, which can be observed by higher coverage and/or variant accumulation within these areas. In 23 order to overcome false positives by misplaced reads, we suggest tagging variants that at least fulfill 24 two of the three following rules: I) Either variants lie within regions with a coverage of 20% above the 25 average and/or II) if there is a break position detected at or within read length of the variant site and/or 26 III) if they lie within a repeat region and/or IV) if more than 5 variants lie within the length of one insert 27 size. Furthermore, for extracting high confidence variants, we requested each variant to be confirmed 28 by at least two callers.

29
We observed that insertions and especially inversions were harder to detect than the rest of the 30 variations. This is not unexpected, as current methods for their prediction need sufficient support by 31 reads, which may get lost at low frequencies. In the simulated evolution data we missed the shortest 32 duplication constantly. This may be related to a combination of callers working at their operational 33 limits (300 bp) and a diverging evolution of the duplicated sequence due to newly introduced SNPs.

34
According to our results, we could establish rules for filtering out errors and help with the interpretation 35 of different types of variations (e.g. SNP, duplications). Using these rules we have built a fully automated workflow that reliably predicts rare variants in deep sequencing data.     . It shows the detection rate of different SNP (Fig. 1A), Indels (small denotes small Indel, Fig. 1B) and SV callers (Fig. 1C) with respect to the MRA frequencies of 20, 10, 5, 2 and 1%. For breakdancer, pindel, delly and cortex, two values are given: detection rate of all Indels and specific detection rate for deletion or insertion only.         Coverage plots of simulated and resequenced data.The simulated reads without variants were mapped back to an incomplete reference (7A) and the complete reference (7B). The blue circles denote the total coverage along the genome while the green diamonds show the coverage of the FP variants and the red circles the total coverage at the FP positions. As a comparison we show the coverage distribution of sequenced reads against the complete reference in orange in the background of 7B. The coverage peaks at 1220000 and 2150000 are due to additionally mapped mitochondrial reads. The lightblue and orange lines show the average coverage distribution along the genome. 149 of 154 of the FP from 7A could be tagged and filtered by the properties coverage (COV), within repetitive region (REP), within SNP accumulating region (SAR) and located close to a break position (BP) as shown in 7C, the remaining 5 were single calls and thus eliminated by the constraint of 2 callers per variant.  Observed detection rates of variants which were simulated using a genome evolution software (ALFSim) and detected at different minimum abundances. 9A shows the sensitivity at MRAs of 3,2 and 1%. 9B shows the False Positives for SNPs as counts per Magabase at the different MRAs. At these minimum abundances no FP for Indels and SV were detected. 9C shows the FP per Megabase after filters have been applied. shows the prevalence of variations at MRAs of 20%, 10%, 5% or 2%, which are visible in the four differently colored outer circles and the presence of repetitive regions within the reference genome (inner connective lines). 10B shows a more detailed view of variations found at MRAs of 20%, 10%, 5% and 2%.