Nanopore quality score resolution can be reduced with little effect on downstream analysis

Abstract

Motivation: The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files.

Results: We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios and various quality score quantizers. In all cases, we found that the quantization of quality scores causes little difference in (and sometimes even improves) the results obtained with the original (non-quantized) data. This suggests that the precision currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores.

Availability and supplementary information: Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.


Introduction
The precision for quality score representation in FASTQ files has historically been relatively high, with each score taking a value within a set of more than 40 possible values. The precise definition of this set depends on the sequencing technology; sequencers from Oxford Nanopore Technologies (https://nanoporetech.com/) use a set of 94 different values (represented in FASTQ files as ASCII codes 33-126). This high precision makes quality score sequences hard to compress and, thus, most of the information stored in a losslessly compressed FASTQ file corresponds to quality scores (substantially more than to the genomic information itself). With growing concerns about the economics of storage space and bandwidth for the transmission of genomic data, this strategy is being questioned. For example, in the context of short-read sequencing technologies, Illumina has reduced the set of possible values for quality scores on its most modern equipment to just four different values (Ill, 2017) and suggests similarly quantizing the values produced by other Illumina devices (Ill, 2014). It has also been observed that it is possible to use lossy compression algorithms, in which the decompressed data may differ to some extent from the original, without generally affecting the results of commonly used pipelines (Cánovas et al., 2014; Ochoa et al., 2013, 2016). Nanopore sequencing, however, is a much more recent technology, and few data compressors specifically suitable for nanopore data are available, developed by our group (Dufort y Álvarez et al., 2020, 2021) and others (Kokot et al., 2022; Meng et al., 2021). Moreover, the lossy compression of quality scores for nanopore data has only been explored in Kokot et al. (2022), where the impact of quality score information loss is assessed for some downstream analyses. Specifically, Kokot et al. (2022) show that this information loss has little or no impact on the construction of consensus sequences with Racon (Vaser et al., 2017) for long CHM13 reads, either for HiFi or for nanopore data. The impact is also shown to be small for variant calling accuracy with HiFi reads, evaluated with DeepVariant (Poplin et al., 2018) on HG002 HiFi reads against the human reference genome GRCh38; variant calling for nanopore data is not evaluated.
Adding new evidence to, and reinforcing the conclusions of, Kokot et al. (2022), in this article we evaluate the impact of nanopore quality score quantization on various downstream applications, by comparing the results obtained with certain non-quantized nanopore data sets, referred to as original data, and with quantized versions of the same data sets. In particular, we selected data sets relevant for microbiology and human genetics, both areas of high interest and increasing use of this sequencing technology. The quantizers that we evaluate are mostly simple functions, where the set of possible quality scores is partitioned into a small set of disjoint intervals, each of which is mapped to a fixed value. For example, one such quantizer maps quality scores between 0 and 7 to the fixed score 5, and all quality scores larger than 7 to the fixed score 15. We also evaluate a slightly more complex scheme, which uses higher quantization resolution for quality scores associated with repetitive regions of a read. A precise definition of all evaluated quantizers is given in Section 2.1. We point out that these quantizers have not been optimized for any specific application; as it turns out, these definitions suffice to obtain a performance very close to that obtained with non-quantized data for all tested applications. A more thorough optimization of the quantizers is likely to be application-dependent and is deferred to future research.
To assess the effect of quality score quantization on typical bioinformatic downstream analyses, we focused on frequently used nanopore applications in which quality scores play a role. Long reads are regularly used in genome assembly, but most assemblers do not use quality scores (Kolmogorov et al., 2019; Koren et al., 2017; Ruan and Li, 2020). Structural variation identification (Cretu Stancu et al., 2017; Heller and Vingron, 2019; Sedlazeck et al., 2018; Tham et al., 2020) and variant calling (Edge and Bansal, 2019; Shafin et al., 2021) are also among the most frequent nanopore research applications, but generally only the latter makes use of quality scores. Given the high error rates of nanopore sequencing data, strategies involving polishing algorithms are almost mandatory, and these algorithms are sensitive to quality scores. Based on these points, we carried out assembly polishing for the ZymoBIOMICS Microbial Community Standard (Zymo Research Corporation, Irvine, CA, USA. Product D6300, Lot ZRC190633) nanopore reads using Racon (Vaser et al., 2017), both alone and combined with Medaka (https://github.com/nanoporetech/medaka), and MarginPolish (Shafin et al., 2020), both alone and combined with HELEN (Shafin et al., 2020). We also evaluated the quality of human genome assembly from nanopore reads polished with MarginPolish, both alone and combined with HELEN, for several coverage levels. For variant calling, we executed PEPPER-Margin-DeepVariant (Shafin et al., 2021) on sample HG003 under several coverage level scenarios. Details on the methods for evaluating assembly polishing and variant calling pipelines are presented in Sections 2.2 and 2.3, respectively.
As shown in Section 3, all the evaluated pipelines yield very similar performance on the original and quantized versions of each data set, indicating that the effect of the information loss caused by quality score quantization is not significant in practice. For the assembly polishing of a mock microbial community, setting all quality scores to the fixed value 10 results in a number of mismatches that is, on average over three independent runs, less than 1.2% higher than that obtained with the original data. A quantizer that uses four different values for quality scores results in a number of mismatches that is, on average, even smaller than that obtained with the non-quantized data. This same quantizer, applied to human data, results in a number of mismatches that is essentially equivalent to that obtained with the original data; even for a very low coverage (10% of the original reads), the increment in the number of mismatches is only about 0.7%. Moreover, we show that the quantization of quality scores yields very large improvements in compression performance even for traditional non-specialized compressors such as gzip (see Sections 2.4 and 3). For example, we report human variant calling results at 90× coverage where using 8 different values for quality scores yields precision and recall metrics for single-nucleotide polymorphisms (SNPs) that differ by less than 10⁻⁴ from those obtained with the original (non-quantized) data set, while saving more than 33% of the required storage space. Even using only two different values for quality scores, which saves almost 70% of the storage space, the precision differs by less than 10⁻⁴ and the recall by less than 10⁻³ from those obtained with the original data set.

Quantization of quality scores
We tested various quantizers, each mapping quality scores to values in a (small) set referred to as the quantization alphabet. We denote by Q_i a quantizer with a quantization alphabet of size i, i > 1, where the quantization Q_i(x) of a quality score x depends solely on x. The specific definitions of Q_2, Q_4 and Q_8 are presented in Table 1. All these quantizers collapse a large set of high quality scores into a single value, and define a finer partition for lower scores, which occur more frequently (Delahaye and Nicolas, 2021). Our experiments suggest that slight variations of the quantization level thresholds and the quantization alphabet in the definitions of these quantizers have little effect on their performance. For example, modifying Q_2 so that the values in {0, …, 6} are mapped to 1, and those in {7, …, 93} are mapped to 13, produces a variation of only 0.13% in the resulting F1 score for SNP variant calling at 30× coverage (see Section 2.3). We also tested constant quantizers, which map every quality score to a fixed prescribed value. We denote by F_z the quantizer that assigns the value F_z(x) = z to every quality score x. In our experiments, we take z = 10, which is a common threshold for filtering low-quality reads (see, e.g. Shafin et al., 2021).
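As an illustration, such interval quantizers can be implemented as simple threshold lookups over the score range 0-93. Since Table 1 is not reproduced here, the Q_4 and Q_8 thresholds below are hypothetical placeholders; only the two-level quantizer follows intervals stated explicitly in the text (scores 0-7 mapped to 5, higher scores to 15).

```python
def make_interval_quantizer(levels):
    """levels: list of (upper_bound, value) pairs, sorted by upper_bound.
    A score x maps to the value of the first interval with x <= upper_bound."""
    def quantize(x):
        for upper, value in levels:
            if x <= upper:
                return value
        raise ValueError(f"score {x} above last interval bound")
    return quantize

# Q_2 as described in the text: scores 0-7 -> 5, scores 8-93 -> 15.
Q2 = make_interval_quantizer([(7, 5), (93, 15)])

# Hypothetical finer partitions of the low-score range for Q_4 and Q_8
# (the true thresholds are given in Table 1 of the paper).
Q4 = make_interval_quantizer([(3, 2), (7, 5), (12, 10), (93, 20)])
Q8 = make_interval_quantizer([(1, 1), (3, 2), (5, 4), (7, 6),
                              (10, 9), (15, 12), (25, 20), (93, 30)])

# Constant quantizer F_z maps every score to the fixed value z.
def make_constant_quantizer(z):
    return lambda x: z

F10 = make_constant_quantizer(10)
```

A quantizer defined this way is applied independently to each quality score of a read, so the quantization alphabet size is simply the number of distinct output values.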
In addition, since repetitive patterns of bases are particularly difficult to sequence with nanopore technologies (Delahaye and Nicolas, 2021), we evaluate quantizers where the quantization of a quality score x at a position j of a read depends not only on x but also on the bases called at positions close to j. We say that a substring of the base call sequence of a read is a repetitive sequence if it is comprised of either h consecutive copies of the same symbol, i.e. a homopolymer of length h, or d consecutive copies of the same pair of symbols, i.e. a repeat of d dinucleotides. For two quantization functions f, f′, we denote by ⟨f, f′⟩_δ a quantizer that quantizes x to f(x) if the position j of the quality score x in a read is more than δ bases away from a repetitive sequence in this read, and to f′(x) otherwise. For example, the quantizer ⟨F_10, Q_8⟩_δ sets all quality scores away from repetitive regions to the fixed value 10, and applies Q_8 to quality scores that are within or close to repetitive regions. For the experiments reported in Section 3, the parameters h, d and δ are set to 5, 4 and 5, respectively. Using the surrounding bases to select an appropriate quantizer resembles previous work on quality score compression for short-read technologies, where the quantization granularity for all quality scores mapped to a given locus is selected based on the bases mapped at that position (Voges et al., 2018). As shown in Voges et al. (2018), and as we show in Section 3, using the information from the bases to select the quantization level has no negative effect on the subsequent downstream analyses, which further suggests that the bases provide insightful information to guide the selection of the distortion level for the quality values.
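A minimal sketch of this repetitive-region-aware scheme follows, under the definitions above (homopolymers of length at least h, at least d consecutive copies of a dinucleotide, and distance threshold δ). The function names and the scanning strategy are ours, not the released implementation.

```python
def repetitive_mask(bases, h=5, d=4, delta=5):
    """Boolean mask marking positions within delta bases of a repetitive
    sequence: a homopolymer of length >= h, or >= d consecutive copies
    of the same dinucleotide."""
    n = len(bases)
    rep = [False] * n
    # Homopolymers of length >= h.
    i = 0
    while i < n:
        j = i
        while j < n and bases[j] == bases[i]:
            j += 1
        if j - i >= h:
            rep[i:j] = [True] * (j - i)
        i = j
    # d consecutive copies of the same dinucleotide (e.g. ACACACAC for d = 4).
    for start in range(n - 2 * d + 1):
        pair = bases[start:start + 2]
        if pair[0] != pair[1] and bases[start:start + 2 * d] == pair * d:
            rep[start:start + 2 * d] = [True] * (2 * d)
    # Dilate the mask by delta positions on each side.
    near = [False] * n
    for i, flag in enumerate(rep):
        if flag:
            lo, hi = max(0, i - delta), min(n, i + delta + 1)
            near[lo:hi] = [True] * (hi - lo)
    return near

def mixed_quantizer(f, f_prime, bases, scores, h=5, d=4, delta=5):
    """<f, f'>_delta: quantize with f away from repetitive regions,
    and with f' within delta bases of one."""
    near = repetitive_mask(bases, h, d, delta)
    return [f_prime(x) if near[i] else f(x) for i, x in enumerate(scores)]
```

For example, with f mapping everything to 10 and f′ the identity, scores near the homopolymer in ACGTACGTAAAAAACGTACGT… are passed through unchanged while the remaining scores collapse to 10, mirroring the behavior described for ⟨F_10, Q_8⟩_δ.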

Assembly polishing
We evaluated the impact of quality score quantization on genome assembly polishing for the ZymoBIOMICS Microbial Community Standard (Zymo Research Corporation, Irvine, CA, USA. Product D6300, Lot ZRC190633). For the experiments, we used the data generated by the authors of Nicholls et al. (2019) on an Oxford Nanopore GridION, R10.3 pore (https://nanopore.s3.climb.ac.uk/mock/Zymo-GridION-EVEN-3Peaks-R103-merged.fq.gz). The organisms in this community are 8 bacteria, each present at 12%, and 2 yeasts, each present at 2%. We assembled the genomes with Flye (Kolmogorov et al., 2020) and kept the eight largest contigs, which correspond to the eight bacteria. We polished this raw assembly using various polishing pipelines:

• Racon: 2 iterations of Racon (Vaser et al., 2017).
• Medaka: Racon followed by Medaka (https://github.com/nanoporetech/medaka).
• MP: MarginPolish (Shafin et al., 2020).
• Helen: MarginPolish followed by HELEN (Shafin et al., 2020).
Both Racon and MarginPolish are polishers that make use of quality scores. Medaka and HELEN, on the other hand, are designed to refine the polished assemblies obtained with Racon and MarginPolish, respectively.
We tested each polishing pipeline on the original (non-quantized) data and on versions quantized with F_10, Q_2, Q_4 and Q_8. To account for randomness within the executed algorithms, we ran three executions of each combination of polisher and data set.
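Producing a quantized version of a data set amounts to rewriting the quality line of each FASTQ record, leaving the header, bases and separator untouched; as noted in the Introduction, nanopore quality scores are encoded as ASCII codes 33-126. The helper below is our own sketch of this transformation, not the released quantizer tool.

```python
def quantize_fastq_record(record_lines, quantize):
    """Rewrite the quality line of a 4-line FASTQ record.
    Quality scores are Phred values encoded as ASCII codes 33-126."""
    header, bases, plus, quals = record_lines
    new_quals = "".join(chr(quantize(ord(c) - 33) + 33) for c in quals)
    return [header, bases, plus, new_quals]

def quantize_fastq(in_path, out_path, quantize):
    """Stream a FASTQ file record by record, writing quantized qualities."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        record = []
        for line in fin:
            record.append(line.rstrip("\n"))
            if len(record) == 4:
                fout.write("\n".join(quantize_fastq_record(record, quantize)) + "\n")
                record = []
```

Since only the quality line changes, read identifiers and base calls are bit-identical between the original and quantized files, which is what allows the downstream pipelines to run unmodified.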
As a reference for the assessment of the results, we used a combination of scaffolds obtained by SPAdes (Bankevich et al., 2012) from Illumina reads as described in Nicholls et al. (2019) (http://nanopore.s3.climb.ac.uk/mockcommunity/v2/Zymo-Isolates-SPAdes-Illumina.fasta). We evaluated the quality of the final assembly for each run of a polishing pipeline on a data set using MetaQUAST (Mikheenko et al., 2015) to obtain the number of mismatches per 100 kilobase pairs (kbp) with respect to the combined reference.
To analyze the results of assembly polishing on quantized quality score data in a different setting, we followed some of the experiments reported in Shafin et al. (2020) for human genome assembly. Specifically, we polished the assembly generated by wtdbg2 (Ruan and Li, 2020) for one of the flow cells for sample HG00733 (1000 Genomes Project Consortium et al., 2015) with the polishing pipelines MP and Helen. These polishing pipelines were executed both on the original FASTQ files and on the same data quantized with Q_4, and we compared the number of mismatches per 100 kbp with respect to the reference GRCh38 using QUAST (Gurevich et al., 2013). We carried out this comparison for several coverage scenarios, which we obtained by randomly selecting a fraction (10%, 20%, 60% and 100%) of the data set reads. We repeated each execution three times to account for algorithm randomization.

Variant calling
We compared the nanopore variant calling performance of PEPPER-Margin-DeepVariant on sample HG003, as reported in Shafin et al. (2021), against variant calling on quantized versions of the same data. We performed this comparison at various coverages, ranging from 20× to 90×, and for various quantizers described in Section 2.1.
Following Shafin et al. (2021), we used the GIAB v4.2.1 truth set against the GRCh38 reference, and GIAB stratification v2.0 files with hap.py (http://github.com/illumina/hap.py) to derive stratified variant calling results. As a result, for each coverage and each quantized version of the data, we obtain a classification of the variants reported by PEPPER-Margin-DeepVariant into two categories, true-positives (TP) and false-positives (FP), and a set of true variants that PEPPER-Margin-DeepVariant failed to identify, i.e. false-negatives (FN). These variants are reported separately for SNPs and insertions/deletions (INDELs). From these reports, we summarize the performance of PEPPER-Margin-DeepVariant for each specific type of variant (SNP/INDEL), under various coverage and quantization settings, by the recall, precision and F1 score:

Rec = TP / (TP + FN),  Prec = TP / (TP + FP),  F1 = 2 · Prec · Rec / (Prec + Rec).
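From the TP, FP and FN counts produced by hap.py, these three standard metrics can be computed directly:

```python
def metrics(tp, fp, fn):
    """Recall, precision and F1 score from true-positive, false-positive
    and false-negative counts."""
    rec = tp / (tp + fn)            # fraction of true variants recovered
    prec = tp / (tp + fp)           # fraction of calls that are correct
    f1 = 2 * prec * rec / (prec + rec)  # harmonic mean of the two
    return rec, prec, f1
```

For example, 90 true-positives with 10 false-positives and 10 false-negatives give a recall, precision and F1 of 0.9 each.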

Compressibility
To evaluate the improvement in compressibility obtained by quantizing the quality scores of nanopore FASTQ files, we used sample HG003 (the same data set used for the variant calling evaluation), which consists of three FASTQ files that add up to approximately 520 GB. We compressed the original data set and quantized versions of this data set using the general purpose compressor gzip (https://www.gnu.org/software/gzip/). For each evaluated quantizer, we calculated the compression ratio (CR), defined as the quotient between the size of the compressed data set and the size of the original data set (smaller ratios correspond to better compression performance).


Results
Table 2 presents the number of mismatches per 100 kbp for the assembly of a mock microbial community with respect to a combined reference (see Section 2.2), for various combinations of assembly polishing algorithm and quality score quantizer; for each combination, the table shows the results of three independent runs and their average. The number of mismatches, averaged over three independent runs, is very similar for all tested combinations of polisher and quantizer, with differences that are generally smaller than those between different runs of the same polisher and quantizer, which arise from algorithm randomization in the polishing pipeline. Comparing the average number of mismatches obtained with a given polisher on quantized data and on the original data, we observe that the maximum difference occurs for Medaka with the coarsest quantizer, F_10. Even in this case, the difference represents only 1.2% of the average number of mismatches obtained with the original data. Notably, the results for the quantizer Q_4 are even better than those obtained with the original data for all the evaluated polishing pipelines.

Table 3 compares the number of mismatches per 100 kbp for a human genome assembly obtained with the original (non-quantized) data and with the data obtained by applying Q_4 to the quality scores. The comparison is performed for the polishers MP and Helen on the full data set (100%), and on downsampled versions of the data consisting of 10%, 20% and 60% of the reads in the full data set (see Section 2.2); for each combination of polisher, quantizer and coverage, the table shows the results of three independent runs and their average. For a 20% or larger fraction of the data, the results obtained after quantization are essentially equivalent to those obtained with the original data. Even for a very small fraction of the data (10%), the percent increment in the number of mismatches incurred by quantization, averaged over three independent runs, is 0.01% for MP and 0.7% for Helen.

Tables 4 and 5 show the counts of true-positives, false-positives and false-negatives for SNPs and INDELs, respectively, obtained with PEPPER-Margin-DeepVariant for various quantization schemes and coverage levels. The tables also show the metrics Rec, Prec and F1 calculated from these counts. The complement to 1 of the F1 score, i.e. the difference 1 − F1, is shown for the same quantization schemes and coverage levels in Figures 1 and 2 for SNPs, and in Figure 3 for INDELs.
For SNPs, Table 4 shows high recall and precision values, especially for large coverage levels. In all cases, the difference between the F1 score obtained with the original data and that obtained with a quantized version of the same data is rather small. The largest difference for coverage 30× and above is smaller than 10⁻³ (for F_10 on 90× data). For each of these coverage levels, the performance generally improves slightly with the quantization precision, and the performance for ⟨Q_2, Q_8⟩_δ lies between that of Q_2 and Q_4 (see Fig. 1). For the shallowest coverage, 20×, the performance is noticeably worse (see Fig. 2 and note the different scale compared to Fig. 1). In this case, the F1 score does not always improve with the quantization precision. Following the performance comparison criteria in Shafin et al. (2021), we notice that PEPPER-Margin-DeepVariant on 90× data still outperforms DeepVariant on Illumina short reads (35× Illumina NovaSeq, F1: 0.9963) for all quantizers except F_10. Another interesting reference for comparison is the variation in performance obtained by using different variant calling pipelines. For example, using Longshot (Edge and Bansal, 2019) on the 30× original data we obtain a precision of 0.9648 and a recall of 0.9743, which determine an F1 score of 0.9695 for SNPs. Since Longshot ignores quality scores, the results after quality score quantization are exactly the same. Notice that these metric values fall below those reported in Table 4 for every quantizer on 30× data, including F_10, which loses all quality score information from the original data. Thus, in this example, the choice of a variant calling pipeline causes larger performance variations than the quantization of quality scores.
For INDELs, in agreement with Shafin et al. (2021), the recall and precision values are significantly worse than those obtained for SNPs. The difference between the performance obtained with non-quantized and quantized data is still small, although more noticeable than the difference obtained for SNPs. For INDELs, the largest difference in F1 score occurs for F_10 on 90× data, with a difference approximately equal to 0.002. For Q_4, the maximum difference drops to 3.6 × 10⁻³, and for Q_8 it is 4.7 × 10⁻⁴. The variant calling performance for the data quantized with ⟨Q_2, Q_8⟩_δ lies, in general, between those obtained with Q_2 and Q_4.

Table 6 shows the CR obtained with gzip for various quantizers. The table also shows the space saving obtained by quality score quantization, calculated as the difference in size between the compressed original data set and the compressed quantized data set, expressed as a percentage of the size of the compressed original data set. For example, the table shows that gzip compresses the original data set to roughly half its original size (the CR is 0.49), and that Q_4 improves the CR to 0.25, yielding a compressed data set that is roughly half the size of the compressed original data set (a saving of 48.52%). For the coarsest quantizer, F_10, the difference in compressed size is almost 70% of the original compressed data set, which for the evaluated data set amounts to more than 175 GB of saved space. For easy assessment of the relation between CR and variant calling performance, Table 6 also shows, for each quantizer, the F1 score obtained with quantized 90× coverage data. The tradeoff between storage space and variant calling performance is also illustrated in Figure 4. For each quantizer and coverage level ℓ (excluding 20×), we multiply the gzip CR for that quantizer by the coverage level ratio, ℓ/90. This value represents the fraction of storage space required by the data with coverage level ℓ, quantized and compressed with gzip, with respect to the full (uncompressed, 90×) data set. These fractions of storage space are plotted in the figure against the F1 score obtained for SNPs in each case. Notice, for example, that higher coverage data quantized with Q_4 yields better performance than lower coverage non-quantized data (except at 90× coverage), while actually requiring less storage space.
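The quantity plotted in Figure 4 can be computed as sketched below; the CR values 0.49 (original data) and 0.25 (Q_4) are the ones stated in the text, while the remaining entries of Table 6 are not reproduced here.

```python
def storage_fraction(cr, coverage, full_coverage=90):
    """Fraction of the storage of the full (uncompressed, 90x) data set
    required by data at the given coverage level, after quantization and
    gzip compression with compression ratio cr."""
    return cr * coverage / full_coverage

# CR values from the text: 0.49 for the original data, 0.25 after Q_4.
# 60x data quantized with Q_4 needs about 1/6 of the full data set's space,
# versus 0.49 for the 90x original data compressed with gzip alone.
frac_q4_60x = storage_fraction(0.25, 60)
frac_orig_90x = storage_fraction(0.49, 90)
```

This makes the tradeoff in Figure 4 concrete: quantized higher-coverage data can occupy less space than non-quantized data at a lower coverage.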

Conclusions
Our experiments on various usage scenarios for nanopore sequencing data, including different applications and coverage levels, show that the precision that is currently used for quality scores may be unnecessarily high. A quantizer like Q 4 , which uses only four different values for quality scores, yields results similar to those obtained with the original data in all tested scenarios. We point out that all these results were obtained with applications as they are provided, with no special tuning or training for quantized quality scores, nor with any quantizer optimization. Although such specific application tuning may improve the performance of these applications (for   1. Plot of 1 À F1 score for the SNPs obtained with PEPPER-Margin-DeepVariant for the original (non-quantized) data and for the data obtained by applying various quality score quantizers, for coverages 90Â; 50Â; 40Â and 30Â example through neural network retraining), the fact is that excellent results are obtained with no software adjustment and no quantizer optimization. The quantization of quality scores results in large storage space savings, even using a general purpose compressor, such as gzip. Fig. 2. Plot of 1 À F1 score for the SNPs obtained with PEPPER-Margin-DeepVariant for the original (non-quantized) data and for the data obtained by applying various quality score quantizers, for coverage 20Â. As a reference, the dotted line marks the value 1 À F1 for the original data with coverage 30Â Fig. 3. Plot of 1 À F1 score for the INDELs obtained with PEPPER-Margin-DeepVariant under various coverage scenarios, for the original (non-quantized) data and for the data obtained by applying various quality score quantizers Note: The third column presents the percent space saving of the gzip compression of each quantized version compared to the original data compressed with gzip. The right-most column shows, for each quantization level, the F1 score obtained with coverage level 90Â.