An improved filtering algorithm for big read datasets and its application to single-cell assembly

Background For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. Methods We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. Results We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove in terms of median 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SDAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. Conclusions We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only for single cell assembly, as shown in this paper, but also to other projects with high mean coverage datasets like metagenomic sequencing projects. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at https://git.informatik.uni-kiel.de/axw/Bignorm. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1724-7) contains supplementary material, which is available to authorized users.

The input to the filter algorithm is a dataset D = (n, m, R, Q) where for each i ∈ [n] we have: • m(i) ∈ N: a flag for an unpaired (m(i) = 1) or paired (m(i) = 2) dataset; • R(i, s) ∈ Σ * for each s ∈ [m(i)]: the set of reads in the dataset; • Q(i, s) ∈ Z |R(i,s)| for each s ∈ [m(i)]: the set of corresponding phred scores.
Each read i ∈ [n] consists of m(i) read strings R(i, 1), . . . , R(i, m(i)). For t ∈ [|R(i, s)|], we denote the nucleotide at position t in read string R(i, s) by R t (i, s) and its phred score by Q t (i, s). Note that in terms of read strings, D may contain the "same" read multiple times (perhaps with different quality values), that is, there can be i = j such that R(i) = R(j). Hence it is beneficial that we refer to reads by their indices 1, . . . , n.
Denote by x ∈ Σ * the genome from which the reads were obtained and g := |x| its length. (For the purpose of this exposition, we simplify by assuming the genome is a single string.) For each locus ∈ [g], the coverage c (D) of with respect to D is informally described as the number of read strings that were or could have been produced by the sequencing machine while reading a part of x that contains locus . More precisely, for each v ∈ Σ * define • c (v) := 1 if there is a substring w of x which contains locus and satisfies v ∼ = w or v ∼ =w; • c (v) := 0 otherwise.
Then, we define: As result of our algorithm a sub-dataset D = (n , m , R , Q ) of D should be determined such that n is much smaller than n without losing essential information. The goal is to produce an assembly of similar quality based on D instead of D. We only consider the natural approach to create D by making a choice for each i ∈ [n] whether to include read i in D or not, so in particular (R (1), . . . , R (n )) will be a sub-vector of (R(1), . . . , R(n)). When we include a read in D , we also say that it is accepted, whereas when we exclude it, we call it rejected. On an abstract level, a filtered dataset based on D can be specified by giving a set of indices A ⊆ [n] that consists of the accepted reads.
Many popular assemblers, such as SPAdes [1], Platanus [2], or Allpaths- LG [3], work with the de Bruijn graph that is based on k-mers. Fix a parameter k ∈ N; typically k ≥ 21 The set of k-mers of a string v ∈ Σ * , denoted M (v, k) ⊆ Σ k , is the set of all strings of length k that are substrings of v. Partly we need to consider a k-mer multiple times if it occurs in multiple places in the string, and the corresponding set is denoted:

S1.2 Formal description of CMS
Bignorm, like Diginorm, is based on the count-min sketch (CMS) for counting k-mers. CMS is a probabilistic data structure for counting objects from a large universe. We give a brief and abstract description.
Let a = (a 1 , . . . , a N ) ∈ N N be a vector, given implicitly as a sequence of updates of the form (p, ∆) with p ∈ [N ] and ∆ ∈ N. Each update (p, ∆) modifies a in the way a p := a p + ∆; where initially a = (0, . . . , 0). If ∆ = 1 in each update, then an interpretation of the vector a is that we count how many times we observe each of the objects identified by the numbers in [N ]. If N is large, e.g. if N is the number 4 k of all possible k-mers (we do not count k-mers with N symbols), then we may not be able to store a in RAM. (For example, the typical choice of k = 21 brings a into terabyte range; in our experiments we use k = 32.) Instead we fix two parameters: the width m ∈ N and the depth t ∈ N and store a matrix of m · t CMS counters c p,q with p ∈ [m] and q ∈ [t]. Moreover, we randomly draw t hash functions h 1 , . . . , h t from a universal family. Each h q maps from [N ] to [m]. Initially, all counters in the matrix are zero. Upon arrival of an update (p, ∆), for each row q ∈ [t] we update c hq(p),q := c hq(p),q + ∆. That is, for each row q, we use the hash function h q to map from the larger space [N ] (from which the index p comes) to the smaller space [m] of possible positions in the row. Denote Then, it can be proved [4] that a p is an estimate of a p in the following sense: clearly a p ≤ a p , and with probability at least 1 − e 1−t we have a p ≤ e m−1 N j=1 a j . The probability is over the choice of hash functions. For example, choosing t := 10 is enough to push the error probability, upper-bounded by e 1−t , below 0.013%.
In our application, N = 4 k is the number of possible k-mers (without N symbols) and we implement a bijection β : Σ k −→ [N ], so we can identify each k-mer µ by a number β(µ) ∈ [N ]. Upon accepting some read i, we update the CMS counters using all the updates of the form (β(µ), 1) with µ ∈ M (i, k) not containing the N symbol, that is, for each such µ we increase the count β(µ) by ∆ = 1. Then when all the reads 1, . . . , i − 1 have been processed, the required count c(µ, i) corresponds to the entry a β(µ) in the vector a used in the description of CMS, and for the estimate c(µ, i), we can use the estimate a β(µ) as given in (1).

S1.3 Formal description of algorithm
We give a detailed description of our enhancements (i) to (iv) that were briefly lined out on page ??. Although most of the settings are generic, in some places we assume that data comes from the Illumina.
We start with (i), (ii), and (iii). Fix a read i ∈ [n] and a read string s ∈ [m(i)]. Recall that for each t ∈ [ |R(i, s)| ] the nucleotide R t (i, s) at position t in the read string R(i, s) is associated with a quality value Q t (i, s) known as phred score. We want to assign a single value Q(i, s, µ, p) to each (µ, p) ∈ M * (i, s, k). We do so by taking the minimum phred score over the nucleotides in µ when aligned at position p, that is: (µ occurs on the right-hand side only implicitely through its length k).
Fix the following parameters: • N-count threshold N 0 ∈ N, which is 10 by default; • quality threshold Q 0 ∈ Z, which is 20 by default; • rarity threshold c 0 ∈ N, which is 3 by default; • abundance threshold c 1 ∈ N, which is 20 by default; • contribution threshold B ∈ N, which is 3 by default.
When our algorithm has to decide whether to accept or reject a read i ∈ [n], it performs the following steps. If the number of N symbols counted over all m(i) read strings in i is larger than N 0 , the read is rejected. Otherwise, for each s ∈ [m(i)] define the set of high-quality k-mers: and (µ does not contain N) .
We determine the contribution of R(i, s) to k-mers of different fre- b 0 (s) > k for at least one read string s, If the read is accepted, then for each µ ∈ M (i, k) the corresponding CMS counter is incremented, provided that µ does not contain the N symbol.
Afterwards processing of the next read starts. The motivation for condition (2) is as follows. According to [5], most errors of the Illumina platform are single substitution errors and the probability of appearance of an erroneous k-mer in the genome, caused by an incorrect reading of a nucleotide, is quite low. Thus, k-mers produced by single substitution errors are likely to have very small counter values in the CMS (less than c 0 times) and can be considered as rare k-mers. One such error can only effect at most k k-mers. So if we count more than k rare k-mers, they must be the result of factors other than one single substitution error. If we assume that the probability of multiple single substitution errors in a read is smaller than the probability of error-free rare k-mers, we should accept this read.
Condition (3) says that in the read i, there are enough (namely at least B) k-mers where each of them is too frequent to be a read error (CMS counters at least c 0 ) but not that abundant that it should be considered redundant (CMS counters less than c 1 ).
This concludes the description of (i), (ii), and (iii), particularly how we analyze the counts in C(i, s) = ( c(µ, i)) µ∈M (i,s,k) for each read i and s ∈ [m(i)], how we incorporate quality information, and how we handle the N symbol.
Finally, to accomplish (iv), we wrote a multi-threaded implementation completely in the C programming language. The parallel code uses OpenMP. For comparison, the implementation of the original Diginorm algorithm (included in the khmer-package [6]) features a singlethreaded design and is written in Python and C ++ ; strings have to be converted between Python and C ++ at least twice.  • the dataset filtered with Bignorm, Q 0 set to 20 for each of the 13 test datasets. For comprehensibility, we used the measurements for SPAdes on the unfiltered dataset as a reference. We present box plot diagrams for the measures largest contig, N50, genome fraction, total length, and assembler runtime (Walltime), grouped by assembler and by normalization used.

S2 Comparison of different assemblers
For largest contig and N50, IDBA_UD does not benefit from normalization while Velvet-SC clearly does. Both assemblers however, appear to be inferior to SPAdes (regardless of normalization). The results of Velvet-SC on normalized data are comparable to IDBA_UD (see Supplement Figure 1 and 2).
For genome fraction, both IDBA_UD and Velvet-SC profit from normalization, but Velvet-SC always performs worse than IDBA_UD, whose results are bested by SPAdes (see Supplement Figure 3).
For total length of the assembly, IDBA_UD (like SPAdes) does profit from normalization with Diginorm, but using Bignorm led to a deterioration. Velvet-SC benefits from both normalization algorithms, but its results are still inferior to IDBA_UD and SPAdes (see Supplement Figure 4).
For the assembler runtime (Walltime), all assemblers get substantially faster when using normalization (see Supplement Figure 5. The diagram shows that SPAdes on a dataset normalized with Bignorm is always faster that IDBA_UD on the same, unfiltered dataset. Remarks: • the y-axis is logarithmised for an informative plot • the runtime given for Velvet-SC is the sum of the runtimes of velveth and velvetg See Table S1 for the median speed-up by assembler and normalization algorithm.

S3 Total length of Assemblies
The distribution of the total length of the assemblies done with SPAdes and Q 0 set to 20 are shown in Supplement Figure 6. For the plot, all measures were divided by the length of the related reference genome as provided by JGI / UCD.

S4 Quality plots of datasets
As Bignorm uses the phred score in its decision function, it is not surprising that low quality datasets give low quality normalizations. Especially the decline rate of the phred score over the read position, as shown in Supplement Figure 7, turned out to have a big impact on the result of the normalization.

Caldi
Aceto ASZN2 Alphaproteo E. coli Arma Supplement Figure 7: Quality plots of selected datasets, generated using the FastQC tool and ordered by decreasing quality