Compact and evenly distributed k-mer binning for genomic sequences

Abstract

Motivation: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers, where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.

Results: We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.

Availability and implementation: Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material.

Supplementary information: Supplementary data are available at Bioinformatics online.


Bin statistics generated by various minimizer orderings
Minimizer orderings
• Random: A random ordering of all m-mers.
• Frequency: A sampled (1%) frequency ordering of all m-mers, from rare to common.
• Signature: The minimizer signature ordering, as implemented by KMC2. Gives lower priority to minimizers starting with AAA or ACA, or containing AA anywhere except at the start.
• Universal lexicographic: A lexicographic ordering of a compact universal hitting set.
• Universal random: A random ordering of a compact universal hitting set.
• Universal frequency: A sampled (1%) frequency ordering of a compact universal hitting set, from rare to common.
In all of the orderings, any ties between items of equal priority were resolved lexicographically.
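To make the orderings above concrete, the following Python sketch (an illustration, not Discount's actual implementation) selects the minimizer of a k-mer under the KMC2-style signature rule described for the Signature ordering, with lexicographic tie-breaking between equal-priority m-mers:

```python
# Illustrative sketch, not Discount's implementation: minimizer
# selection under the KMC2-style signature rule described above,
# with lexicographic tie-breaking.

def signature_allowed(mmer: str) -> bool:
    """High priority unless the m-mer starts with AAA or ACA,
    or contains AA anywhere except at the very start."""
    if mmer.startswith("AAA") or mmer.startswith("ACA"):
        return False
    return "AA" not in mmer[1:]

def minimizer(kmer: str, m: int) -> str:
    """Choose the highest-priority m-mer in the k-mer: allowed
    m-mers first, lexicographic order as the tie-breaker."""
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(mmers, key=lambda s: (not signature_allowed(s), s))

print(minimizer("ACGTACGTAC", 3))  # ACG: the smallest allowed 3-mer
```

The other orderings differ only in the priority key: a random ordering replaces the lexicographic comparison with a random permutation of the m-mers, and the universal orderings restrict high priority to members of the universal hitting set.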

Universal sets
To generate universal set orderings, compact universal hitting sets generated by the PASHA algorithm were used. These sets have the following sizes.

Bin statistics (cow rumen)
The following measurements were obtained using the first 100,000,000 reads from the cow rumen dataset (SRA accession SRR094926). This dataset has the following properties.
k    Total k-mers      Distinct k-mers
28   7,389,666,230     5,090,549,289
55   4,691,382,750     3,728,398,897

Throughout this document, bin sizes are measured as the total number of k-mers, including duplicates. Super-mer lengths are measured as the number of overlapping k-mers; for example, a super-mer of length 5 when k = 28 would be a sequence of length 32. Top 0.5% gives the total size of the largest 0.5% of the bins.
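The super-mer bookkeeping can be illustrated with a small sketch. This is a hedged example under a plain lexicographic m-mer ordering, not Discount's implementation: a super-mer is a maximal run of consecutive k-mers that share the same minimizer, and n overlapping k-mers span n + k - 1 bases (so a super-mer of length 5 with k = 28 is a 32-base sequence):

```python
# Hedged sketch (lexicographic ordering for simplicity): split a read
# into super-mers, i.e. maximal runs of consecutive k-mers sharing
# the same minimizer. Assumes len(read) >= k.

def minimizer(kmer: str, m: int) -> str:
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def supermers(read: str, k: int, m: int) -> list[str]:
    out, start = [], 0
    prev = minimizer(read[0:k], m)
    for i in range(1, len(read) - k + 1):
        cur = minimizer(read[i:i + k], m)
        if cur != prev:
            out.append(read[start:i + k - 1])
            start, prev = i, cur
    out.append(read[start:])
    return out

# A super-mer of n k-mers spans n + k - 1 bases: a super-mer of
# length 5 with k = 28 is 32 bases long, as in the text above.
print(supermers("ACGTACGT", 4, 2))  # ['ACGT', 'CGTA', 'GTACGT']
```

The total number of k-mers across the super-mers of a read always equals the number of k-mers in the read itself; only duplicated sequence at the run boundaries is added.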
We also reproduce the values for m = 10 here for convenience.

Bin statistics (marine metagenome)
The following measurements were obtained using the first 100,000,000 reads from the marine metagenome dataset (SRA accession ERR599052). This dataset has the following properties.

Density plots of bin distributions
A kernel density estimate, using Gaussian kernels, has been used to generate these density plots from bins generated from the two 100,000,000 read datasets above. Bin sizes are measured as the total number of k-mers, including duplicates.
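For reference, the estimator behind these plots can be sketched in a few lines of NumPy. This is a generic Gaussian KDE with a fixed bandwidth, not the exact plotting code used (scipy.stats.gaussian_kde provides the same with automatic bandwidth selection); the bin-size values below are made up for illustration:

```python
import numpy as np

def gaussian_kde(samples: np.ndarray, xs: np.ndarray, bandwidth: float) -> np.ndarray:
    """Kernel density estimate with Gaussian kernels: the average of a
    normal density of width `bandwidth` centred on each sample,
    evaluated at the grid points xs."""
    z = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Illustrative bin sizes (total k-mers per bin; values made up):
sizes = np.array([1200.0, 1500.0, 900.0, 30000.0, 1100.0, 45000.0])
xs = np.linspace(0.0, 60000.0, 500)
density = gaussian_kde(sizes, xs, bandwidth=2000.0)
# Plotting xs against density (e.g. with matplotlib) yields a curve
# of the kind shown in the figures.
```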
For convenience, we also reproduce the plots for k = 28 here, although they have already been given in the main paper.

Commands used to invoke KMC3 and Jellyfish
Here we give the commands used when comparing the traditional k-mer counters KMC3 and Jellyfish with Discount. A single machine with 64 CPUs, a single 4 TB HDD (standard persistent disk), and 240 GB RAM was used. The machine was from the Google Cloud N1 series, with Intel Xeon CPUs running at 2.7–3.2 GHz (all-core turbo frequency).
The memory limit passed to the tools was 220 GB, to allow the rest to be used as disk cache by the operating system. Parameters were tuned to optimise speed while producing outputs as similar as possible to those from Discount. Inputs were uncompressed.
Discount wrote the k-mer count output table as 4,000 separate files on the Google Cloud distributed filesystem (equal to the number of partitions used in Spark). When generating equivalent data, Jellyfish and KMC3 output a single large file.

Jellyfish
Jellyfish version 2.2.10 was used. The following command was used to determine the initial hash size:

jellyfish mem -m 28 --mem=$((220 * 1024 * 1024 * 1024))

The number obtained (34359738368) was passed to the -s argument in the following command.
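As a sanity check (our own arithmetic, not from the paper): the shell arithmetic in the --mem argument above expands to a byte count, and the suggested hash size 34359738368 is exactly 2^35, consistent with Jellyfish working with power-of-two hash sizes:

```python
# Sanity-check arithmetic for the jellyfish mem command above
# (our own calculation, not part of the original experiment).
mem_bytes = 220 * 1024 * 1024 * 1024   # value of $((220 * 1024 * 1024 * 1024))
print(mem_bytes)                       # 236223201280 bytes, i.e. 220 GiB
print(34359738368 == 2 ** 35)          # True: the suggested -s hash size is 2^35
```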