KmerKeys: a web resource for searching indexed genome assemblies and variants

Abstract K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences has challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information is enormous, particularly when involving complete genome assemblies. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.

• A pair of integers U and V such that U * V = 1 (mod A k ) and U is the nearest integer to A k / ϕ with no prime factors in common with A k , where ϕ is the golden ratio: multiplication by U (resp. V) hashes (resp. unhashes) k-mer keys.
• Integer L = ⌈A k / (N -H)⌉, where ⌈. ⌉ denotes the ceiling function: L is the number of distinct k-mers with the same "home address" (defined below).
• Integer B ≥ ⌈log 2 (H * L + 1)⌉: the number of bits per slot in the hash table. In our implementation we round B up to a multiple of 8, so that a slot is some integer number of bytes.
The N slots of hash Let x be a k-mer and let underlines denote hashed quantities, e.g. x = U * x (mod A k ) and x = V * x (mod A k ).
The slot at address q is either "empty" or corresponds to a hashed k-mer x q (and unhashed kmer x q ). Slot q is empty if T[q] = 0. If slot q is not empty (if T[q] ≠ 0) , then the corresponding stored (hashed) k-mer is computed as: The "home address" of (hashed) k-mer x is the unique non-negative integer q satisfying the equation: x = q * L + r with r in {0,1,…,L-1}. That is, q is the quotient and r the remainder upon division of x by L.
Matching up the previous two equations, we see that if k-mer x is stored in its home slot q, then the value at address q is T[q] = L -r in the set {1,2,…,L}.

Inserting and looking up keys
Inputs: hash table T, k-mer x Outputs: 1) address at which x is in T, 2) a Boolean signal denoting whether x is in T (output 1 is undefined if this signal is false), 3) a signal denoting whether overflow occurred (i.e. too many hash collisions occurred) def locatekey(T, x):

K-mer based representation of variants
As an example, consider a particular variant: chr1:17329 C/CAT --an insertion of the novel sequence AT following the 'C' at position chr1:17329 in GRCh38. Then the reference sequence in GRCh38, lacking the variant, is: ...CCTTTGTTACGCACCAGCC...
And a "patched" version instead containing the alternate allele is: where the inserted sequence is underlined. This set of novel k-mers overlaps any part of the variant allele, as shown in Figure 2B. Variant metadata can also include the variant coordinates in the reference genome as well as adjacent k-mers that are not affected by the variant, labeled "left_flank" and "right_flank" in Figure 2B. These extra metadata features make it straightforward to reconstruct a VCFstyle representation of a variant from its graphical representation, and to map the variant to new assemblies. By associating k-mers with variants, we enable fuzzy sequence-based search of a collection of variants. This function has some novel properties. For example, with a VCF file one can query a set of variants based on genome coordinates. However, it is less easy to do so based on the actual sequence containing the variation. Using a conventional approach, sequence-based variant querying requires one to first align the sequence to a reference genome and then use some other method to describe the set of variations. In contrast, our method permits associating short DNA sequences directly with k-mers and their metadata. A fuzzy search is achieved by querying all sequences within a given Hamming radius of a query sequence.
Fuzzy searching of k-mer-represented variants also enables robustness to the case of ambiguous variant definitions. For example, a substitution "AC/GG" may be represented as a pair of one-base-pair substitutions or a single two-base-pair substitution. By returning all approximately matching variants within some tolerance one can identify a maximum number of mismatches per k-mer.

Supplementary Figures
Supplementary Figure 1. The summary output of from the query of the coordinates, chr17:7671806-