Recursive n-gram hashing is pairwise independent, at best
Introduction
An n-gram is a consecutive sequence of n symbols from an alphabet $\Sigma$. An n-gram hash function h maps n-grams to numbers in $[0, 2^L)$. These functions have several applications, from full-text matching (Cohen, 1998a, Cohen, 1998b, Cohen, 1999), pattern matching (Tan et al., 2006), and language models (Cardenal-Lopez et al., 2002, Zhang and Zhao, 2002, Schwenk, 2007, Li and Zhao, 2007, Talbot and Osborne, 2007a, Talbot and Osborne, 2007b, Talbot and Brants, 2008) to plagiarism detection (Ribler and Abrams, 2000).
To prove that a hashing algorithm must work well, we typically need hash values to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a single integer would not be useful. Yet, a single hash function is deterministic: it maps an n-gram to a single hash value. Thus, we may be able to choose the input data so that the hash values are biased. Therefore, we randomly pick a function from a family of functions (Carter and Wegman, 1979).
Such a family $\mathcal{H}$ is uniform (over L bits) if all hash values are equiprobable. That is, considering h selected uniformly at random from $\mathcal{H}$, we have $P(h(x) = y) = 1/2^L$ for all n-grams x and all hash values y. This condition is weak; the family of constant functions is uniform.
Intuitively, we would want that if an adversary knows the hash value of one n-gram, it cannot deduce anything about the hash value of another n-gram. For example, with the family of constant functions, once we know one hash value, we know them all. The family $\mathcal{H}$ is pairwise independent if the hash value of n-gram $x_1$ is independent from the hash value of any other n-gram $x_2$. That is, we have $P(h(x_1) = y \land h(x_2) = z) = P(h(x_1) = y)\,P(h(x_2) = z) = 1/4^L$ for all n-grams $x_1$, $x_2$ and all hash values y, z with $x_1 \neq x_2$. Pairwise independence implies uniformity. We refer to a particular hash function as “uniform” or “a pairwise independent hash function” when the family in question can be inferred from the context.
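To make the two definitions concrete, the following minimal sketch checks them by brute force on tiny families. The helper names (`is_uniform`, `is_pairwise_independent`) and the representation of a hash function as a dictionary are assumptions made for illustration; they are not part of the paper.

```python
from fractions import Fraction
from itertools import product


def is_uniform(family, keys, L):
    """Check that P(h(x) = y) = 1/2^L for every key x and every value y."""
    target = Fraction(1, 2 ** L)
    for x in keys:
        for y in range(2 ** L):
            hits = sum(1 for h in family if h[x] == y)
            if Fraction(hits, len(family)) != target:
                return False
    return True


def is_pairwise_independent(family, keys, L):
    """Check that P(h(x1) = y and h(x2) = z) = 1/4^L for all distinct x1, x2."""
    target = Fraction(1, 4 ** L)
    for x1, x2 in product(keys, repeat=2):
        if x1 == x2:
            continue
        for y, z in product(range(2 ** L), repeat=2):
            hits = sum(1 for h in family if h[x1] == y and h[x2] == z)
            if Fraction(hits, len(family)) != target:
                return False
    return True


L = 2
keys = ["aa", "ab", "ba"]  # three toy 2-grams

# Family of constant functions: h_c(x) = c for every x.
constant_family = [{x: c for x in keys} for c in range(2 ** L)]

# Family of all functions from keys to [0, 2^L): fully independent.
full_family = [dict(zip(keys, values))
               for values in product(range(2 ** L), repeat=len(keys))]

print(is_uniform(constant_family, keys, L))               # constant family is uniform
print(is_pairwise_independent(constant_family, keys, L))  # but not pairwise independent
print(is_pairwise_independent(full_family, keys, L))
```

The exhaustive check is only practical for toy sizes, but it shows why uniformity alone is a weak guarantee.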
Moreover, the idea of pairwise independence can be generalized: a family of hash functions $\mathcal{H}$ is k-wise independent if given distinct $x_1, \ldots, x_k$ and given h selected uniformly at random from $\mathcal{H}$, then $P(h(x_1) = y_1 \land \cdots \land h(x_k) = y_k) = 1/2^{kL}$. Note that k-wise independence implies $(k-1)$-wise independence and uniformity. (Fully) independent families are k-wise independent for arbitrarily large k. For applications, non-independent families may fare as well as fully independent families if the entropy of the data source is sufficiently high (Mitzenmacher and Vadhan, 2008).
A hash function h is recursive (Cohen, 1997), or rolling (Schleimer et al., 2003), if there is a function F computing the hash value of the n-gram $x_2 \ldots x_{n+1}$ from the hash value of the preceding n-gram ($x_1 \ldots x_n$) and the values of $x_1$ and $x_{n+1}$. That is, we have
$$h(x_2, \ldots, x_{n+1}) = F(h(x_1, \ldots, x_n), x_1, x_{n+1}).$$
Ideally, we could compute function F in time $O(L)$ and not, for example, in time $O(Ln)$.
The main contributions of this paper are:
- a proof that recursive hashing is no more than pairwise independent (Section 3);
- a proof that randomized Karp–Rabin can be uniform but never pairwise independent (Section 5);
- a proof that hashing by irreducible polynomials is pairwise independent (Section 7);
- a proof that hashing by cyclic polynomials is not even uniform (Section 9);
- a proof that hashing by cyclic polynomials is pairwise independent after ignoring $n-1$ consecutive bits (Section 10).
We conclude with an experimental section where we show that hashing by cyclic polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithms presented.
Trailing-zero independence
Some randomized algorithms (Flajolet and Martin, 1985, Gibbons and Tirthapura, 2001) merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute the maximal number of leading zeroes k among hash values (Durand and Flajolet, 2003). Naïvely, we may estimate that if a hash value with k leading zeroes is found, we have $\approx 2^k$ distinct n-grams. Such estimates might be
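The naive estimate described above can be sketched in a few lines. This is only an illustration of the $2^k$ rule of thumb, not the LogLog algorithm of Durand and Flajolet; the function names are hypothetical, and `stand_in_hash` uses a truncated BLAKE2 digest purely as a placeholder for an L-bit n-gram hash.

```python
import hashlib


def leading_zeros(value, L):
    """Number of leading zero bits of value, viewed as an L-bit integer."""
    return L - value.bit_length()


def naive_distinct_estimate(hash_values, L):
    """Return 2^k, where k is the largest leading-zero count observed."""
    k = max(leading_zeros(v, L) for v in hash_values)
    return 2 ** k


def stand_in_hash(ngram, L):
    """Placeholder L-bit hash (L <= 64); an assumption, not a hash from the paper."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - L)


text = "to be or not to be that is the question"
n, L = 3, 16
ngrams = {text[i:i + n] for i in range(len(text) - n + 1)}
estimate = naive_distinct_estimate([stand_in_hash(g, L) for g in ngrams], L)
print(len(ngrams), estimate)
```

A single maximum is a high-variance estimator, which is why practical schemes average over many buckets.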
Recursive hash functions are no more than pairwise independent
Not only are recursive hash functions limited to pairwise independence: they cannot be 3-wise trailing-zero independent.
Proposition 1. There is no 3-wise trailing-zero independent hashing function that is recursive.
Proof. Consider the $(n+2)$-gram $a^n b b$. Suppose h is recursive and 3-wise trailing-zero independent, then
A non-recursive 3-wise independent hash function
A trivial way to generate an independent hash is to assign a random integer in $[0, 2^L)$ to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values.
However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in $K_1 \times K_2 \times \cdots \times K_n$ such that $|K_i|$ is small for all i. We can construct independent hash functions $h_i : K_i \to [0, 2^L)$ for all i and combine them. The hash function
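The text is cut off before it says how the per-coordinate hash functions are combined. As a sketch under the assumption that they are combined by bitwise XOR, the following brute-force check enumerates every pair of lookup tables on a toy two-dimensional domain and tests k-wise independence directly. All names here are hypothetical.

```python
from itertools import combinations, product

L = 1                 # hash values are single bits in this toy example
K1, K2 = [0, 1], [0, 1]
tuples = list(product(K1, K2))

# Enumerate every pair of lookup tables (t1, t2); each pair defines one
# member of the family, h(a, b) = t1[a] XOR t2[b]  (XOR is an assumption).
family = []
for t1 in product(range(2 ** L), repeat=len(K1)):
    for t2 in product(range(2 ** L), repeat=len(K2)):
        family.append({(a, b): t1[a] ^ t2[b] for a, b in tuples})


def is_k_wise_independent(family, keys, k, L):
    """Check P(h(x_1)=y_1, ..., h(x_k)=y_k) = 1/2^(kL) for all distinct keys."""
    for subset in combinations(keys, k):
        for ys in product(range(2 ** L), repeat=k):
            hits = sum(1 for h in family
                       if all(h[x] == y for x, y in zip(subset, ys)))
            if hits * (2 ** L) ** k != len(family):
                return False
    return True


print(is_k_wise_independent(family, tuples, 3, L))
print(is_k_wise_independent(family, tuples, 4, L))
```

On this toy domain the XOR of the four hash values is always zero, so the family cannot be 4-wise independent.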
Randomized Karp–Rabin is not independent
One of the most common recursive hash functions is associated with the Karp–Rabin string-matching algorithm (Karp and Rabin, 1987). Given an integer B, the hash value over the sequence of integers $x_1, x_2, \ldots, x_n$ is $\sum_{i=1}^{n} x_i B^{n-i}$. A variation of the Karp–Rabin hash method is “Hashing by Power-of-2 Integer Division” (Cohen, 1997), where $h(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i B^{n-i} \bmod 2^L$. In particular, the hashCode method of the Java String class uses this approach, with $L = 32$ and $B = 31$ (Sun Microsystems, 2004). A
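The deterministic (non-randomized) variant described above can be sketched as follows, both as a direct evaluation and as a rolling update. The function names are hypothetical; the defaults $B = 31$ and $L = 32$ follow the Java String convention mentioned in the text, except that the result is kept as an unsigned integer.

```python
def karp_rabin_direct(symbols, B=31, L=32):
    """Evaluate sum(x_i * B^(n-i)) mod 2^L by Horner's rule."""
    h = 0
    for x in symbols:
        h = (h * B + x) % (1 << L)
    return h


def karp_rabin_rolling(sequence, n, B=31, L=32):
    """Hash every n-gram of sequence, updating the previous value in O(1) steps."""
    mask = (1 << L) - 1
    B_n = pow(B, n, 1 << L)          # weight of the symbol leaving the window
    h = 0
    hashes = []
    for i, x in enumerate(sequence):
        h = (h * B + x) & mask       # shift the window and add the new symbol
        if i >= n:
            h = (h - B_n * sequence[i - n]) & mask   # drop the oldest symbol
        if i >= n - 1:
            hashes.append(h)
    return hashes


data = [ord(c) for c in "recursive hashing"]
print(karp_rabin_rolling(data, 3)[:3])
print(karp_rabin_direct([ord(c) for c in "abc"]))
```

The rolling version needs only one multiplication by B and one by $B^n$ per n-gram, which is what makes the function recursive in the sense defined in the introduction.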
Generating hash families from polynomials over Galois fields
A practical form of hashing using the binary Galois field GF(2) is called “Recursive Hashing by Polynomials” and has been attributed to Kubina by Cohen (1997). GF(2) contains only two values (1 and 0) with the addition (and hence subtraction) defined by XOR, $a + b = a \oplus b$, and the multiplication by AND, $a \times b = a \wedge b$. GF(2)[x] is the vector space of all polynomials with coefficients from GF(2). Any integer in binary form (e.g., $c = 1101$) can thus be interpreted as an element of GF(2)[x] (e.g., $c = x^3 + x^2 + 1$). If
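The correspondence between integers and polynomials over GF(2) can be illustrated with a short sketch in which bit i of an integer holds the coefficient of $x^i$. The function names are hypothetical, and the code favors clarity over speed.

```python
def gf2_add(a, b):
    """Addition (and subtraction) in GF(2)[x] is bitwise XOR."""
    return a ^ b


def gf2_mul(a, b):
    """Carry-less multiplication of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:            # current coefficient of b is 1
            result ^= a      # add the shifted copy of a
        a <<= 1              # multiply a by x
        b >>= 1
    return result


def gf2_mod(a, p):
    """Remainder of a divided by the nonzero polynomial p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # cancel the leading term of a with a shifted copy of p
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


# (x + 1)^2 = x^2 + 1 over GF(2), because the cross terms cancel.
print(bin(gf2_mul(0b11, 0b11)))
# x^3 mod (x^3 + x + 1) = x + 1.
print(bin(gf2_mod(0b1000, 0b1011)))
```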
Recursive hashing by irreducible polynomials is pairwise independent
Algorithm 3. The recursive General family.
Require: an L-bit hash function $h_1$ over $\Sigma$ from an independent hash family; an irreducible polynomial p of degree L in GF(2)[x]
  s ← empty FIFO structure
  x ← 0 (L-bit integer)
  z ← 0 (L-bit integer)
  for each character c do
    append c to s
    x ← shift(x)
    z ← shift^n(z)
    x ← x − z + $h_1$(c)
    if length(s) = n then
      yield x
      remove oldest character y from s
      z ← $h_1$(y)
    end if
  end for

  function shift
    input L-bit integer x
    shift x left by 1 bit, storing result in an
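A rolling hash over GF(2)[x]/p(x) in the spirit of Algorithm 3 might look as follows. This is a sketch under stated assumptions: it takes the hash of an n-gram $a_1 \ldots a_n$ to be $\sum_i h_1(a_i)\, x^{n-i} \bmod p(x)$, uses a seeded random lookup table for $h_1$, and picks $p(x) = x^8 + x^4 + x^3 + x + 1$ (a known irreducible polynomial) for $L = 8$. The names `shift`, `general_direct`, `general_rolling`, and `make_table` are chosen for illustration.

```python
import random
from collections import deque


def shift(x, p, L):
    """Multiply x by the indeterminate modulo p, where p has degree L."""
    x <<= 1
    if x >> L:       # a term of degree L appeared: reduce it
        x ^= p
    return x


def general_direct(ngram, h1, p, L):
    """Evaluate the polynomial hash of one n-gram from scratch."""
    x = 0
    for c in ngram:
        x = shift(x, p, L) ^ h1[c]
    return x


def general_rolling(text, n, h1, p, L):
    """Yield the hash of every n-gram of text using the recursive update."""
    s = deque()
    x = 0
    z = 0                       # contribution of the character that just left
    for c in text:
        s.append(c)
        x = shift(x, p, L)
        for _ in range(n):      # n shifts: this is the O(Ln) step
            z = shift(z, p, L)
        x ^= z ^ h1[c]          # subtraction and addition are both XOR
        if len(s) == n:
            yield x
            z = h1[s.popleft()]


def make_table(alphabet, L, seed=42):
    """Seeded random L-bit value per character (stand-in for h1)."""
    rng = random.Random(seed)
    return {c: rng.getrandbits(L) for c in sorted(alphabet)}


L = 8
P = 0b100011011                 # x^8 + x^4 + x^3 + x + 1
text = "recursive n-gram hashing"
h1 = make_table(set(text), L)
print(list(general_rolling(text, 3, h1, P, L))[:4])
```

The inner loop of n shifts is the cost that a precomputed table of shifted values would remove, at the price of extra memory.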
Trading memory for speed: RAM-Buffered General
Unfortunately, General, as computed by Algorithm 3, requires $O(Ln)$ time per n-gram. Indeed, shifting a value n times in GF(2)[x]/p(x) requires $O(Ln)$ time. However, if we are willing to trade memory usage for speed, we can precompute these shifts. We call the resulting scheme RAM-Buffered General. Lemma 2. Pick any $p(x)$ in GF(2)[x]. The degree of $p(x)$ is L. Represent elements of GF(2)[x]/p(x) as polynomials of degree at most $L-1$. Given any h in GF(2)[x]/p(x), we can compute $x^n h(x)$ in O(L) time given an
Recursive hashing by cyclic polynomials is not even uniform
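Choosing $p(x) = x^L + 1$, for any polynomial $q(x) = \sum_{i=0}^{L-1} q_i x^i$, we have
$$x\, q(x) \bmod (x^L + 1) = q_{L-2} x^{L-1} + q_{L-3} x^{L-2} + \cdots + q_0 x + q_{L-1}.$$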
Thus, multiplication by $x$ is a bitwise rotation, a cyclic left shift, which can be computed in $O(L)$ time. The resulting hash (see Algorithm 4) is called Cyclic. It requires only $O(L)$ time per hash value. Empirically, Cohen showed that Cyclic is uniform (Cohen, 1997). In contrast, we show that it is not formally uniform:
Algorithm 4. The recursive Cyclic
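A rotation-based rolling hash of this kind can be sketched as follows. The sketch assumes that the hash of an n-gram is the XOR of the per-character values $h_1(a_i)$, each rotated left by $n - i$ bits; the function names and the seeded lookup table standing in for $h_1$ are hypothetical.

```python
import random


def rotate_left(x, r, L):
    """Rotate the L-bit integer x left by r bits."""
    r %= L
    mask = (1 << L) - 1
    return ((x << r) | (x >> (L - r))) & mask


def cyclic_direct(ngram, h1, L):
    """Evaluate the rotation-based hash of one n-gram from scratch."""
    x = 0
    for c in ngram:
        x = rotate_left(x, 1, L) ^ h1[c]
    return x


def cyclic_rolling(text, n, h1, L):
    """Hash every n-gram of text, updating the previous value each step."""
    x = 0
    hashes = []
    for i, c in enumerate(text):
        x = rotate_left(x, 1, L)
        if i >= n:
            # remove the oldest character, whose value has been rotated n times
            x ^= rotate_left(h1[text[i - n]], n, L)
        x ^= h1[c]
        if i >= n - 1:
            hashes.append(x)
    return hashes


L = 16
text = "recursive n-gram hashing"
rng = random.Random(7)
h1 = {c: rng.getrandbits(L) for c in sorted(set(text))}
print(cyclic_rolling(text, 4, h1, L)[:4])
```

Each update uses two rotations and two XORs on L-bit words, with no polynomial reduction step.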
Cyclic is pairwise independent if you remove $n-1$ consecutive bits
Because Cohen found empirically that Cyclic had good uniformity (Cohen, 1997), it is reasonable to expect Cyclic to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table 3 which shows that while $h(a,a)$ is not uniform ($h(a,a) = 001$ is impossible), $h(a,a)$ minus any bit is indeed uniformly distributed. We will prove that this result holds in general.
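This behavior can be checked by enumeration on a toy case. The sketch below assumes $L = 3$ and $n = 2$, and takes the hash of the 2-gram $(a, a)$ to be $h_1(a)$ rotated left by one bit, XORed with $h_1(a)$; it then lets $h_1(a)$ range over all 3-bit values. The helper names are hypothetical.

```python
from collections import Counter


def rotate_left(x, L):
    """Rotate the L-bit integer x left by one bit."""
    mask = (1 << L) - 1
    return ((x << 1) | (x >> (L - 1))) & mask


def drop_bit(x, i):
    """Remove bit i of x (bit 0 is the least significant) and close the gap."""
    low = x & ((1 << i) - 1)
    high = x >> (i + 1)
    return (high << i) | low


L = 3
# Hash of (a, a) for every possible value v of h1(a).
values = [rotate_left(v, L) ^ v for v in range(2 ** L)]

full_counts = Counter(values)
reduced_counts = [Counter(drop_bit(h, i) for h in values) for i in range(L)]

print(sorted(full_counts))                               # reachable 3-bit hash values
print([sorted(c.items()) for c in reduced_counts])       # distribution after dropping a bit
```

Every reachable value has an even number of set bits, so half of the 3-bit values never occur; once any single bit is dropped, the four remaining 2-bit values are equally likely.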
The next lemma and the next theorem show that Cyclic is quasi-pairwise independent in the sense
Experimental comparison
Irrespective of $p(x)$, computing hash values has complexity $\Omega(L)$. For General and Cyclic, we require $L \geq n$. Hence, the computation of their hash values is in $\Omega(n)$. For moderate values of L and n, this analysis is pessimistic because CPUs can process 32- or 64-bit words in one operation.
To assess their real-world performance, the various hashing algorithms were written in C++. We compiled them with the GNU GCC 4.0.1 compiler on an Apple MacBook with two Intel
Conclusion
Considering speed and pairwise independence, we recommend Cyclic, after discarding $n-1$ consecutive bits. If we require only uniformity, Randomized Integer-Division is twice as fast.
Acknowledgments
This work is supported by NSERC Grants 155967, 261437 and by FQRNT Grant 112381. The authors are grateful to the anonymous reviewers for their significant contributions.
References (25)
Cohen, 1998. Hardware-assisted algorithm for full-text large-dictionary string matching using n-gram hashing. Information Processing and Management.
Flajolet and Martin, 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences.
Li and Zhao, 2007. A fast and memory-efficient N-gram language model lookup method for large vocabulary continuous speech recognition. Computer Speech & Language.
Schwenk, 2007. Continuous space language models. Computer Speech & Language.
Cardenal-Lopez, A., Diéguez-Tirado, F.J., Garcia-Mateo, C., 2002. Fast LM look-ahead for large vocabulary continuous...
Carter and Wegman, 1979. Universal classes of hash functions. Journal of Computer and System Sciences.
Cohen, 1997. Recursive hashing functions for n-grams. ACM Transactions on Information Systems.
Cohen, 1998. An n-gram hash and skip algorithm for finding large numbers of keywords in continuous text streams. Software: Practice and Experience.
Cohen, 1999. Massive query resolution for rapid selective dissemination of information. Journal of the American Society for Information Science.
Durand, M., Flajolet, P., 2003. Loglog counting of large cardinalities. In: ESA’03, vol. 2832 of LNCS, pp....