Recursive n-gram hashing is pairwise independent, at best

doi:10.1016/j.csl.2009.12.001

Computer Speech & Language

Volume 24, Issue 4, October 2010, Pages 698-710

https://doi.org/10.1016/j.csl.2009.12.001

Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time $O (n)$ or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring $n - 1$ bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp–Rabin hash families are not pairwise independent.

Introduction

An n-gram is a consecutive sequence of n symbols from an alphabet $Σ$ . An n-gram hash function h maps n-grams to numbers in $[0, 2^{L})$ . These functions have several applications, ranging from full-text matching (Cohen, 1998a, Cohen, 1998b, Cohen, 1999), pattern matching (Tan et al., 2006), and language models (Cardenal-Lopez et al., 2002, Zhang and Zhao, 2002, Schwenk, 2007, Li and Zhao, 2007, Talbot and Osborne, 2007a, Talbot and Osborne, 2007b, Talbot and Brants, 2008) to plagiarism detection (Ribler and Abrams, 2000).

To prove that a hashing algorithm must work well, we typically need hash values to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a single integer would not be useful. Yet, a single hash function is deterministic: it maps an n-gram to a single hash value. Thus, we may be able to choose the input data so that the hash values are biased. Therefore, we randomly pick a function from a family $H$ of functions (Carter and Wegman, 1979).

Such a family $H$ is uniform (over L bits) if all hash values are equiprobable. That is, considering h selected uniformly at random from $H$ , we have $P (h (x) = y) = 1 / 2^{L}$ for all n-grams x and all hash values y. This condition is weak; the family of constant functions ( $h (x) = c$ for each c in $[0, 2^{L})$ ) is uniform, even though each of its members maps every n-gram to the same value.

Intuitively, we would want that if an adversary knows the hash value of one n-gram, it cannot deduce anything about the hash value of another n-gram. For example, with the family of constant functions, once we know one hash value, we know them all. The family $H$ is pairwise independent if the hash value of n-gram $x_{1}$ is independent from the hash value of any other n-gram $x_{2}$ . That is, we have $P (h (x_{1}) = y \land h (x_{2}) = z) = P (h (x_{1}) = y) P (h (x_{2}) = z) = 1 / 4^{L}$ for all n-grams $x_{1}, x_{2}$ , and all hash values y, z with $x_{1} \neq x_{2}$ . Pairwise independence implies uniformity. We refer to a particular hash function $h \in H$ as “uniform” or “a pairwise independent hash function” when the family in question can be inferred from the context.

Moreover, the idea of pairwise independence can be generalized: a family of hash functions $H$ is k-wise independent if given distinct $x_{1}, \dots, x_{k}$ and given h selected uniformly at random from $H$ , then $P (h (x_{1}) = y_{1} \land \dots \land h (x_{k}) = y_{k}) = 1 / 2^{kL}$ . Note that k-wise independence implies $(k - 1)$-wise independence and uniformity. (Fully) Independent families are k-wise independent for arbitrarily large k. For applications, non-independent families may fare as well as fully independent families if the entropy of the data source is sufficiently high (Mitzenmacher and Vadhan, 2008).
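The definitions above can be checked by brute force on very small families. The following Python sketch is only an illustration: the function name `is_k_wise_independent`, the three-element domain, and the choice $L = 2$ are assumptions made here, not part of the paper. It confirms that the family of constant functions is uniform but not pairwise independent, while the family of all functions on the toy domain is 3-wise independent.

```python
from collections import Counter
from itertools import combinations, product


def is_k_wise_independent(family, domain, L, k):
    """Exhaustively test k-wise independence of a finite hash family.

    `family` is a list of callables; selecting h uniformly at random from
    the family means selecting a uniformly random element of the list.
    """
    for xs in combinations(domain, k):  # distinct x_1, ..., x_k
        counts = Counter(tuple(h(x) for x in xs) for h in family)
        for ys in product(range(1 << L), repeat=k):
            # P(h(x_1) = y_1 and ... and h(x_k) = y_k) must be 1 / 2^(kL).
            # Compare integers to avoid floating-point error.
            if counts[ys] * (1 << (k * L)) != len(family):
                return False
    return True


L = 2
domain = ["x1", "x2", "x3"]  # hypothetical stand-ins for n-grams

# Family of constant functions h(x) = c, one function per c in [0, 2^L).
constants = [lambda x, c=c: c for c in range(1 << L)]

# Family of all functions from the toy domain to [0, 2^L).
all_functions = [
    dict(zip(domain, values)).__getitem__
    for values in product(range(1 << L), repeat=len(domain))
]

print(is_k_wise_independent(constants, domain, L, 1))      # uniformity
print(is_k_wise_independent(constants, domain, L, 2))      # pairwise
print(is_k_wise_independent(all_functions, domain, L, 3))  # 3-wise
```

With $k = 1$ the check reduces to uniformity, and with $k = 2$ to pairwise independence. The cost grows exponentially with $k$ and $L$, so this is useful only for toy cases.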

A hash function h is recursive (Cohen, 1997), or rolling (Schleimer et al., 2003), if there is a function F computing the hash value of the n-gram $(x_{2}, \dots, x_{n + 1})$ from the hash value of the preceding n-gram $(x_{1}, \dots, x_{n})$ and the values of $x_{1}$ and $x_{n + 1}$ . That is, we have $h (x_{2}, \dots, x_{n + 1}) = F (h (x_{1}, \dots, x_{n}), x_{1}, x_{n + 1})$ .

Ideally, we could compute function F in time $O (L)$ and not, for example, in time $O (Ln)$ .
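As one concrete instance of such a function F, consider the integer-division hash $\sum_{i = 1}^{n} x_{i} B^{n - i} \mod 2^{L}$ that the paper discusses in its Karp–Rabin section. The Python sketch below is an illustration under assumptions: the names `direct_hash` and `roll`, the sample text, and the window length are chosen here; $B = 31$ and $L = 32$ are the parameters the paper attributes to the Java String hash.

```python
def direct_hash(gram, B, L):
    """Hash an n-gram of integers from scratch: sum of x_i * B^(n-i) mod 2^L."""
    h = 0
    for x in gram:
        h = (h * B + x) % (1 << L)
    return h


def roll(h, x_out, x_in, n, B, L):
    """F(h, x_1, x_{n+1}): update the hash when the window slides by one.

    Multiplying by B raises every exponent by one, subtracting x_out * B^n
    removes the oldest symbol, and adding x_in appends the newest one.
    """
    mask = (1 << L) - 1
    return (B * h - x_out * pow(B, n, 1 << L) + x_in) & mask


B, L, n = 31, 32, 3
symbols = [ord(c) for c in "recursive hashing"]

h = direct_hash(symbols[:n], B, L)
rolled = [h]
for i in range(n, len(symbols)):
    h = roll(h, symbols[i - n], symbols[i], n, B, L)
    rolled.append(h)

direct = [direct_hash(symbols[i:i + n], B, L)
          for i in range(len(symbols) - n + 1)]
print(rolled == direct)
```

In this sketch the update performs a constant number of L-bit arithmetic operations regardless of n, provided $B^{n} \bmod 2^{L}$ is computed once and reused rather than recomputed on every step as it is here.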

The main contributions of this paper are:

• a proof that recursive hashing is no more than pairwise independent (Section 3);
• a proof that randomized Karp–Rabin can be uniform but never pairwise independent (Section 5);
• a proof that hashing by irreducible polynomials is pairwise independent (Section 7);
• a proof that hashing by cyclic polynomials is not even uniform (Section 9);
• a proof that hashing by cyclic polynomials is pairwise independent after ignoring $n - 1$ consecutive bits (Section 10).

We conclude with an experimental section where we show that hashing by cyclic polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithms presented.

Section snippets

Trailing-zero independence

Some randomized algorithms (Flajolet and Martin, 1985, Gibbons and Tirthapura, 2001) merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute maximal numbers of leading zeroes k among hash values (Durand and Flajolet, 2003). Naïvely, we may estimate that if a hash value with k leading zeroes is found, we have ≈ $2^{k}$ distinct n-grams. Such estimates might be

Recursive hash functions are no more than pairwise independent

Not only are recursive hash functions limited to pairwise independence: they cannot be 3-wise trailing-zero independent.

Proposition 1

There is no 3-wise trailing-zero independent hashing function that is recursive.

Proof

Consider the $(n + 2)$-gram $a^{n} bb$ . Suppose h is recursive and 3-wise trailing-zero independent. Then
$\begin{aligned} & P (\mathrm{zeros} (h (a, \dots, a)) \geq L \land \mathrm{zeros} (h (a, \dots, a, b)) \geq L \land \mathrm{zeros} (h (a, \dots, a, b, b)) \geq L) \\ &= P (h (a, \dots, a) = 0 \land F (0, a, b) = 0 \land F (0, a, b) = 0) \\ &= P (h (a, \dots, a) = 0 \land F (0, a, b) = 0) \\ &= P (\mathrm{zeros} (h (a, \dots, a)) \geq L \land \mathrm{zeros} (h (a, \dots, a, b)) \geq L) \\ &= 2^{- 2 L} \quad \text{(by trailing-zero pairwise independence)} \\ &\neq 2^{- 3 L} , \end{aligned}$
whereas 3-wise trailing-zero independence requires the first probability to equal $2^{- 3 L}$ . This is a contradiction.
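The argument can be observed on a toy recursive family. The Python sketch below is an illustration under assumptions: it borrows the polynomial hash (General) that the paper introduces in a later section, with the tiny parameters $L = 2$, $n = 2$ and $p (x) = x^{2} + x + 1$ chosen here, and it enumerates every assignment of $h_{1}$ over the alphabet {a, b}. For the 4-gram aabb, the probability that all three 2-gram hashes are zero equals the probability that the first two are zero.

```python
from fractions import Fraction
from itertools import product

L, N = 2, 2
P = 0b111  # x^2 + x + 1, irreducible over GF(2)


def shift(v):
    """Multiply by x in GF(2)[x] / p(x)."""
    v <<= 1
    return v ^ P if v >> L else v


def hash_ngram(gram, h1):
    """Polynomial hash of an n-gram, computed from scratch."""
    v = 0
    for c in gram:
        v = shift(v) ^ h1[c]
    return v


def zero_probabilities():
    """Return (P(first two hashes are 0), P(all three hashes are 0))."""
    total = first_two = all_three = 0
    for ha, hb in product(range(1 << L), repeat=2):  # every choice of h1
        h1 = {"a": ha, "b": hb}
        values = [hash_ngram(g, h1) for g in ("aa", "ab", "bb")]
        total += 1
        first_two += values[0] == 0 and values[1] == 0
        all_three += all(v == 0 for v in values)
    return Fraction(first_two, total), Fraction(all_three, total)


p_two, p_three = zero_probabilities()
print(p_two, p_three, Fraction(1, 2 ** (3 * L)))
```

Both probabilities come out as $2^{- 2 L} = 1 / 16$, not the $2^{- 3 L} = 1 / 64$ that 3-wise independence would require.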

A non-recursive 3-wise independent hash function

A trivial way to generate an independent hash is to assign a random integer in $[0, 2^{L})$ to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values.

However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in $K_{1} \times K_{2} \times \dots \times K_{n}$ such that $| K_{i} |$ is small for all i. We can construct independent hash functions $h_{i} : K_{i} \to [0, 2^{L})$ for all i and combine them. The hash function $h (x_{1}, x_{2}, \dots, x_{n}) = h_{1} (x_{1}) \oplus h_{2} (x_{2}) \oplus \dots \oplus h_{n} (x_{n})$
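The XOR combination above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `make_tuple_hash`, the seed parameter, and the example key spaces are assumptions, and Python's `random` module stands in for whatever source of randomness a real implementation would use.

```python
import random


def make_tuple_hash(key_spaces, L, seed=None):
    """Build h(x_1, ..., x_n) = h_1(x_1) XOR ... XOR h_n(x_n).

    Each h_i is a lookup table that assigns an independent random L-bit
    value to every key in K_i, so storage is proportional to the sum of
    the sizes |K_i| rather than to the number of tuples.
    """
    rng = random.Random(seed)
    tables = [{key: rng.getrandbits(L) for key in keys} for keys in key_spaces]

    def h(tup):
        value = 0
        for table, component in zip(tables, tup):
            value ^= table[component]
        return value

    return h, tables


# Hypothetical example with two small dimensions.
h, tables = make_tuple_hash([["x", "y"], [0, 1, 2]], L=16, seed=1)
print(h(("x", 2)) == tables[0]["x"] ^ tables[1][2])
```

One design consequence worth noting: the hash values of the four tuples (x, 0), (x, 1), (y, 0), (y, 1) always XOR to zero, so a family built this way cannot be 4-wise independent.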

Randomized Karp–Rabin is not independent

One of the most common recursive hash functions is associated with the Karp–Rabin string-matching algorithm (Karp and Rabin, 1987). Given an integer B, the hash value over the sequence of integers $x_{1}, x_{2}, \dots, x_{n}$ is $\sum_{i = 1}^{n} x_{i} B^{n - i}$ . A variation of the Karp–Rabin hash method is “Hashing by Power-of-2 Integer Division” (Cohen, 1997), where $h (x_{1}, \dots, x_{n}) = \sum_{i = 1}^{n} x_{i} B^{n - i} \mod 2^{L}$ . In particular, the hashCode method of the Java String class uses this approach, with $L = 32$ and $B = 31$ (Sun Microsystems, 2004). A

Generating hash families from polynomials over Galois fields

A practical form of hashing using the binary Galois field GF(2) is called “Recursive Hashing by Polynomials” and has been attributed to Kubina by Cohen (1997). GF(2) contains only two values (1 and 0) with the addition (and hence subtraction) defined by XOR, $a + b = a \oplus b$ and the multiplication by AND, $a \times b = a \land b$ . $GF (2) [x]$ is the vector space of all polynomials with coefficients from GF(2). Any integer in binary form (e.g., $c = 1101$ ) can thus be interpreted as an element of $GF (2) [x]$ (e.g., $c = x^{3} + x^{2} + 1$ ). If $p$

Recursive hashing by irreducible polynomials is pairwise independent

Algorithm 3. The recursive General family.
Require: an L-bit hash function $h_{1}$ over $Σ$ from an independent hash family; an irreducible polynomial p of degree L in $GF (2) [x]$
$s \leftarrow$ empty FIFO structure
$x \leftarrow 0$ (L-bit integer)
$z \leftarrow 0$ (L-bit integer)
for each character c do
    append c to s
    $x \leftarrow shift (x)$
    $z \leftarrow {shift}^{n} (z)$
    $x \leftarrow x \oplus z \oplus h_{1} (c)$
    if length(s) = n then
        yield x
        remove oldest character y from s
        $z \leftarrow h_{1} (y)$
    end if
end for

function shift
    input L-bit integer x
    shift x left by 1 bit, storing result in an $L + 1$
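The main loop of Algorithm 3 can be sketched in Python as follows. This is an illustration under assumptions: the `shift` listing is cut off in this excerpt, so the version here implements the standard multiplication by x in $GF (2) [x] / p (x)$; the names `general_hashes` and `direct_hash`, the choice $L = 8$ with $p (x) = x^{8} + x^{4} + x^{3} + x + 1$ (0x11B), and the seeded random table used for $h_{1}$ are all choices made for the example.

```python
import random
from collections import deque


def shift(v, p, L):
    """Multiply v by x in GF(2)[x] / p(x), where p has degree L."""
    v <<= 1                 # the result may need L + 1 bits
    if (v >> L) & 1:
        v ^= p              # subtract (XOR) p to clear bit L
    return v


def general_hashes(text, n, h1, p, L):
    """Yield the hash value of every n-gram of `text`, as in Algorithm 3."""
    s = deque()             # FIFO of the characters currently in the window
    x = 0
    z = 0
    for c in text:
        s.append(c)
        x = shift(x, p, L)
        for _ in range(n):  # z <- shift^n(z): this is the O(nL) step
            z = shift(z, p, L)
        x ^= z ^ h1[c]
        if len(s) == n:
            yield x
            z = h1[s.popleft()]


def direct_hash(gram, h1, p, L):
    """Non-recursive reference: hash one n-gram from scratch."""
    v = 0
    for c in gram:
        v = shift(v, p, L) ^ h1[c]
    return v


L, P, n = 8, 0x11B, 3
rng = random.Random(0)
h1 = {c: rng.getrandbits(L) for c in "abcd"}  # stand-in for a random h_1
text = "abcdabccbadd"

rolled = list(general_hashes(text, n, h1, P, L))
direct = [direct_hash(text[i:i + n], h1, P, L)
          for i in range(len(text) - n + 1)]
print(rolled == direct)
```

The comparison with `direct_hash` checks that removing the oldest character through z leaves exactly the hash of the current window.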

Trading memory for speed: RAM-Buffered General

Unfortunately, General—as computed by Algorithm 3—requires $O (nL)$ time per n-gram. Indeed, shifting a value n times in $GF (2) [x] / p (x)$ requires $O (nL)$ time. However, if we are willing to trade memory usage for speed, we can precompute these shifts. We call the resulting scheme RAM-Buffered General.
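One way such a precomputation could work is sketched below in Python; it is an assumption made for illustration, not necessarily the construction the authors use. Write an L-bit value h as its top n bits and its remaining low bits. Multiplying the low part by $x^{n}$ needs no reduction, and the reduced product of the top part with $x^{L}$ can be read from a table with $2^{n}$ entries of L bits each. The names `build_table` and `shift_n` and the parameters $L = 8$, $n = 3$, $p (x) = x^{8} + x^{4} + x^{3} + x + 1$ are hypothetical choices, and the sketch requires $n \leq L$.

```python
def shift(v, p, L):
    """Multiply v by x in GF(2)[x] / p(x), where p has degree L."""
    v <<= 1
    if (v >> L) & 1:
        v ^= p
    return v


def build_table(p, L, n):
    """table[hi] = (hi * x^L) mod p(x) for every n-bit value hi."""
    table = []
    for hi in range(1 << n):
        v = hi << L                          # degree at most L + n - 1
        for bit in range(L + n - 1, L - 1, -1):
            if (v >> bit) & 1:
                v ^= p << (bit - L)          # cancel the term x^bit
        table.append(v)
    return table


def shift_n(h, table, L, n):
    """Compute x^n * h mod p(x) with one table lookup."""
    hi = h >> (L - n)                        # top n bits overflow past x^(L-1)
    lo = h & ((1 << (L - n)) - 1)            # low bits are shifted in place
    return (lo << n) ^ table[hi]


L, P, n = 8, 0x11B, 3
table = build_table(P, L, n)


def slow_shift_n(h):
    """Reference: apply the one-bit shift n times."""
    for _ in range(n):
        h = shift(h, P, L)
    return h


print(all(shift_n(h, table, L, n) == slow_shift_n(h) for h in range(1 << L)))
```

The table grows exponentially with n, which matches the paper's remark that this variant trades memory for speed.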

Lemma 2

Pick any $p (x)$ of degree L in $GF (2) [x]$ . Represent elements of $GF (2) [x] / p (x)$ as polynomials of degree at most $L - 1$ . Given any h in $GF (2) [x] / p (x)$ , we can compute $x^{n} h$ in $O (L)$ time given an $O (L 2^{n})$

Recursive hashing by cyclic polynomials is not even uniform

Choosing $p (x) = x^{L} + 1$ for $L ⩾ n$ , for any polynomial $q (x) = \sum_{j = 0}^{L - 1} q_{j} x^{j}$ , we have, modulo $p (x)$ , $x^{i} q (x) = x^{i} (q_{L - 1} x^{L - 1} + \dots + q_{1} x + q_{0}) = q_{L - i - 1} x^{L - 1} + \dots + q_{L - i + 1} x + q_{L - i} .$

Thus, we have that multiplication by $x^{i}$ is a bitwise rotation, a cyclic left shift—which can be computed in $O (L)$ time. The resulting hash (see Algorithm 4) is called Cyclic. It requires only $O (L)$ time per hash value. Empirically, Cohen showed that Cyclic is uniform (Cohen, 1997). In contrast, we show that it is not formally uniform:

Algorithm 4. The recursive Cyclic
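The listing of Algorithm 4 is not included in this excerpt. By analogy with Algorithm 3, with the shift replaced by a bitwise rotation as described above, a Cyclic hash could be sketched in Python as follows. The names `rotl`, `cyclic_hashes` and `direct_hash`, the parameters $L = 16$ and $n = 4$, and the seeded random table standing in for $h_{1}$ are assumptions made for the example.

```python
import random
from collections import deque


def rotl(v, r, L):
    """Rotate the L-bit integer v left by r positions."""
    r %= L
    mask = (1 << L) - 1
    return ((v << r) | (v >> (L - r))) & mask


def cyclic_hashes(text, n, h1, L):
    """Yield the Cyclic hash of every n-gram of `text` (requires L >= n)."""
    s = deque()
    x = 0
    z = 0
    for c in text:
        s.append(c)
        # Rotate the running hash by one bit, cancel the character that
        # left the window (rotated n times), and add the new character.
        x = rotl(x, 1, L) ^ rotl(z, n, L) ^ h1[c]
        if len(s) == n:
            yield x
            z = h1[s.popleft()]


def direct_hash(gram, h1, L):
    """Non-recursive reference: XOR of rotated character hashes."""
    n = len(gram)
    v = 0
    for i, c in enumerate(gram):
        v ^= rotl(h1[c], n - 1 - i, L)
    return v


L, n = 16, 4
rng = random.Random(0)
h1 = {c: rng.getrandbits(L) for c in "abcd"}
text = "abcdabccbaddab"

rolled = list(cyclic_hashes(text, n, h1, L))
direct = [direct_hash(text[i:i + n], h1, L)
          for i in range(len(text) - n + 1)]
print(rolled == direct)
```

Unlike the General sketch, the n-fold operation here is a single rotation, so each update costs a constant number of L-bit operations.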

Cyclic is pairwise independent if you remove $n - 1$ consecutive bits

Because Cohen found empirically that Cyclic had good uniformity (Cohen, 1997), it is reasonable to expect Cyclic to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table 3 which shows that while $h (a, a)$ is not uniform ( $h (a, a) = 001$ is impossible), $h (a, a)$ minus any bit is indeed uniformly distributed. We will prove that this result holds in general.
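The small case mentioned above ( $n = 2$ , $L = 3$ ) can be enumerated exhaustively. The Python sketch below is an illustration under assumptions: it uses a two-letter alphabet {a, b}, enumerates all 64 choices of $h_{1}$, and discards the lowest $n - 1$ bits (one particular choice of consecutive bits). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

L, N = 3, 2
MASK = (1 << L) - 1


def rotl1(v):
    """Rotate an L-bit value left by one bit."""
    return ((v << 1) | (v >> (L - 1))) & MASK


def cyclic(gram, h1):
    """Cyclic hash of an n-gram, computed from scratch."""
    v = 0
    for c in gram:
        v = rotl1(v) ^ h1[c]
    return v


def drop_low_bits(v):
    """Discard the n - 1 lowest bits."""
    return v >> (N - 1)


def joint_counts(g1, g2, keep):
    """Count how often each pair of (kept) hash values occurs over all h1."""
    counts = Counter()
    for ha, hb in product(range(1 << L), repeat=2):
        h1 = {"a": ha, "b": hb}
        counts[(keep(cyclic(g1, h1)), keep(cyclic(g2, h1)))] += 1
    return counts


def is_pairwise_independent(keep, kept_bits):
    grams = ["".join(g) for g in product("ab", repeat=N)]
    # Number of h1 choices that must map to each pair of hash values.
    expected = (1 << (2 * L)) >> (2 * kept_bits)
    values = range(1 << kept_bits)
    return all(
        joint_counts(g1, g2, keep)[(y, z)] == expected
        for g1, g2 in combinations(grams, 2)
        for y in values
        for z in values
    )


reachable = sorted({cyclic("aa", {"a": v, "b": 0}) for v in range(1 << L)})
print(reachable)                                             # values of h(a, a)
print(is_pairwise_independent(lambda v: v, L))               # full L bits
print(is_pairwise_independent(drop_low_bits, L - (N - 1)))   # n - 1 bits dropped
```

In this toy case, $h (a, a)$ only takes values with an even number of set bits, so 001 never occurs, yet the check succeeds once one bit is discarded.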

The next lemma and the next theorem show that Cyclic is quasi-pairwise independent in the sense

Experimental comparison

Irrespective of $p (x)$ , computing hash values has complexity $Ω (L)$ . For General and Cyclic, we require $L ⩾ n$ . Hence, the computation of their hash values is in $Ω (n)$ . For moderate values of L and n, this analysis is pessimistic because CPUs can process 32- or 64-bit words in one operation.

To assess their real-world performance, the various hashing algorithms⁵ were written in C++. We compiled them with the GNU GCC 4.0.1 compiler on an Apple MacBook with two Intel

Conclusion

Considering speed and pairwise independence, we recommend Cyclic—after discarding $n - 1$ consecutive bits. If we require only uniformity, Randomized Integer-Division is twice as fast.

Acknowledgments

This work is supported by NSERC Grants 155967, 261437 and by FQRNT Grant 112381. The authors are grateful to the anonymous reviewers for their significant contributions.

References (25)

Cohen, J.D., 1998. Hardware-assisted algorithm for full-text large-dictionary string matching using n-gram hashing. Information Processing and Management.

Flajolet, P., et al., 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences.

Li, X., et al., 2007. A fast and memory-efficient N-gram language model lookup method for large vocabulary continuous speech recognition. Computer Speech & Language.

Schwenk, H., 2007. Continuous space language models. Computer Speech & Language.

Cardenal-Lopez, A., Diguez-Tirado, F.J., Garcia-Mateo, C., 2002. Fast LM look-ahead for large vocabulary continuous...

Carter, L., et al., 1979. Universal classes of hash functions. Journal of Computer and System Sciences.

Cohen, J.D., 1997. Recursive hashing functions for n-grams. ACM Transactions on Information Systems.

Cohen, J.D., 1998. An n-gram hash and skip algorithm for finding large numbers of keywords in continuous text streams. Software – Practice Experience.

Cohen, J.D., 1999. Massive query resolution for rapid selective dissemination of information. Journal of the American Society for Information Science.

Durand, M., Flajolet, P., 2003. Loglog counting of large cardinalities. In: ESA’03, vol. 2832 of LNCS, pp....

Fürer, M., 2007. Faster integer multiplication. In: STOC ’07, pp....

Gibbons, P.B., Tirthapura, S., 2001. Estimating simple functions on the union of data streams. In: SPAA’01, pp....

