Recursive n-gram hashing is pairwise independent, at best

https://doi.org/10.1016/j.csl.2009.12.001

Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp–Rabin hash families are not pairwise independent.

Introduction

An n-gram is a consecutive sequence of n symbols from an alphabet Σ. An n-gram hash function h maps n-grams to numbers in $[0,2^L)$. These functions have several applications, from full-text matching (Cohen, 1998a, Cohen, 1998b, Cohen, 1999), pattern matching (Tan et al., 2006), and language models (Cardenal-Lopez et al., 2002, Zhang and Zhao, 2002, Schwenk, 2007, Li and Zhao, 2007, Talbot and Osborne, 2007a, Talbot and Osborne, 2007b, Talbot and Brants, 2008) to plagiarism detection (Ribler and Abrams, 2000).

To prove that a hashing algorithm must work well, we typically need hash values to satisfy some statistical property. Indeed, a hash function that maps all n-grams to a single integer would not be useful. Yet, a single hash function is deterministic: it maps an n-gram to a single hash value. Thus, we may be able to choose the input data so that the hash values are biased. Therefore, we randomly pick a function from a family H of functions (Carter and Wegman, 1979).

Such a family H is uniform (over L bits) if all hash values are equiprobable. That is, considering h selected uniformly at random from H, we have $P(h(x)=y)=1/2^L$ for all n-grams x and all hash values y. This condition is weak; the family of constant functions ($h(x)=c$) is uniform.

Intuitively, we would want that if an adversary knows the hash value of one n-gram, it cannot deduce anything about the hash value of another n-gram. For example, with the family of constant functions, once we know one hash value, we know them all. The family H is pairwise independent if the hash value of n-gram $x_1$ is independent from the hash value of any other n-gram $x_2$. That is, we have $P(h(x_1)=y \land h(x_2)=z)=P(h(x_1)=y)\,P(h(x_2)=z)=1/4^L$ for all n-grams $x_1, x_2$ with $x_1 \neq x_2$, and all hash values y, z. Pairwise independence implies uniformity. We refer to a particular hash function $h \in H$ as “uniform” or “a pairwise independent hash function” when the family in question can be inferred from the context.

Moreover, the idea of pairwise independence can be generalized: a family of hash functions H is k-wise independent if given distinct $x_1,\ldots,x_k$ and given h selected uniformly at random from H, then $P(h(x_1)=y_1 \land \cdots \land h(x_k)=y_k)=1/2^{kL}$. Note that k-wise independence implies (k-1)-wise independence and uniformity. (Fully) Independent families are k-wise independent for arbitrarily large k. For applications, non-independent families may fare as well as fully independent families if the entropy of the data source is sufficiently high (Mitzenmacher and Vadhan, 2008).
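The definitions above can be checked by brute force on very small families. The following Python sketch is only an illustration: the function name `is_k_wise_independent`, the three-element domain, and the choice L = 1 are assumptions made for the example and do not come from the paper. It counts, for every choice of k distinct inputs and k target values, how many functions in the family hit those targets.

```python
from fractions import Fraction
from itertools import combinations, product

def is_k_wise_independent(family, domain, L, k):
    """Exhaustively test k-wise independence of a finite family.

    family: list of dicts mapping every element of domain to [0, 2**L).
    With k = 1 this is the uniformity test.
    """
    target = Fraction(1, 2 ** (k * L))
    for xs in combinations(domain, k):
        for ys in product(range(2 ** L), repeat=k):
            hits = sum(all(h[x] == y for x, y in zip(xs, ys)) for h in family)
            if Fraction(hits, len(family)) != target:
                return False
    return True

# Hypothetical toy setting: three 2-grams, 1-bit hash values.
domain = ["aa", "ab", "ba"]
L = 1
constant_family = [{x: c for x in domain} for c in range(2 ** L)]
all_functions = [dict(zip(domain, values))
                 for values in product(range(2 ** L), repeat=len(domain))]

print(is_k_wise_independent(constant_family, domain, L, 1))  # uniformity
print(is_k_wise_independent(constant_family, domain, L, 2))  # pairwise
print(is_k_wise_independent(all_functions, domain, L, 3))    # 3-wise
```

The constant family passes the uniformity test but fails the pairwise test, while the family of all functions on this tiny domain passes for every k up to the domain size. The check is exponential in k and L, so it is only usable for toy parameters.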

A hash function h is recursive (Cohen, 1997), or rolling (Schleimer et al., 2003), if there is a function F computing the hash value of the n-gram $(x_2,\ldots,x_{n+1})$ from the hash value of the preceding n-gram $(x_1,\ldots,x_n)$ and the values of $x_1$ and $x_{n+1}$. That is, we have $h(x_2,\ldots,x_{n+1})=F(h(x_1,\ldots,x_n),x_1,x_{n+1})$.

Ideally, we could compute function F in time O(L) and not, for example, in time O(Ln).
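To make the role of F concrete, here is a minimal Python sketch that uses the integer-division hash $h(x_1,\ldots,x_n)=\sum_{i=1}^{n} x_i B^{n-i} \bmod 2^L$ as an example family. The parameter values B = 31 and L = 32 and the names `direct_hash` and `F` are illustrative assumptions; the point is only that the update needs the old hash, the outgoing symbol and the incoming symbol, not the whole n-gram.

```python
def direct_hash(ngram, B=31, L=32):
    """Hash an n-gram from scratch: sum of x_i * B^(n-i), modulo 2^L."""
    h = 0
    for x in ngram:
        h = (h * B + x) % (1 << L)
    return h

def F(h, x_old, x_new, n, B=31, L=32):
    """Recursive update: drop x_old (weight B^n after scaling), add x_new."""
    m = 1 << L
    return (h * B - x_old * pow(B, n, m) + x_new) % m

symbols = list(b"recursive hashing")   # symbols as small integers
n = 4
h = direct_hash(symbols[:n])
rolled = [h]
for i in range(len(symbols) - n):
    h = F(h, symbols[i], symbols[i + n], n)
    rolled.append(h)

# Each rolled value should match hashing the same window from scratch.
print(rolled == [direct_hash(symbols[i:i + n])
                 for i in range(len(symbols) - n + 1)])
```

The update costs a constant number of arithmetic operations on L-bit words whatever the value of n, whereas `direct_hash` performs n of them.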

The main contributions of this paper are:

  • a proof that recursive hashing is no more than pairwise independent (Section 3);

  • a proof that randomized Karp–Rabin can be uniform but never pairwise independent (Section 5);

  • a proof that hashing by irreducible polynomials is pairwise independent (Section 7);

  • a proof that hashing by cyclic polynomials is not even uniform (Section 9);

  • a proof that hashing by cyclic polynomials is pairwise independent—after ignoring n-1 consecutive bits (Section 10).

We conclude with an experimental section where we show that hashing by cyclic polynomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithms presented.

Trailing-zero independence

Some randomized algorithms (Flajolet and Martin, 1985, Gibbons and Tirthapura, 2001) merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute maximal numbers of leading zeroes k among hash values (Durand and Flajolet, 2003). Naïvely, we may estimate that if a hash value with k leading zeroes is found, we have $\approx 2^k$ distinct n-grams. Such estimates might be

Recursive hash functions are no more than pairwise independent

Not only are recursive hash functions limited to pairwise independence: they cannot be 3-wise trailing-zero independent.

Proposition 1

There is no 3-wise trailing-zero independent hashing function that is recursive.

Proof

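Consider the (n+2)-gram $a^n b b$. Suppose h is recursive and 3-wise trailing-zero independent, then

$$
\begin{aligned}
&P\big(\mathrm{zeros}(h(a,\ldots,a)) \ge L \land \mathrm{zeros}(h(a,\ldots,a,b)) \ge L \land \mathrm{zeros}(h(a,\ldots,a,b,b)) \ge L\big) \\
&\quad= P\big(h(a,\ldots,a)=0 \land F(0,a,b)=0 \land F(0,a,b)=0\big) \\
&\quad= P\big(h(a,\ldots,a)=0 \land F(0,a,b)=0\big) \\
&\quad= P\big(\mathrm{zeros}(h(a,\ldots,a)) \ge L \land \mathrm{zeros}(h(a,\ldots,a,b)) \ge L\big) \\
&\quad= 2^{-2L} \quad \text{(by trailing-zero pairwise independence)} \\
&\quad\neq 2^{-3L}
\end{aligned}
$$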

A non-recursive 3-wise independent hash function

A trivial way to generate an independent hash is to assign a random integer in $[0,2^L)$ to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values.

However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in $K_1 \times K_2 \times \cdots \times K_n$ such that $|K_i|$ is small for all i. We can construct independent hash functions $h_i : K_i \to [0,2^L)$ for all i and combine them. The hash function $h(x_1,x_2,\ldots,x_n)=h_1(x_1)\oplus h_2(x_2)\oplus\cdots\oplus h_n(x_n)$
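A possible reading of this construction in Python is sketched below. It assumes that the per-dimension functions are combined with a bitwise exclusive or and that each $h_i$ is stored as a lookup table of random L-bit values; the helper names `random_table` and `xor_hash`, the key sets, and the use of Python's `random` module as a stand-in for a truly random source are all assumptions of the example.

```python
import random

def random_table(keys, L, rng):
    """One independent hash function h_i, stored as a table over a small key set."""
    return {k: rng.randrange(1 << L) for k in keys}

def xor_hash(tables, tup):
    """Combine the per-dimension hash values of a tuple with XOR."""
    h = 0
    for table, x in zip(tables, tup):
        h ^= table[x]
    return h

rng = random.Random(2010)               # fixed seed, for reproducibility only
L = 16
key_sets = [["red", "green"], ["S", "M", "L"], [0, 1]]   # hypothetical K_1, K_2, K_3
tables = [random_table(keys, L, rng) for keys in key_sets]
value = xor_hash(tables, ("green", "M", 1))
```

Storage is proportional to the sum of the $|K_i|$ rather than to their product, which is why the construction is attractive when every $|K_i|$ is small.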

Randomized Karp–Rabin is not independent

One of the most common recursive hash functions is associated with the Karp–Rabin string-matching algorithm (Karp and Rabin, 1987). Given an integer B, the hash value over the sequence of integers $x_1,x_2,\ldots,x_n$ is $\sum_{i=1}^{n} x_i B^{n-i}$. A variation of the Karp–Rabin hash method is “Hashing by Power-of-2 Integer Division” (Cohen, 1997), where $h(x_1,\ldots,x_n)=\sum_{i=1}^{n} x_i B^{n-i} \bmod 2^L$. In particular, the hashcode method of the Java String class uses this approach, with L=32 and B=31 (Sun Microsystems, 2004). A

Generating hash families from polynomials over Galois fields

A practical form of hashing using the binary Galois field GF(2) is called “Recursive Hashing by Polynomials” and has been attributed to Kubina by Cohen (1997). GF(2) contains only two values (1 and 0) with the addition (and hence subtraction) defined by XOR, $a+b=a\oplus b$, and the multiplication by AND, $a\times b=a\land b$. GF(2)[x] is the vector space of all polynomials with coefficients from GF(2). Any integer in binary form (e.g., c=1101) can thus be interpreted as an element of GF(2)[x] (e.g., $c=x^3+x^2+1$). If p

Recursive hashing by irreducible polynomials is pairwise independent

Algorithm 3. The recursive General family.
Require: an L-bit hash function h1 over Σ from an independent hash family; an irreducible polynomial p of degree L in GF(2)[x]
s ← empty FIFO structure
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)
for each character c do
    append c to s
    x ← shift(x)
    z ← shift^n(z)
    x ← x ⊕ z ⊕ h1(c)
    if length(s) = n then
        yield x
        remove oldest character y from s
        z ← h1(y)
    end if
end for

function shift
input L-bit integer x
shift x left by 1 bit, storing result in an L+1
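Algorithm 3 can be sketched in Python as follows. Several details are assumptions rather than material from the text: the body of `shift` is cut off in this excerpt, so the version here (shift left, then XOR with p when the degree reaches L) is the usual way to multiply by x in GF(2)[x]/p(x); the polynomial 0x11B ($x^8+x^4+x^3+x+1$, known to be irreducible) with L = 8 is just a convenient example; and `h1` is a table filled from Python's `random` module as a stand-in for an independent hash family.

```python
from collections import deque
import random

def shift(v, p, L):
    """Multiply v by x in GF(2)[x]/p(x); p is given with its x^L term."""
    v <<= 1
    if (v >> L) & 1:          # degree reached L: reduce by XORing with p
        v ^= p
    return v

def general_hashes(text, n, h1, p, L):
    """Yield the hash value of every n-gram of text, following Algorithm 3."""
    s = deque()
    x = z = 0
    for c in text:
        s.append(c)
        x = shift(x, p, L)
        for _ in range(n):    # shift^n(z): n successive shifts
            z = shift(z, p, L)
        x ^= z ^ h1[c]
        if len(s) == n:
            yield x
            z = h1[s.popleft()]

def direct_hash(ngram, h1, p, L):
    """Non-recursive reference: sum of x^(n-i) * h1(x_i) modulo p."""
    x = 0
    for c in ngram:
        x = shift(x, p, L) ^ h1[c]
    return x

L, P, N = 8, 0x11B, 3
rng = random.Random(42)
h1 = {c: rng.randrange(1 << L) for c in "abcdr"}
text = "abracadabra"
rolled = list(general_hashes(text, N, h1, P, L))
print(rolled == [direct_hash(text[i:i + N], h1, P, L)
                 for i in range(len(text) - N + 1)])
```

The inner loop over `range(n)` is the part that makes each update depend on n.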

Trading memory for speed: RAM-Buffered General

Unfortunately, General—as computed by Algorithm 3—requires O(nL) time per n-gram. Indeed, shifting a value n times in GF(2)[x]/p(x) requires O(nL) time. However, if we are willing to trade memory usage for speed, we can precompute these shifts. We call the resulting scheme RAM-Buffered General.
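One way to precompute the shifts is sketched below in Python. It is an assumption about how such a buffer could be organised, not a transcription of the paper's implementation: the top n bits of h are the only ones that overflow when multiplying by $x^n$, so their contribution is looked up in a table with $2^n$ entries of L bits, and the remaining bits are simply shifted. The names `build_table` and `shift_n`, and the example parameters (L = 8, n = 3, p = 0x11B), are hypothetical, and the sketch requires n ≤ L.

```python
def shift(v, p, L):
    """Multiply v by x in GF(2)[x]/p(x); p is given with its x^L term."""
    v <<= 1
    if (v >> L) & 1:
        v ^= p
    return v

def build_table(n, p, L):
    """table[t] = t(x) * x^L mod p(x), for every n-bit polynomial t."""
    table = []
    for t in range(1 << n):
        v = t << (L - n)          # place t in the top n bits of an L-bit word
        for _ in range(n):        # multiply by x^n, reducing as we go
            v = shift(v, p, L)
        table.append(v)
    return table

def shift_n(h, n, table, L):
    """Compute x^n * h mod p(x) with one shift and one table lookup."""
    low = h & ((1 << (L - n)) - 1)   # bits that stay below degree L
    high = h >> (L - n)              # bits that overflow past degree L-1
    return (low << n) ^ table[high]

L, N, P = 8, 3, 0x11B
TABLE = build_table(N, P, L)
```

Replacing the `shift^n(z)` step of Algorithm 3 with `shift_n` removes the loop over n at the price of the table.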

Lemma 2

Pick any p(x) of degree L in GF(2)[x]. Represent elements of GF(2)[x]/p(x) as polynomials of degree at most L-1. Given any h in GF(2)[x]/p(x), we can compute $x^n h$ in O(L) time given an $O(L 2^n)$

Recursive hashing by cyclic polynomials is not even uniform

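Choosing $p(x)=x^L+1$ with $L \ge n$, for any polynomial $q(x)=\sum_{i=0}^{L-1} q_i x^i$, we have

$$x^i q(x) = x^i\left(q_{L-1}x^{L-1}+\cdots+q_1 x+q_0\right) = q_{L-i-1}x^{L-1}+\cdots+q_{L-i+1}x+q_{L-i},$$

where the indices of the coefficients are taken modulo L.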

Thus, multiplication by $x^i$ is a bitwise rotation (a cyclic left shift), which can be computed in O(L) time. The resulting hash (see Algorithm 4) is called Cyclic. It requires only O(L) time per hash value. Empirically, Cohen showed that Cyclic is uniform (Cohen, 1997). In contrast, we show that it is not formally uniform:

Algorithm 4. The recursive Cyclic
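The body of Algorithm 4 is not included in this excerpt. The Python sketch below is therefore an assumption: it takes the structure of Algorithm 3 and replaces multiplication by x with a left rotation, as described in the text. The names `rotl` and `cyclic_hashes`, the word size L = 16, and the table `h1` filled from Python's `random` module are illustrative only.

```python
from collections import deque
import random

def rotl(v, r, L):
    """Rotate the L-bit word v left by r positions (multiply by x^r mod x^L + 1)."""
    r %= L
    return ((v << r) | (v >> (L - r))) & ((1 << L) - 1)

def cyclic_hashes(text, n, h1, L):
    """Yield a cyclic-polynomial hash value for every n-gram of text."""
    s = deque()
    x = 0
    for c in text:
        s.append(c)
        x = rotl(x, 1, L) ^ h1[c]
        if len(s) > n:                       # cancel the character leaving the window
            x ^= rotl(h1[s.popleft()], n, L)
        if len(s) == n:
            yield x

def direct_cyclic(ngram, h1, L):
    """Non-recursive reference: XOR of h1(x_i) rotated by n - i positions."""
    x = 0
    for c in ngram:
        x = rotl(x, 1, L) ^ h1[c]
    return x

L, N = 16, 4
rng = random.Random(7)
h1 = {c: rng.randrange(1 << L) for c in "abcdr"}
text = "abracadabra"
rolled = list(cyclic_hashes(text, N, h1, L))
print(rolled == [direct_cyclic(text[i:i + N], h1, L)
                 for i in range(len(text) - N + 1)])
```

Each update uses two rotations and two XORs on L-bit words, with no loop over n and no lookup table beyond `h1`.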

Cyclic is pairwise independent if you remove n-1 consecutive bits

Because Cohen found empirically that Cyclic had good uniformity (Cohen, 1997), it is reasonable to expect Cyclic to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table 3 which shows that while h(a,a) is not uniform (h(a,a)=001 is impossible), h(a,a) minus any bit is indeed uniformly distributed. We will prove that this result holds in general.
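The claim about h(a,a) can be verified by enumeration. The Python sketch below assumes n = 2 and L = 3 (inferred from the 3-bit value 001 quoted above) and takes $h(a,a)$ to be $h_1(a)$ rotated left by one bit, XORed with $h_1(a)$, with $h_1(a)$ ranging over all 3-bit values. The helper names `rotl1` and `drop_bit` are hypothetical.

```python
from collections import Counter

L = 3
MASK = (1 << L) - 1

def rotl1(v):
    """Rotate an L-bit word left by one position."""
    return ((v << 1) | (v >> (L - 1))) & MASK

def drop_bit(v, i):
    """Remove bit i from v and close the gap, giving an (L-1)-bit value."""
    low = v & ((1 << i) - 1)
    high = v >> (i + 1)
    return (high << i) | low

# Distribution of h(a,a) when h1(a) is uniform over all 3-bit values.
full = Counter(rotl1(v) ^ v for v in range(1 << L))

# Same distribution after discarding one bit, for each bit position.
reduced = [Counter(drop_bit(h, i) for h in full.elements()) for i in range(L)]

print(sorted(full.items()))
for i, dist in enumerate(reduced):
    print(i, sorted(dist.items()))
```

Only values with an even number of set bits appear in `full`, since a word and its rotation have the same number of ones; after dropping any single bit, every 2-bit value occurs equally often.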

The next lemma and the next theorem show that Cyclic is quasi-pairwise independent in the sense

Experimental comparison

Irrespective of p(x), computing hash values has complexity Ω(L). For General and Cyclic, we require $L \ge n$. Hence, the computation of their hash values is in Ω(n). For moderate values of L and n, this analysis is pessimistic because CPUs can process 32- or 64-bit words in one operation.

To assess their real-world performance, the various hashing algorithms were written in C++. We compiled them with the GNU GCC 4.0.1 compiler on an Apple MacBook with two Intel

Conclusion

Considering speed and pairwise independence, we recommend Cyclic—after discarding n-1 consecutive bits. If we require only uniformity, Randomized Integer-Division is twice as fast.

Acknowledgments

This work is supported by NSERC Grants 155967, 261437 and by FQRNT Grant 112381. The authors are grateful to the anonymous reviewers for their significant contributions.

References (25)

  • Fürer, M., 2007. Faster integer multiplication. In: STOC ’07, pp....
  • Gibbons, P.B., Tirthapura, S., 2001. Estimating simple functions on the union of data streams. In: SPAA’01, pp....