Analysis of Counting Bloom Filters Used for Count Thresholding

Abstract: A bloom filter is an extremely useful tool applicable to various fields of electronics and computers; it enables highly efficient search of extremely large data sets with no false negatives but a possibly small number of false positives. A counting bloom filter is a variant of a bloom filter that is typically used to permit deletions as well as additions of elements to a target data set. However, it is also sometimes useful to use a counting bloom filter as an approximate counting mechanism that can be used, for example, to determine when a specific web page has been referenced more than a specific number of times or when a memory address is a “hot” address. This paper derives, for the first time, highly accurate approximate false positive probabilities and optimal numbers of hash functions for counting bloom filters used in count thresholding applications. The analysis is confirmed by comparisons to existing theoretical results, which show an error, with respect to exact analysis, of less than 0.48% for typical parameter values.


Introduction
A bloom filter (BF) is a powerful tool that can be used to create novel, low-overhead methods for dealing with big data sets in various software/hardware applications. Proposed by Burton Howard Bloom in 1970 [1], a BF is an m-bit vector that is initialized to 0. Assuming that n elements are stored into a data set, each time a new element is stored, k hash functions, each of which maps the element to one of m bit locations, are applied and the corresponding bits in the BF are set to 1. To determine if a new unknown element is a member of the data set or not, the k hash functions are applied to that element and the corresponding bits in the BF are checked. A positive answer is returned if all of those bits are found to be 1. Since the search time is independent of the number of elements in the data set, this results in extremely space-efficient and fast search of large data sets. Although false positives are possible, false negatives are not since the element could not have been stored in the data set if any of the k hash function bits in the BF are 0.
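The insert and lookup operations described above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the class name `BloomFilter`, its method names, and the use of salted SHA-256 digests to emulate k hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """m-bit Bloom filter with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, item):
        # Emulate k hash functions by salting SHA-256 with the index i.
        # This salting scheme is an assumption of the sketch.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the k hashed bits to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Positive only if all k hashed bits are 1; a 0 bit proves absence.
        return all(self.bits[pos] for pos in self._positions(item))
```

A stored element always yields a positive answer, while an element that was never stored may occasionally do so as well, which is the false positive case analyzed later.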
Over the years, various BF variants have been proposed. One commonly cited variant is a counting bloom filter (CBF), which has been proposed as a method that supports deletion as well as addition of elements to a data set [2]. In a CBF, the m elements are multiple-bit counts instead of bits. Every time an element is added to or deleted from the data set, k hash functions are applied to the element and all of the k locations in the m-element CBF are incremented or decremented by one. In order to check if a specific element is still in the data set, the k hashed locations in the CBF can be checked for the absence of 0 values. Several methods have been proposed to enable efficient support of m-element CBFs [3,4].
BFs and CBFs have been found to be useful for numerous applications. For example, in a web cache, instead of storing all web objects that are accessed, disk writes can be significantly reduced by only storing those web objects that are referred to more than once, thereby eliminating "one-hit wonders" (accessed by a set of users once and never again). A BF can be used to quickly determine if a specific web page has been referenced before or not. For long-term usage, a CBF can be used instead of a BF in order to permit deletion of long unused web objects from the web cache. This approach has been found to reduce the rate of disk writes by nearly half in an actual system of servers [5].
Given that the elements in a CBF contain approximate "counts", this paper examines the problem of using a CBF as an approximate counting mechanism, in particular to check whether a certain data element has been referred to θ or more times, where θ is a count threshold. For example, when applied to the above web cache application, θ = 2 could be used to eliminate one-hit and two-hit web objects; i.e., disk write usage could be reduced by only storing web objects that have been accessed two or more times. As a second example, when applied to memory management in a computer system, a value such as θ = 5 could be used to identify hot memory addresses. As a third example, when applied to DNA sequence identification, approximate count thresholding could be used to quickly identify or match specific strings in a DNA sequence that occur more than θ times. As a fourth example, approximate count thresholding could be used to determine if a set of nodes are accessing a given web page a large number of times within a short timespan and thus help guard that web page against distributed denial of service (DDOS) attacks [16].
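As a rough illustration of count thresholding, the sketch below extends the counting idea with a query that asks whether all k hashed counters are at least θ. The class name `CountingBloomFilter`, the method `meets_threshold`, and the salted SHA-256 hashing are hypothetical choices for this example, not constructions taken from the cited work.

```python
import hashlib


class CountingBloomFilter:
    """m-counter CBF with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counts = [0] * m  # all counters start at 0

    def _positions(self, item):
        # Salted SHA-256 stands in for k hash functions (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Record one reference: increment the k hashed counters.
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item):
        # Undo one reference; only valid for items that were added before.
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def meets_threshold(self, item, theta):
        # True if every hashed counter is >= theta. An item referenced
        # theta times always passes; other items may pass by coincidence.
        return all(self.counts[pos] >= theta for pos in self._positions(item))
```

For instance, calling `add("page")` on each access and then `meets_threshold("page", 2)` before a disk write would correspond to the web cache example with θ = 2.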
The main motivation for this paper is to provide a solid theoretical foundation for the use of CBFs for count thresholding applications. Towards this end, Section 2 introduces the traditional BF and CBF analysis. Then Section 3 presents the newly proposed CBF analysis method and its main results. Next, Section 4 uses comparisons to existing theoretical analysis to confirm this new analysis method. Finally, Section 5 concludes this paper.

Previous Bloom Filter (BF) and Counting Bloom Filter (CBF) Analysis
The traditional BF analysis method proceeds as follows [17,18]. For insertion of n elements into a data set, an m-bit BF, initialized with all 0 bits, is used. Every time an element is inserted into the data set, k hash functions are applied and the corresponding bits in the m-bit BF are set to 1. After the first element is inserted into the data set and the first hash function is used to set one bit of the BF, an arbitrary bit in the BF is 0 with probability $(1 - 1/m)$. After all k hash functions are used, an arbitrary bit in the BF is still 0 with probability $(1 - 1/m)^{k}$. Thus, after all n elements are inserted and all k hash functions applied to each of those n elements, the probability that an arbitrary bit in the BF is 0 is $p_0 = (1 - 1/m)^{kn} \approx e^{-kn/m}$, where the approximation is based on the definition of e [17,18].
If a user wishes to determine whether an element is present in the data set, the k hash functions are applied and all k bits in the BF are checked. If an element is not in the data set, an erroneous result is produced when all k hashed bits in the BF are 1. Using this as an approximation, the false positive probability $p^{trad}_{fp}$ is

$$p^{trad}_{fp} = (1 - p_0)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$

An important BF parameter that must be selected is the number of hash functions, k, to be used. For this purpose, the k value that is typically used is the one that minimizes the false positive probability. Thus, based on the above approximation,

$$k^{trad}_{opt} = \frac{m}{n}\ln 2. \qquad (1)$$

The careful reader will note that the above analysis is not strictly correct as it assumes independence of the values of bits in the BF, even if the k hashed locations are for an element in the data set [11]. However, using Chernoff bounds, Mitzenmacher and Upfal have shown that, for large m and n, the same result is obtained even without the independence assumption [19]. Therefore, as can be easily verified by the reader, given sufficiently large n and m (e.g., n = 25 and m = 100 or larger), the above equations are highly accurate and k can be chosen based on Equation (1).
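A small numerical check can make the traditional analysis concrete. The sketch below assumes the standard forms $(1 - e^{-kn/m})^k$ for the approximate false positive probability and $(m/n)\ln 2$ for the minimizing k; the function names are hypothetical and chosen only for this example.

```python
import math


def p_zero_exact(k, n, m):
    # Probability that an arbitrary bit is still 0 after kn hash mappings.
    return (1.0 - 1.0 / m) ** (k * n)


def fp_traditional(k, n, m):
    # Approximate BF false positive probability: (1 - e^{-kn/m})^k.
    return (1.0 - math.exp(-k * n / m)) ** k


def k_opt_traditional(n, m):
    # Real-valued k that minimizes the approximation above.
    return (m / n) * math.log(2)


# Example with the small parameters mentioned in the text (n = 25, m = 100).
n, m = 25, 100
best_integer_k = min(range(1, 10), key=lambda k: fp_traditional(k, n, m))
```

With n = 25 and m = 100 the real-valued optimum is about 2.77, and scanning integer values of k selects k = 3, which is one of the two integers adjacent to that optimum.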
Using the same assumptions as [18,19], the above analysis can be extended to a CBF. To do this, it is noted that after n elements have been inserted into a data set and kn uniformly random hash mappings have been used to increment the values in an m-element CBF, the probability that an arbitrary element in the CBF has the value l is simply defined by the probability mass function (pmf) of a binomial distribution with success probability 1/m. Denoting this as $b(l, kn, \frac{1}{m})$,

$$b\left(l, kn, \tfrac{1}{m}\right) = \binom{kn}{l}\left(\frac{1}{m}\right)^{l}\left(1 - \frac{1}{m}\right)^{kn-l}.$$

Note that when l = 0, which corresponds to checking whether an arbitrary element in the CBF is 0, $b(0, kn, \frac{1}{m}) = (1 - 1/m)^{kn} = p_0$.

Suppose an m-element CBF is used to determine if an element has been referenced θ or more times. After insertion of n elements in a data set, the probability that an arbitrary element of the CBF has a value less than θ is simply the sum of $b(l, kn, \frac{1}{m})$ from l = 0 to l = θ − 1. Thus, the false positive probability with count threshold θ is as follows.

$$p_{fp}(\theta, k, n, m) = \left(1 - \sum_{l=0}^{\theta-1} b\left(l, kn, \tfrac{1}{m}\right)\right)^{k} \qquad (2)$$
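The threshold false positive probability described here, one minus the binomial sum for l = 0 to θ − 1, raised to the power k, can be evaluated directly. The sketch below is one possible way to do so; it computes each binomial term in log space via `math.lgamma` so that large values of kn do not overflow. The function names are hypothetical.

```python
import math


def binom_pmf(l, trials, p):
    # b(l, trials, p) computed in log space to avoid overflow for large trials.
    log_coeff = (math.lgamma(trials + 1) - math.lgamma(l + 1)
                 - math.lgamma(trials - l + 1))
    return math.exp(log_coeff + l * math.log(p)
                    + (trials - l) * math.log1p(-p))


def fp_exact(theta, k, n, m):
    # Probability that a single counter is below theta after kn mappings.
    below = sum(binom_pmf(l, k * n, 1.0 / m) for l in range(theta))
    # A false positive needs all k hashed counters to be >= theta.
    return (1.0 - below) ** k
```

For θ = 1 this reduces to the traditional BF expression $(1 - (1 - 1/m)^{kn})^k$, and raising θ can only lower the result because more terms are subtracted.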

Proposed Analysis and Results
Although clearly useful for certain applications such as web data caching, determination of hot memory addresses, string matching in DNA sequence analysis, and protection against DDOS attacks, there has been no previous detailed theoretical analysis of CBFs used for count thresholding in the open literature (previous papers have only dealt with CBFs used to permit deletions of elements in large data sets). Such an analysis is necessary in order to be able to predict the effectiveness of a CBF solution and the specific CBF parameters to use. For example, m, the number of CBF elements to be used, must be selected such that the resulting false positive probability level is acceptable for the chosen application. In addition, k, the number of hash functions to be used, must be selected to minimize the false positive probability. This type of analysis is provided in this section.

False Positive Probability
The analysis starts with a derivation of a close approximation for the false positive probability, which is necessary since the exact form given in Equation (2) involves a sum of binomial distributions, which is extremely difficult and time-consuming to compute for large n and m values. For large $x_n$ and small $x_p$, it is well known that a binomial distribution $b(x, x_n, x_p)$ can be approximated by a Poisson distribution with mean $x_n x_p$ [20]. For CBF applications, large n (data set size) and m (CBF size) values satisfy these conditions since $x_n = kn$ and $x_p = \frac{1}{m}$. Thus, the approximate false positive probability $\hat{p}_{fp}$ can be written as follows.

$$\hat{p}_{fp}(\theta, k, n, m) = \left(1 - e^{-kn/m}\sum_{l=0}^{\theta-1}\frac{(kn/m)^{l}}{l!}\right)^{k} \qquad (3)$$
The cumulative mass function (cmf) of a Poisson distribution is a regularized incomplete gamma function [21]. Thus, the approximate false positive probability can be written as

$$\hat{p}_{fp}(\theta, k, n, m) = \left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta, 0)}\right)^{k} \qquad (4)$$

where the mean of the Poisson distribution used is defined as $\kappa = \frac{kn}{m}$ and $\Gamma(\theta, \kappa) = \int_{\kappa}^{\infty} t^{\theta-1} e^{-t}\,dt$. As shown in Figure 1, this incomplete Gamma function approximation results in a highly accurate approximation of $p_{fp}$. Note that, for a given k, the approximation $\hat{p}_{fp}$ depends on n and m only through the ratio of kn to m. Figure 1 shows that the curves for $p_{fp}$ and $\hat{p}_{fp}$ are nearly indistinguishable. The exact relative error of $\hat{p}_{fp}$ is shown in Figure 2.
For the parameters shown, the relative error is less than 0.48% when an optimal number of hash functions is used. The optimal k values $k_{opt}(\theta)$, for which the false positive probabilities are the lowest, are shown using a dashed cyan line in Figure 1. As can be seen in the figure, when θ > 1, $k_{opt}(\theta)$ is not the same as, or even close to, $k^{trad}_{opt}(\theta) = k_{opt}(1)$, shown as a solid vertical orange line in Figure 1. Before proposing a systematic method for finding $k_{opt}(\theta)$ for general values of θ, a rigorous analysis will be presented that shows that only one such value exists.
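The quality of the Poisson approximation can be explored numerically with a short script. The sketch below is only an illustration under stated assumptions: it evaluates the exact binomial form with `math.comb` (suitable for moderate kn only), evaluates the Poisson form as a finite sum rather than through an incomplete gamma routine, and uses hypothetical function names.

```python
import math


def fp_exact(theta, k, n, m):
    # Exact form: (1 - sum_{l < theta} b(l, kn, 1/m))^k, for moderate kn.
    trials, p = k * n, 1.0 / m
    below = sum(math.comb(trials, l) * p ** l * (1 - p) ** (trials - l)
                for l in range(theta))
    return (1.0 - below) ** k


def fp_poisson(theta, k, n, m):
    # Poisson form with mean kappa = kn/m; the finite sum equals the
    # regularized incomplete gamma ratio for integer theta.
    kappa = k * n / m
    term, cdf = math.exp(-kappa), 0.0
    for l in range(theta):
        cdf += term                 # add P(X = l)
        term *= kappa / (l + 1)     # advance to P(X = l + 1)
    return (1.0 - cdf) ** k


def relative_error(theta, k, n, m):
    exact = fp_exact(theta, k, n, m)
    return abs(fp_poisson(theta, k, n, m) - exact) / exact
```

Calling `relative_error(theta, k, n, m)` over a grid of k values gives a rough analogue of an error curve; the specific error levels quoted in the text refer to the paper's own parameters and figures, not to this sketch.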

Uniqueness of Optimal Number of Hash Functions
A sequence of lemmas is used to prove that there exists a unique value of $k_{opt}$ for which the false positive probability is minimized. To follow this proof process, the reader is advised to refer to the plots in Figures 3 and 4 when reading the following lemmas.
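Since the optimal false positive probability point occurs when its slope is 0, the proof starts by taking the derivative of $\hat{p}_{fp}(\theta, k, n, m)$ with respect to k. To find the shape of the derivative of $\hat{p}_{fp}$, the logarithm of $\hat{p}_{fp}$ can be used.

$$\frac{\partial \ln \hat{p}_{fp}}{\partial k} = \frac{1}{\hat{p}_{fp}}\frac{\partial \hat{p}_{fp}}{\partial k} = \ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta, 0)}\right) - \frac{k}{\Gamma(\theta, 0) - \Gamma(\theta, \kappa)}\,\frac{\partial \Gamma(\theta, \kappa)}{\partial k} \qquad (5)$$

By Leibniz's rule and the definition of the incomplete gamma function [21],

$$\frac{\partial \Gamma(\theta, \kappa)}{\partial k} = -\frac{n}{m}\,\kappa^{\theta-1} e^{-\kappa}. \qquad (6)$$

Therefore, by applying Equation (6) to the right side of Equation (5) and multiplying both sides of Equation (5) by $\hat{p}_{fp}(\theta, k, n, m)$,

$$\frac{\partial \hat{p}_{fp}}{\partial k} = \hat{p}_{fp}\left[\ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta, 0)}\right) + \frac{\kappa^{\theta} e^{-\kappa}}{\Gamma(\theta, 0) - \Gamma(\theta, \kappa)}\right].$$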
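For $k_{opt}(\theta)$, this derivative should be set to 0. Since $\hat{p}_{fp} > 0$, the second part must be 0. Then, multiplying this second part by the common factor $\Gamma(\theta, 0) - \Gamma(\theta, \kappa)$ and denoting this term as g(θ, κ), the following equations and lemmas follow.

$$g(\theta, \kappa) = \left(\Gamma(\theta, 0) - \Gamma(\theta, \kappa)\right)\ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta, 0)}\right) + \kappa^{\theta} e^{-\kappa}$$

To determine whether g is a decreasing or increasing function, the derivative of g is needed.

$$\frac{\partial g(\theta, \kappa)}{\partial \kappa} = \kappa^{\theta-1} e^{-\kappa}\left[1 + \theta + \ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta, 0)}\right) - \kappa\right] \qquad (7)$$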
In Equation (7), the first part is greater than 0. Thus, g is a decreasing or increasing function depending on the polarity of $1 + \theta + \ln\left(1 - \frac{\Gamma(\theta,\kappa)}{\Gamma(\theta,0)}\right) - \kappa$. Let $y_1(\kappa) = \ln\left(1 - \frac{\Gamma(\theta,\kappa)}{\Gamma(\theta,0)}\right)$ and $y_2(\kappa) = \kappa - \theta - 1$. The $y_1$ and $y_2$ terms are defined in this manner in order to facilitate the examination of the exact conditions under which g is a decreasing or increasing function, and thereby determine the conditions for the changes in slope of the false positive probability function. Then, g is increasing where $y_1(\kappa) > y_2(\kappa)$ and decreasing where $y_1(\kappa) < y_2(\kappa)$. Examples of the shapes of $y_1$ and $y_2$ are shown in Figure 3. An example of the g(θ, κ) function is shown in Figure 4.

Proof of Lemma 1. Using the definition of an incomplete Gamma function, the first partial derivative of $y_1$ can be shown to be greater than zero, i.e.,

$$\frac{\partial y_1}{\partial \kappa} = \frac{\kappa^{\theta-1} e^{-\kappa}}{\Gamma(\theta, 0) - \Gamma(\theta, \kappa)} > 0.$$

Then, using the second partial derivative, it can also be verified that $y_1$ is concave when κ ≥ θ − 1.
Now consider the situation when κ < θ − 1. The following are well-known properties of a Poisson distribution with mean κ, denoted by $\mathrm{Poi}_{\kappa}(X = \theta)$ [22], Then, based on [21] and applying Equation (9) (8), Thus, Therefore, $y_1$ is a strictly increasing concave function for all values of κ.
Proof of Lemma 2. Lemma 2 is equivalent to From [21], by putting $\frac{\theta+1}{2}$ into the ln(·) function, Then, by the Stirling inequality, The function on the right-hand side decreases from θ = 0 to 1 and increases from θ = 1 to ∞.
The minimum value of this function is 1, which occurs at some point θ with θ > 0. Thus,

Procedure for Determining the Optimal Number of Hash Functions
Based on the lemmas and theorem of the previous subsection, a general procedure to be used to find the optimal number of hash functions $k_{opt}(\theta)$ is as follows. Start from k = 1 and compute the false positive probability $\hat{p}^{*}_{fp}(\theta, k, n, m)$ using Equation (3). Then increment k by one and recompute the false positive probability. Continue until the false positive probability starts to increase or k reaches (θ + 1)m/n (i.e., κ = θ + 1), whichever comes first. The k value that results in the minimum $\hat{p}^{*}_{fp}(\theta, k, n, m)$ is $k_{opt}(\theta)$.

The above procedure can be simplified by using precomputed tables or a linear approximation. Table 1 lists precomputed optimal $\kappa^{*}$ values for count thresholds θ ranging from 1 to 30. This table was created by following the procedure outlined above. Since κ = kn/m, this table can be used to determine $k_{opt}$ by simply using the relationship shown in Theorem 1; i.e., $k_{opt}$ is either the floor or ceiling of $\kappa^{*} m/n$.
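The search procedure can be sketched as follows. Several points are assumptions of this sketch rather than details confirmed by the text: the Poisson form of the false positive probability is evaluated as a finite sum, the stopping bound is taken as κ = θ + 1 (that is, k = (θ + 1)m/n), and the continuous optimum is located by bisection on the derivative of κ ln P(X ≥ θ), with a bracket that was only checked for small θ. All function names are hypothetical.

```python
import math


def poisson_tail(theta, kappa):
    # P(X >= theta) for X ~ Poisson(kappa).
    term, cdf = math.exp(-kappa), 0.0
    for l in range(theta):
        cdf += term
        term *= kappa / (l + 1)
    return 1.0 - cdf


def fp_approx(theta, k, n, m):
    # Approximate false positive probability with kappa = kn/m.
    return poisson_tail(theta, k * n / m) ** k


def find_k_opt(theta, n, m):
    # Scan k = 1, 2, ... until the probability stops decreasing or the
    # assumed bound k = (theta + 1) * m / n is reached.
    k_max = max(1, math.floor((theta + 1) * m / n))
    best_k, best_p = 1, fp_approx(theta, 1, n, m)
    for k in range(2, k_max + 1):
        p = fp_approx(theta, k, n, m)
        if p >= best_p:
            break
        best_k, best_p = k, p
    return best_k, best_p


def kappa_star(theta, iterations=60):
    # Continuous optimum: zero of tail * d/dkappa [kappa * ln(tail)].
    def slope(kappa):
        tail = poisson_tail(theta, kappa)
        return tail * math.log(tail) + math.exp(
            theta * math.log(kappa) - kappa - math.lgamma(theta))

    lo, hi = 0.1 * theta, theta + 1.0  # bracket assumed valid for small theta
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if slope(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For θ = 1 the continuous optimum reduces to ln 2, matching the traditional result, and with m/n = 4 the integer scan returns k = 3 for θ = 1 and k = 4 for θ = 2, each adjacent to the corresponding value of κ* m/n.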

Simulation Results
Simulations were conducted to verify the proposed theoretical analysis. A simulation program was written in Java for a general CBF with n data entries, m CBF elements, and k hash functions. The hash functions were created as uniform random distributions between 0 and m − 1 using the pseudorandom number generator provided by the java.util.Random class and stored in tables so that they could be reused during hashing. Care was taken to ensure that the hash functions created were orthogonal to each other. This simulator program is freely available for all interested readers.

Each simulation result, which was the average of 100 simulation runs, was obtained in the following manner. The CBF was initialized by setting all m CBF entries to 0. Then, n data entries were generated randomly. For each data entry, the k hash functions were applied by looking up the precomputed random hash tables described above, and the CBF entries corresponding to the hash function outputs were incremented. Next, $n_q = n$ queries were randomly generated and the k hash functions were applied to each of those queries. Each query resulted in a "true" answer if all k CBF elements mapped by the k hash functions were greater than or equal to the count threshold θ. Finally, the number of false positives generated in this manner was counted and divided by $n_q$ to produce the false positive probability.

Figure 7 shows the false positive rate simulation and analysis results, as a function of k, with n = 10 million, m = 40 million, and θ values ranging from 1 to 5. As can be seen from the figure, the simulation results closely match the theoretical results, with slight variations only visible for exceedingly low false positive probabilities of $10^{-7}$ or smaller. Exceedingly low false positive probabilities imply rare occurrences of false positives, thus requiring longer simulation runs to obtain accurate results.
Even lower false positive probabilities (smaller than $10^{-9}$) result in zero occurrences of false positive events in our simulations. Thus, no simulation results were recorded for θ = 5 and k = 4 through 10 in Figure 7, since no false positive events occurred.
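A much smaller Monte Carlo experiment in the same spirit can be sketched in Python. This is not the authors' Java simulator: it replaces the table-based hash functions with fresh uniform draws from a seeded `random.Random`, treats every query as an element that was never inserted (so every positive answer is a false positive), and uses hypothetical function names and parameter sizes.

```python
import math
import random


def simulate_fp(theta, k, n, m, n_queries, seed=0):
    rng = random.Random(seed)
    counters = [0] * m
    # Insert n elements: each one increments k uniformly random counters.
    for _ in range(n * k):
        counters[rng.randrange(m)] += 1
    # Query elements that were never inserted; count threshold hits.
    false_positives = 0
    for _ in range(n_queries):
        if all(counters[rng.randrange(m)] >= theta for _ in range(k)):
            false_positives += 1
    return false_positives / n_queries


def fp_poisson(theta, k, n, m):
    # Poisson-based approximation used for comparison.
    kappa = k * n / m
    term, cdf = math.exp(-kappa), 0.0
    for l in range(theta):
        cdf += term
        term *= kappa / (l + 1)
    return (1.0 - cdf) ** k


simulated = simulate_fp(theta=1, k=3, n=2000, m=8000, n_queries=20000, seed=1)
predicted = fp_poisson(theta=1, k=3, n=2000, m=8000)
```

At this scale only probabilities well above 1/n_queries can be estimated meaningfully, which mirrors the observation above that very small false positive probabilities require much longer runs.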

Conclusions
This paper has investigated the problem of determining the optimal parameter values to be used for counting bloom filters used in applications requiring approximate count thresholds. Rigorous analysis has led to a highly accurate equation for the false positive probability, with relative errors of less than 0.48% given typical parameter values. It has also been proven that there exists a unique number of hash functions $k_{opt}$ for which a minimal false positive probability is obtained. Next, a systematic procedure based on precomputed tables and a linear approximation has been presented for finding $k_{opt}$. Finally, realistic simulations modeling the use of a CBF for a count thresholding application have been used to show that the theoretical analysis closely models actual CBF behavior.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

BF    Bloom filter
CBF   Counting bloom filter
pmf   Probability mass function
cmf   Cumulative mass function