Elsevier

Information Systems

Volume 88, February 2020, 101449

Bitmap filter: Speeding up exact set similarity joins with bitwise operations

https://doi.org/10.1016/j.is.2019.101449

Highlights

  • A novel low overhead filter (Bitmap Filter) for the exact set similarity join.

  • We improved four state-of-the-art algorithms by up to 4.50× (1.35× on average).

  • Bitmap Filter can be used in candidate generation or verification stages.

  • Bitmap Filter was also implemented on GPUs, achieving speedups of up to 577×.

Abstract

The Exact Set Similarity Join problem aims to find all similar sets between two collections of sets, with respect to a threshold and a similarity function such as Overlap, Jaccard, Dice or Cosine. The naïve approach verifies all pairs of sets and is often considered impractical due to the high number of combinations. Exact Set Similarity Join algorithms are therefore usually based on the Filter-Verification Framework, which applies a series of filters to reduce the number of verified pairs. This paper presents a new filtering technique called Bitmap Filter, which is able to accelerate state-of-the-art algorithms for the Exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create bitmaps of a fixed size of b bits, representing characteristics of the sets. Then, it applies bitwise operations (such as xor and population count) on the bitmaps in order to infer a similarity upper bound for each pair of sets. If the upper bound is below a given similarity threshold, the pair of sets is pruned. The Bitmap Filter benefits from the fact that bitwise operations are efficiently implemented by many modern general-purpose processors, and it was easily applied to four state-of-the-art CPU algorithms: AllPairs, PPJoin, AdaptJoin and GroupJoin. Furthermore, we propose a Graphics Processing Unit (GPU) algorithm based on the naïve approach but using the Bitmap Filter to speed up the computation. The experiments considered 12 collections containing from 100 thousand up to 10 million sets, and the joins were made using Jaccard thresholds from 0.50 to 0.95. The Bitmap Filter was able to improve 85% of the experiments on CPU, with speedups of up to 4.50× and 1.35× on average. Using the GPU algorithm, the experiments achieved speedups of up to 577× over the original CPU algorithms using an Nvidia Geforce GTX 980 Ti.

Introduction

Data analysts often need to identify similar records stored in multiple data collections. This operation is very common for data cleaning [1], element clustering [2] and record linkage [3], [4]. In some scenarios the similarity analysis aims to detect near duplicate records [5], in which slightly different data representations may be caused by input errors, misspelling or use of synonyms [6]. In other scenarios, the goal is to find patterns or common behaviors between real entities, such as purchase patterns [7] and similar user interests [8].

Set Similarity Join is the operation that identifies all similar sets between two collections of sets (or inside the same collection) [7], [9]. In this case, each record is represented as a set of elements (tokens). For example, the text record “Data Warehouse” may be represented as a set of two words {Data, Warehouse} or a set of bigrams {Da, at, ta, a_, _W, Wa, ar, re, eh, ho, ou, us, se}. Two records are considered similar according to a similarity function, such as Overlap, Dice, Cosine or Jaccard [10], [11]. For instance, two records may be considered similar if the overlap (intersection) of their bigrams is greater than or equal to 6, as is the case for “Databases” and “Databazes”.
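The bigram example above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `bigrams` and `overlap` are ours, and writing the space as `_` simply mirrors the notation used in the example.

```python
def bigrams(text: str) -> set:
    """Return the set of character bigrams of text, writing spaces as '_'."""
    normalized = text.replace(" ", "_")
    return {normalized[i:i + 2] for i in range(len(normalized) - 1)}


def overlap(r: set, s: set) -> int:
    """Overlap similarity: the number of tokens shared by both sets."""
    return len(r & s)


# "Data Warehouse" yields the 13 bigrams listed in the text.
print(sorted(bigrams("Data Warehouse")))

# "Databases" and "Databazes" share Da, at, ta, ab, ba and es.
print(overlap(bigrams("Databases"), bigrams("Databazes")))  # prints 6
```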

Many different solutions have been proposed in the literature for Set Similarity Join [12], [1], [13], [5], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. With respect to the guarantee that all similar pairs will be found, these solutions are divided into exact approaches [12], [1], [13], [5], [14], [15], [22], [23], [24] and approximate approaches [16], [17], [18], [19], [20], [21]. The approximate algorithms usually rely on the Locality-Sensitive Hashing (LSH) technique [25] and are often very competitive, but they are outside the scope of this paper, which considers only exact approaches.

The solutions for Exact Set Similarity Join are usually based on the Filter-Verification Framework [7], [24], which is divided into two stages: (a) the candidate generation uses a series of filters to produce a reduced list of candidate pairs; (b) the verification stage verifies each candidate pair in order to check which ones are similar, with respect to the selected similarity function and threshold.
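The two stages can be sketched as a small self-join in Python. This is a deliberately simplified illustration and not the structure of any of the cited algorithms: it enumerates all pairs instead of using an index, and the names `self_join`, `jaccard` and `shares_a_token` are hypothetical. A filter is modelled as a function that returns True when a pair must be kept.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

TokenSet = FrozenSet[int]
Filter = Callable[[TokenSet, TokenSet], bool]  # True means "keep the pair"


def jaccard(r: TokenSet, s: TokenSet) -> float:
    """Jaccard similarity |r & s| / |r | s| (0.0 for two empty sets)."""
    union = len(r | s)
    return len(r & s) / union if union else 0.0


def shares_a_token(r: TokenSet, s: TokenSet) -> bool:
    """Toy filter: disjoint sets can never reach a positive threshold."""
    return not r.isdisjoint(s)


def self_join(sets: List[TokenSet], threshold: float,
              filters: Iterable[Filter]) -> List[Tuple[int, int]]:
    filters = list(filters)
    # Stage (a): candidate generation. A pair survives only if every filter keeps it.
    candidates = [
        (i, j) for i, j in combinations(range(len(sets)), 2)
        if all(keep(sets[i], sets[j]) for keep in filters)
    ]
    # Stage (b): verification with the actual similarity function.
    return [(i, j) for i, j in candidates
            if jaccard(sets[i], sets[j]) >= threshold]


collection = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8, 9})]
print(self_join(collection, 0.5, [shares_a_token]))  # prints [(0, 1)]
```

Because every filter only discards pairs that cannot reach the threshold, the result is the same as verifying all pairs; the filters affect only the running time.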

The filtering stage is commonly based on the Prefix Filter [1], [13] and the Length Filter [12], which are able to prune a considerable number of candidate pairs without compromising the exact result. The Prefix Filter [1], [13] is based on the idea that two similar sets must share at least one element among their initial elements, given a defined element ordering. The Length Filter [12] prunes candidate pairs according to the set sizes, since sets with a large difference in size cannot reach a high similarity. Other filters have also been proposed, such as the Positional Filter [5], which prunes candidate pairs based on the position where a token occurs in the set, and the Suffix Filter [5], which prunes pairs by applying a binary search on the last elements of the sets.
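For the Jaccard function, the Length and Prefix filters are commonly stated with the bounds t·|r| ≤ |s| ≤ |r|/t and a prefix of |r| − ⌈t·|r|⌉ + 1 tokens. The sketch below illustrates these textbook formulas; the function names are ours, and the threshold is passed as a `Fraction` (an implementation choice of this sketch) so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil, floor


def length_bounds(size_r: int, threshold: Fraction) -> tuple:
    """Smallest and largest |s| that can reach Jaccard >= threshold with |r| = size_r."""
    return ceil(threshold * size_r), floor(size_r / threshold)


def prefix_length(size_r: int, threshold: Fraction) -> int:
    """Number of leading tokens, in the global token order, that form the prefix."""
    return size_r - ceil(threshold * size_r) + 1


def passes_prefix_filter(r: list, s: list, threshold: Fraction) -> bool:
    """r and s are token lists sorted by the same global order."""
    prefix_r = set(r[:prefix_length(len(r), threshold)])
    prefix_s = s[:prefix_length(len(s), threshold)]
    return any(token in prefix_r for token in prefix_s)


t = Fraction(4, 5)  # Jaccard threshold 0.8
print(length_bounds(10, t))  # prints (8, 12)
print(prefix_length(10, t))  # prints 3
```

With a threshold of 0.8, a set of 10 tokens can only be similar to sets of 8 to 12 tokens, and only its first 3 tokens need to be compared against the prefix of the other set.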

There are many algorithms in the literature combining the aforementioned filters. AllPairs [13] uses the Prefix and Length filters, PPJoin [5] extends AllPairs with the Positional Filter, and PPJoin+ [5] extends PPJoin with the Suffix Filter. AdaptJoin [14] extends PPJoin using a variable scheme with adaptive prefix lengths, and GroupJoin [15] applies the filters to groups of similar sets, avoiding redundant computations. These algorithms usually rely on the assumption that the verification stage contributes significantly to the overall running time. However, a new verification procedure with an early termination condition was proposed in [7], and it significantly reduced the execution time of the verification phase. As a result, the proportion of execution time spent in the candidate generation phase increased and the simplest filtering schemes now produce the best performance. In [7], the best results were obtained by AllPairs, PPJoin, GroupJoin and AdaptJoin, with AllPairs being the best algorithm in the largest number of experiments. In contrast, PPJoin+ [5], which uses a sophisticated suffix filter technique, was the slowest algorithm in [7].
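To give an idea of what a verification step with early termination can look like, the following sketch merges two sorted token lists and stops as soon as the required overlap can no longer be reached. It is written in the spirit of the condition mentioned above and is not a reproduction of the procedure in [7]; the names `required_overlap` and `verify_overlap` are hypothetical, and the threshold is a `Fraction` to keep the ceiling exact.

```python
from fractions import Fraction
from math import ceil


def required_overlap(size_r: int, size_s: int, threshold: Fraction) -> int:
    """Minimum |r & s| such that Jaccard(r, s) >= threshold."""
    return ceil(threshold / (1 + threshold) * (size_r + size_s))


def verify_overlap(r: list, s: list, required: int) -> bool:
    """Return True iff the sorted token lists r and s share at least `required` tokens."""
    i = j = overlap = 0
    while i < len(r) and j < len(s):
        # Early termination: even if every remaining token matched,
        # the required overlap would not be reached.
        if overlap + min(len(r) - i, len(s) - j) < required:
            return False
        if r[i] == s[j]:
            overlap += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return overlap >= required


need = required_overlap(4, 4, Fraction(1, 2))
print(need)                                            # prints 3
print(verify_overlap([1, 2, 3, 4], [1, 2, 3, 5], need))  # prints True
```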

The state-of-the-art set similarity join algorithms still need to be improved in order to allow faster joins; otherwise, very large collections may not be processed in a reasonable time. Based on their findings on the trade-off between filtering overhead and verification cost, the authors of [7] argued that future filtering techniques should invest in fast and simple candidate generation methods instead of sophisticated filter-based algorithms; otherwise, the filters' effectiveness will not pay off their overhead when compared to the verification procedure.

The main contribution of this paper is a new low overhead filter called Bitmap Filter, which is able to efficiently prune candidate pairs and improve the performance of many state-of-the-art exact set similarity join algorithms such as AllPairs [13], PPJoin [5], AdaptJoin [14] and GroupJoin [15]. Using bitwise operations implemented by most modern processors, the Bitmap Filter is able to speed up the AllPairs, PPJoin, AdaptJoin and GroupJoin algorithms by up to 4.50× (1.43× on average), considering 9 different collections and Jaccard thresholds varying from 0.50 to 0.95. As far as we know, there is no other filter for the Exact Set Similarity Join problem that is based on efficient bitwise operations.

The Bitmap Filter is sufficiently flexible to be used in the candidate generation or verification stages. Furthermore, it can be efficiently implemented in devices such as Graphics Processing Units (GPUs). As a secondary contribution of this paper, we thus propose a GPU implementation of the Bitmap Filter that is able to speed up the join by up to 577× using an Nvidia Geforce GTX 980 Ti card, considering 6 collections of up to 606,770 sets.

The rest of this paper is structured as follows. Section 2 presents the background of the Set Similarity Join problem. Then, Section 3 proposes the Bitmap Filter and explains it in detail. Section 4 shows the design of the proposed CPU and GPU implementations. Section 5 presents the experimental results and Section 6 concludes the paper.

Section snippets

Background

In this section, we discuss the Set Similarity Join problem. First, the problem is formalized (Section 2.1). Then, the Filter-Verification Framework is presented (Section 2.2). After that, some commonly used filters employed in the Filter-Verification Framework are explained (Section 2.3). Finally, state-of-the-art algorithms are described (Section 2.4).

Bitmap filter

In this section we propose the Bitmap Filter, which uses a new bitwise technique capable of speeding up set similarity joins without compromising the exact result. Bitwise operations are widely available in modern microprocessors, and most compilers are able to map them to fast hardware instructions.

First, some preliminaries are given in Section 3.1. Then, Section 3.2 presents how the bitmap is generated. Section 3.3 shows how the overlap upper bound can be calculated using bitmaps.
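The general idea can be illustrated with a minimal sketch. The generation scheme used here is an assumption made for illustration only, since the paper's own scheme (Section 3.2) is not part of this excerpt: each integer token toggles bit `token mod b` of a b-bit word. Under this scheme, tokens common to both sets cancel out in the xor of the two bitmaps, so the Hamming distance never exceeds the number of tokens that differ between the sets, which gives |r ∩ s| ≤ (|r| + |s| − popcount(bitmap_r xor bitmap_s)) / 2. All function names and the default width of 64 bits are hypothetical.

```python
from fractions import Fraction
from math import ceil

B = 64  # bitmap width in bits (assumed; the text only says "fixed b bits")


def make_bitmap(tokens, b: int = B) -> int:
    """Toggle bit (token mod b) for each non-negative integer token of the set."""
    bitmap = 0
    for token in tokens:
        bitmap ^= 1 << (token % b)
    return bitmap


def overlap_upper_bound(size_r: int, bitmap_r: int, size_s: int, bitmap_s: int) -> int:
    """Upper bound on |r & s| from the set sizes and the Hamming distance of the bitmaps."""
    hamming = bin(bitmap_r ^ bitmap_s).count("1")  # population count of the xor
    return (size_r + size_s - hamming) // 2


def bitmap_filter_keeps(r: frozenset, s: frozenset, threshold: Fraction, b: int = B) -> bool:
    """Return False only when the pair certainly cannot reach the Jaccard threshold."""
    required = ceil(threshold / (1 + threshold) * (len(r) + len(s)))
    bound = overlap_upper_bound(len(r), make_bitmap(r, b), len(s), make_bitmap(s, b))
    return bound >= required


r, s = frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5})
print(overlap_upper_bound(len(r), make_bitmap(r), len(s), make_bitmap(s)))  # prints 3
print(bitmap_filter_keeps(r, s, Fraction(4, 5)))  # prints False (pair pruned)
```

In a real join the bitmaps would be computed once per set and stored, so that each pair costs one xor and one population count. Hash collisions only loosen the bound, for example {1} and {65} map to the same bit when b = 64, so the filter may keep a dissimilar pair but never prunes a similar one.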

Implementation details

The Bitmap Filter will be evaluated using CPU and GPU implementations, which are detailed in this section.

Experimental results

In order to create a baseline, we replicated the experiments conducted by [7]. Table 4 presents the characteristics of the chosen collections. All collections were preprocessed and the sets in the collections were sorted by size and, in case of a tie, they were sorted lexicographically by the token ids. We observed that the lexicographical ordering speeds up all the algorithms, as stated by [7].

Fig. 8 shows the set size distribution in the collections. Orkut, netflix, spotify, enron and kosarak

Conclusions

This paper presented a new filtering technique called Bitmap Filter, which is able to reduce the running time of state-of-the-art algorithms that solve the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create binary words of b bits to represent the set of tokens, with reduced dimension. The comparison of two bitmaps using population count and xor operations (Hamming distance) makes it possible to infer the number of different tokens in the original sets, and this difference

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.is.2019.101449.

Acknowledgments

We would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES, Brazil) for the Post-Doctoral Scholarship given to E. Sandes.

References (29)

  • Broder A.Z. et al. Syntactic clustering of the web. Comput. Netw. ISDN Syst. (1997)
  • Sohrabi M.K. et al. Parallel set similarity join on big data based on locality-sensitive hashing. Sci. Comput. Program. (2017)
  • Ribeiro L.A. et al. Generalizing prefix filtering to improve set similarity joins. Inf. Syst. (2011)
  • Chaudhuri S. et al. A primitive operator for similarity joins in data cleaning.
  • Cohen W.W. Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. (2000)
  • Vernica R. et al. Efficient parallel set-similarity joins using MapReduce.
  • Xiao C. et al. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. (2011)
  • Arasu A. et al. Efficient exact set-similarity joins.
  • Mann W. et al. An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. (2016)
  • Spertus E. et al. Evaluating similarity measures: A large-scale study in the orkut social network.
  • Jiang Y. et al. String similarity joins: An experimental evaluation. Proc. VLDB Endow. (2014)
  • Augsten N. et al. Similarity Joins in Relational Database Systems. (2013)
  • Theobald M. et al. Spotsigs: Robust and efficient near duplicate detection in large web collections.
  • Gravano L. et al. Approximate string joins in a database (almost) for free.
