Bitmap filter: Speeding up exact set similarity joins with bitwise operations
Introduction
Data analysts often need to identify similar records stored in multiple data collections. This operation is very common for data cleaning [1], element clustering [2] and record linkage [3], [4]. In some scenarios the similarity analysis aims to detect near duplicate records [5], in which slightly different data representations may be caused by input errors, misspelling or use of synonyms [6]. In other scenarios, the goal is to find patterns or common behaviors between real entities, such as purchase patterns [7] and similar user interests [8].
Set Similarity Join is the operation that identifies all similar sets between two collections of sets (or inside the same collection) [7], [9]. In this case, each record is represented as a set of elements (tokens). For example, the text record “Data Warehouse” may be represented as a set of two words {Data, Warehouse} or a set of bigrams {Da, at, ta, a_, _W, Wa, ar, re, eh, ho, ou, us, se}. Two records are considered similar according to a similarity function, such as Overlap, Dice, Cosine or Jaccard [10], [11]. For instance, two records may be considered similar if the overlap (intersection) of their bigrams is greater than or equal to 6, such as in “Databases” and “Databazes”.
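The bigram example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `bigrams`, `overlap` and `jaccard` are ours, and spaces are rendered as `_` to match the notation used in the text.

```python
def bigrams(text):
    """Return the set of character bigrams of `text`, writing spaces as '_'."""
    s = text.replace(" ", "_")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def overlap(r, s):
    """Overlap similarity: number of tokens shared by the two sets."""
    return len(r & s)


def jaccard(r, s):
    """Jaccard similarity: shared tokens divided by distinct tokens overall."""
    return len(r & s) / len(r | s)


a = bigrams("Databases")
b = bigrams("Databazes")
print(sorted(a & b))                 # the shared bigrams
print(overlap(a, b), jaccard(a, b))  # overlap and Jaccard of the two records
```

Both records yield 8 bigrams and share 6 of them, so the pair satisfies an overlap threshold of 6.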
Many different solutions have been proposed in the literature for Set Similarity Join [12], [1], [13], [5], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. With respect to the guarantee that all similar pairs will be found, these solutions are divided into exact approaches [12], [1], [13], [5], [14], [15], [22], [23], [24] and approximate approaches [16], [17], [18], [19], [20], [21]. The approximate algorithms usually rely on the Locality-Sensitive Hashing (LSH) technique [25] and are often very competitive, but they are outside the scope of this paper, since only exact approaches are considered here.
The solutions for Exact Set Similarity Join are usually based on the Filter-Verification Framework [7], [24], which is divided into two stages: (a) the candidate generation uses a series of filters to produce a reduced list of candidate pairs; (b) the verification stage verifies each candidate pair in order to check which ones are similar, with respect to the selected similarity function and threshold.
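The two stages can be sketched as follows. This is a deliberately naive illustration and not any of the cited algorithms: it enumerates all pairs with nested loops, whereas practical implementations generate candidates through an inverted index. The function names, the `filters` parameter and the example `size_ratio_filter` are assumptions introduced here, and the sets are assumed to be non-empty.

```python
def jaccard(r, s):
    """Jaccard similarity of two non-empty sets."""
    return len(r & s) / len(r | s)


def size_ratio_filter(r, s, threshold):
    """Example filter: keep the pair only if the set sizes are close enough."""
    small, large = sorted((len(r), len(s)))
    return small >= threshold * large


def similarity_self_join(sets, threshold, filters=()):
    """Return all index pairs (i, j), i < j, with Jaccard >= threshold."""
    result = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            # (a) Candidate generation: any filter may discard the pair.
            if not all(keep(sets[i], sets[j], threshold) for keep in filters):
                continue
            # (b) Verification: compute the actual similarity.
            if jaccard(sets[i], sets[j]) >= threshold:
                result.append((i, j))
    return result


collection = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}]
print(similarity_self_join(collection, 0.5, filters=(size_ratio_filter,)))
```

A filter is safe for an exact join only if it never discards a pair that would pass verification, so the output with and without filters must be identical.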
The filtering stage is commonly based on the Prefix Filter [1], [13] and the Length Filter [12], which are able to prune a considerable number of candidate pairs without compromising the exact result. The Prefix Filter [1], [13] is based on the idea that two similar sets must share at least one element among their initial elements, given a defined element ordering. The Length Filter [12] prunes candidate pairs according to the lengths of their sets, considering that sets with a large difference in length will have low similarity. Other filters have also been proposed, such as the Positional Filter [5], which prunes candidate pairs based on the position where the token occurs in the set, and the Suffix Filter [5], which prunes pairs by applying a binary search on the last elements of the sets.
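For the Jaccard function with threshold t, the standard formulation of these two filters uses a prefix of |r| − ⌈t·|r|⌉ + 1 tokens and requires the smaller set to contain at least t times as many tokens as the larger one. The sketch below illustrates both checks; the function names and the small `EPS` tolerance against floating-point noise are our own choices, and each set is assumed to be a list of tokens sorted by the same global order.

```python
from math import ceil

EPS = 1e-9  # tolerance so that, e.g., 0.8 * 5 is never rounded up to 5 tokens


def prefix_length(size, threshold):
    """Number of leading tokens that must be inspected for a set of `size`."""
    return size - ceil(threshold * size - EPS) + 1


def length_compatible(r, s, threshold):
    """Length Filter: the smaller set needs at least threshold * |larger| tokens."""
    small, large = sorted((len(r), len(s)))
    return small >= threshold * large - EPS


def prefixes_share_token(r, s, threshold):
    """Prefix Filter: similar sets must share a token within their prefixes."""
    prefix_r = set(r[:prefix_length(len(r), threshold)])
    prefix_s = s[:prefix_length(len(s), threshold)]
    return any(token in prefix_r for token in prefix_s)


r = [1, 2, 3, 4, 5]
s = [3, 4, 5, 6, 7]
print(prefix_length(len(r), 0.8))         # prefix size for |r| = 5, t = 0.8
print(prefixes_share_token(r, s, 0.8))    # whether the pair survives the filter
```

A pair that survives both checks is only a candidate: it may still fail verification, but a pair that fails either check can never reach the threshold.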
There are many algorithms in the literature combining the aforementioned filters. AllPairs [13] uses the Prefix and Length filters, PPJoin [5] extends AllPairs with the Positional Filter, and PPJoin+ [5] extends PPJoin with the Suffix Filter. AdaptJoin [14] extends PPJoin using a variable schema with adaptive prefix lengths, and GroupJoin [15] applies the filters to groups of similar sets, avoiding redundant computations. These algorithms usually rely on the assumption that the verification stage contributes significantly to the overall running time. However, a new verification procedure using an early termination condition was proposed in [7], which significantly reduced the execution time of the verification phase. As a result, the proportion of execution time spent in the candidate generation phase increased and, as a consequence, the simplest filtering schemes are now able to produce the best performance. In [7], the best results were obtained by AllPairs, PPJoin, GroupJoin and AdaptJoin, with AllPairs being the best algorithm in the largest number of experiments. In contrast, PPJoin+ [5], which uses a sophisticated suffix filter technique, was the slowest algorithm in [7].
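The idea of verification with early termination can be illustrated with a merge-style intersection of two sorted token lists that stops as soon as the required overlap can no longer be reached. This is a simplified sketch written for this discussion, not the exact procedure of [7]; the function name `verify` and its parameters are assumptions.

```python
def verify(r, s, required_overlap):
    """Return True iff the sorted token lists r and s share at least
    `required_overlap` tokens, stopping early when that is impossible."""
    i = j = shared = 0
    while i < len(r) and j < len(s):
        # Early termination: even if every remaining token matched,
        # the required overlap could not be reached.
        if shared + min(len(r) - i, len(s) - j) < required_overlap:
            return False
        if r[i] == s[j]:
            shared += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return shared >= required_overlap


print(verify([1, 2, 3, 4], [3, 4, 5, 6], 2))
print(verify([1, 2, 3, 4], [3, 4, 5, 6], 3))
```

Because dissimilar pairs are rejected after inspecting only a few tokens, verification becomes cheap, which is why an expensive filter may no longer pay off.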
The state-of-the-art set similarity join algorithms still need to be improved in order to allow a faster join; otherwise, very large collections may not be processed in a reasonable time. Based on their findings on the overhead trade-off between filtering and verification, the authors of [7] claimed that future filtering techniques should invest in fast and simple candidate generation methods instead of sophisticated filter-based algorithms; otherwise, the filters' effectiveness will not pay off their overhead when compared to the verification procedure.
The main contribution of this paper is a new low-overhead filter called Bitmap Filter, which is able to efficiently prune candidate pairs and improve the performance of many state-of-the-art exact set similarity join algorithms, such as AllPairs [13], PPJoin [5], AdaptJoin [14] and GroupJoin [15]. Using bitwise operations implemented by most modern processors, the Bitmap Filter is able to speed up the AllPairs, PPJoin, AdaptJoin and GroupJoin algorithms by up to 4.50× (1.43× on average), considering 9 different collections and Jaccard thresholds varying from 0.50 to 0.95. As far as we know, there is no other filter for the Exact Set Similarity Join problem that is based on efficient bitwise operations.
The Bitmap Filter is sufficiently flexible to be used in the candidate generation or verification stages. Furthermore, it can be efficiently implemented on devices such as Graphics Processing Units (GPUs). As a secondary contribution of this paper, we thus propose a GPU implementation of the Bitmap Filter that is able to speed up the join by up to using an NVIDIA GeForce GTX 980 Ti card, considering 6 collections of up to 606,770 sets.
The rest of this paper is structured as follows. Section 2 presents the background of the Set Similarity Join problem. Then, Section 3 proposes the Bitmap Filter and explains it in detail. Section 4 shows the design of the proposed CPU and GPU implementations. Section 5 presents the experimental results and Section 6 concludes the paper.
Background
In this section, we discuss the Set Similarity Join problem. First, the problem is formalized (Section 2.1). Then, the Filter-Verification Framework is presented (Section 2.2). After that, some commonly used filters employed in the Filter-Verification Framework are explained (Section 2.3). Finally, state-of-the-art algorithms are described (Section 2.4).
Bitmap filter
In this section we propose the Bitmap Filter, which uses a new bitwise technique capable of speeding up set similarity joins without compromising the exact result. Bitwise operations are widely supported by modern microprocessors, and most compilers are able to map them to fast hardware instructions.
First, some preliminaries are given in Section 3.1. Then, Section 3.2 presents how the bitmap is generated. Section 3.3 shows how the overlap upper bound can be calculated using bitmaps.
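To give an intuition of how a bitmap can bound the overlap, the sketch below builds a bitmap in which each token flips one bit and then compares two bitmaps with xor and population count. It is a minimal illustration under our own assumptions, not necessarily the construction detailed in Section 3: tokens are integer ids, the hash function is simply the id modulo the bitmap width, and all function names are hypothetical. Since every token touches exactly one bit, the Hamming distance between two bitmaps never exceeds the number of tokens in the symmetric difference, which gives |r ∩ s| ≤ (|r| + |s| − hamming) / 2.

```python
from math import ceil

EPS = 1e-9  # tolerance against floating-point noise in ceil()


def make_bitmap(tokens, bits=64):
    """Build a bitmap in which each token flips the bit selected by its hash.
    Assumption: tokens are integer ids and the hash is `id % bits`."""
    bitmap = 0
    for token in tokens:
        bitmap ^= 1 << (token % bits)
    return bitmap


def popcount(x):
    """Number of bits set to 1 (maps to a single instruction on most CPUs)."""
    return bin(x).count("1")


def overlap_upper_bound(size_r, bitmap_r, size_s, bitmap_s):
    """Upper bound on |r ∩ s| derived from the Hamming distance of the bitmaps."""
    hamming = popcount(bitmap_r ^ bitmap_s)
    return (size_r + size_s - hamming) // 2


def min_overlap(size_r, size_s, threshold):
    """Smallest overlap for which Jaccard(r, s) >= threshold can hold."""
    return ceil(threshold / (1 + threshold) * (size_r + size_s) - EPS)


def bitmap_filter_passes(r, s, threshold, bits=64):
    """Return False when the pair can be pruned without losing exactness."""
    bound = overlap_upper_bound(len(r), make_bitmap(r, bits),
                                len(s), make_bitmap(s, bits))
    return bound >= min_overlap(len(r), len(s), threshold)


r, s = {1, 2, 3, 4}, {3, 4, 5, 6}
print(overlap_upper_bound(len(r), make_bitmap(r), len(s), make_bitmap(s)))
print(bitmap_filter_passes(r, s, 0.5))
```

In a real join the bitmaps would be computed once per set and stored, so that each candidate pair costs only one xor and one population count. Hash collisions can only loosen the bound, never make it incorrect.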
Implementation details
The Bitmap Filter will be evaluated using CPU and GPU implementations, which are detailed in this section.
Experimental results
In order to create a baseline, we replicated the experiments conducted in [7]. Table 4 presents the characteristics of the chosen collections. All collections were preprocessed, and the sets in each collection were sorted by size and, in case of a tie, lexicographically by token id. We observed that the lexicographical ordering speeds up all the algorithms, as stated in [7].
Fig. 8 shows the set size distribution in the collections. Orkut, netflix, spotify, enron and kosarak
Conclusions
This paper presented a new filtering technique called Bitmap Filter, which is able to reduce the running time of state-of-the-art algorithms that solve the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create binary words that represent the set of tokens with reduced dimension. The comparison of two bitmaps using population count and xor operations (Hamming distance) makes it possible to infer the number of different tokens in the original sets, and this difference
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.is.2019.101449.
Acknowledgments
We would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES, Brazil) for the Post-Doctoral Scholarship given to E. Sandes.
References (29)
- et al. Syntactic clustering of the web. Comput. Netw. ISDN Syst. (1997)
- et al. Parallel set similarity join on big data based on locality-sensitive hashing. Sci. Comput. Program. (2017)
- et al. Generalizing prefix filtering to improve set similarity joins. Inf. Syst. (2011)
- et al. A primitive operator for similarity joins in data cleaning
- Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. (2000)
- et al. Efficient parallel set-similarity joins using MapReduce
- et al. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. (2011)
- et al. Efficient exact set-similarity joins
- et al. An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. (2016)
- et al. Evaluating similarity measures: A large-scale study in the orkut social network
- String similarity joins: An experimental evaluation. Proc. VLDB Endow.
- Similarity Joins in Relational Database Systems
- Spotsigs: Robust and efficient near duplicate detection in large web collections
- Approximate string joins in a database (almost) for free