Bitmap filter: Speeding up exact set similarity joins with bitwise operations

doi:10.1016/j.is.2019.101449

Information Systems

Volume 88, February 2020, 101449

https://doi.org/10.1016/j.is.2019.101449

Highlights

• A novel low-overhead filter (Bitmap Filter) for the exact set similarity join.
• We improved four state-of-the-art algorithms by up to 4.50$\times$ (1.35$\times$ on average).
• Bitmap Filter can be used in the candidate generation or verification stages.
• Bitmap Filter was also implemented on GPUs, presenting speedups of up to 577$\times$.

Abstract

The Exact Set Similarity Join problem aims to find all similar sets between two collections of sets, with respect to a threshold and a similarity function such as Overlap, Jaccard, Dice or Cosine. The naïve approach verifies all pairs of sets and is often considered impractical due to the high number of combinations. Exact Set Similarity Join algorithms are therefore usually based on the Filter-Verification Framework, which applies a series of filters to reduce the number of verified pairs. This paper presents a new filtering technique called Bitmap Filter, which is able to accelerate state-of-the-art algorithms for the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create bitmaps of fixed $b$ bits, representing characteristics of the sets. Then, it applies bitwise operations (such as xor and population count) on the bitmaps in order to infer a similarity upper bound for each pair of sets. If the upper bound is below a given similarity threshold, the pair of sets is pruned. The Bitmap Filter benefits from the fact that bitwise operations are efficiently implemented by many modern general-purpose processors, and it was easily applied to four state-of-the-art CPU algorithms: AllPairs, PPJoin, AdaptJoin and GroupJoin. Furthermore, we propose a Graphics Processing Unit (GPU) algorithm based on the naïve approach but using the Bitmap Filter to speed up the computation. The experiments considered 12 collections containing from 100 thousand up to 10 million sets, and the joins were made using Jaccard thresholds from 0.50 to 0.95. The Bitmap Filter was able to improve 85% of the experiments in CPU, with speedups of up to 4.50$\times$ and 1.35$\times$ on average. Using the GPU algorithm, the experiments achieved speedups of up to 577$\times$ over the original CPU algorithms using an Nvidia GeForce GTX 980 Ti.

Introduction

Data analysts often need to identify similar records stored in multiple data collections. This operation is very common for data cleaning [1], element clustering [2] and record linkage [3], [4]. In some scenarios the similarity analysis aims to detect near duplicate records [5], in which slightly different data representations may be caused by input errors, misspelling or use of synonyms [6]. In other scenarios, the goal is to find patterns or common behaviors between real entities, such as purchase patterns [7] and similar user interests [8].

Set Similarity Join is the operation that identifies all similar sets between two collections of sets (or inside the same collection) [7], [9]. In this case, each record is represented as a set of elements (tokens). For example, the text record “Data Warehouse” may be represented as a set of two words {Data, Warehouse} or a set of bigrams {Da, at, ta, a_, _W, Wa, ar, re, eh, ho, ou, us, se}. Two records are considered similar according to a similarity function, such as Overlap, Dice, Cosine or Jaccard [10], [11]. For instance, two records may be considered similar if the overlap (intersection) of their bigrams is greater than or equal to 6, such as in “Databases” and “Databazes”.
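The tokenization and similarity functions described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `bigrams`, `overlap` and `jaccard` are hypothetical, and spaces are written as `_` to match the example in the text.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of character bigrams of `text`, writing spaces as '_'."""
    text = text.replace(" ", "_")
    return {text[i:i + 2] for i in range(len(text) - 1)}


def overlap(r: set[str], s: set[str]) -> int:
    """Overlap similarity: the number of tokens shared by both sets."""
    return len(r & s)


def jaccard(r: set[str], s: set[str]) -> float:
    """Jaccard similarity: |r ∩ s| / |r ∪ s| (defined as 1.0 for two empty sets)."""
    union = len(r | s)
    return len(r & s) / union if union else 1.0
```

With these definitions, `bigrams("Data Warehouse")` yields the 13 bigrams listed above, and `overlap(bigrams("Databases"), bigrams("Databazes"))` evaluates to 6, so the pair meets an overlap threshold of 6.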

Many different solutions have been proposed in the literature for Set Similarity Join [12], [1], [13], [5], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. With respect to the guarantee that all similar pairs will be found, these solutions are divided into exact approaches [12], [1], [13], [5], [14], [15], [22], [23], [24] and approximate approaches [16], [17], [18], [19], [20], [21]. The approximate algorithms usually rely on the Locality-Sensitive Hashing (LSH) technique [25] and are often very competitive, but they are outside the scope of this paper, which considers only exact approaches.

The solutions for Exact Set Similarity Join are usually based on the Filter-Verification Framework [7], [24], which is divided into two stages: (a) the candidate generation uses a series of filters to produce a reduced list of candidate pairs; (b) the verification stage verifies each candidate pair in order to check which ones are similar, with respect to the selected similarity function and threshold.
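The two stages can be illustrated with a small self-join skeleton. This is a simplified sketch, not the structure of any of the cited algorithms: it enumerates all pairs for clarity (real implementations generate candidates through an index), it assumes non-empty sets and the Jaccard function, and the names `filter_verify_self_join` and `size_filter` are hypothetical.

```python
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity of two non-empty sets."""
    return len(r & s) / len(r | s)


def filter_verify_self_join(sets, threshold, filters):
    """Return index pairs (i, j), i < j, with Jaccard similarity >= threshold.

    Each filter takes (r, s, threshold) and returns False only when the pair
    provably cannot reach the threshold, so the result stays exact.
    """
    # Stage (a): candidate generation, keeping pairs that pass every filter.
    candidates = [
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if all(f(sets[i], sets[j], threshold) for f in filters)
    ]
    # Stage (b): verification of each surviving candidate pair.
    return [(i, j) for i, j in candidates if jaccard(sets[i], sets[j]) >= threshold]


def size_filter(r, s, threshold):
    """Example filter: Jaccard(r, s) can never exceed min(|r|, |s|) / max(|r|, |s|)."""
    return min(len(r), len(s)) >= threshold * max(len(r), len(s))
```

Because a filter may only discard pairs that cannot be similar, the join returns the same pairs with or without `size_filter`; the filter only reduces how many pairs reach the verification stage.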

The filtering stage is commonly based on the Prefix Filter [1], [13] and the Length Filter [12], which are able to prune a considerable number of candidate pairs without compromising the exact result. The Prefix Filter [1], [13] is based on the idea that two similar sets must share at least one element among their initial elements, given a defined element ordering. The Length Filter [12] prunes candidate pairs according to the lengths of the sets, considering that sets with a large difference in length will have low similarity. Other filters have also been proposed, such as the Positional Filter [5], which prunes candidate pairs based on the position where the token occurs in the set, and the Suffix Filter [5], which prunes pairs by applying a binary search on the last elements of the sets.
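For the Jaccard function, the Prefix and Length filters are commonly stated with the bounds used in the sketch below. The bounds are the standard ones from the set similarity join literature rather than text from this section, the function names are hypothetical, and sets are assumed to be lists of distinct token ids sorted under one global ordering.

```python
import math

EPS = 1e-9  # guards against floating-point round-up inside ceil()


def prefix_length(size: int, threshold: float) -> int:
    """Prefix length for a Jaccard threshold: |r| - ceil(t * |r|) + 1."""
    return size - math.ceil(threshold * size - EPS) + 1


def length_filter(r, s, threshold) -> bool:
    """Keep the pair only if t * |r| <= |s| <= |r| / t."""
    return threshold * len(r) <= len(s) <= len(r) / threshold


def prefix_filter(r, s, threshold) -> bool:
    """Keep the pair only if the two prefixes share at least one token.

    r and s must be sorted under the same global token ordering.
    """
    prefix_r = set(r[:prefix_length(len(r), threshold)])
    prefix_s = s[:prefix_length(len(s), threshold)]
    return any(token in prefix_r for token in prefix_s)
```

With a threshold of 0.8, the sets `[1, 2, 3, 4, 5]` and `[3, 4, 5, 6, 7]` have prefixes `[1, 2]` and `[3, 4]`, which share no token, so the pair is pruned without computing its similarity.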

There are many algorithms in the literature combining the aforementioned filters. AllPairs [13] uses the Prefix and Length filters, PPJoin [5] extends AllPairs with the Positional Filter, and PPJoin+ [5] extends PPJoin with the Suffix Filter. AdaptJoin [14] extends PPJoin using a variable schema with adaptive prefix lengths, and GroupJoin [15] applies the filters to groups of similar sets, avoiding redundant computations. These algorithms usually rely on the assumption that the verification stage contributes significantly to the overall running time. However, a new verification procedure using an early termination condition was proposed in [7], and it significantly reduced the execution time of the verification phase. As a result, the proportion of execution time spent in the candidate generation phase increased and the simplest filtering schemes are now able to produce the best performance. In [7], the best results were obtained by AllPairs, PPJoin, GroupJoin and AdaptJoin, with AllPairs being the best algorithm in the largest number of experiments. In contrast, PPJoin+ [5], which uses a sophisticated suffix filter technique, was the slowest algorithm in [7].
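The idea of verification with early termination can be sketched as a merge over two sorted sets that stops as soon as the required overlap is out of reach. This is an illustrative version under stated assumptions, not necessarily the exact procedure of [7]: sets are sorted lists of distinct token ids, the similarity function is Jaccard, and the names `required_overlap` and `verify` are hypothetical.

```python
import math


def required_overlap(len_r: int, len_s: int, threshold: float) -> int:
    """Smallest overlap o such that o / (len_r + len_s - o) >= threshold.

    Rearranging the Jaccard inequality gives o >= t / (1 + t) * (len_r + len_s).
    The small epsilon guards against floating-point round-up inside ceil().
    """
    return math.ceil(threshold / (1 + threshold) * (len_r + len_s) - 1e-9)


def verify(r, s, threshold) -> bool:
    """Merge-style verification of two sorted lists of distinct token ids."""
    need = required_overlap(len(r), len(s), threshold)
    i = j = found = 0
    while i < len(r) and j < len(s):
        # Early termination: even if every remaining token matched,
        # the overlap could not reach the required value.
        if found + min(len(r) - i, len(s) - j) < need:
            return False
        if r[i] == s[j]:
            found += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return found >= need
```

For two disjoint sets of four tokens and a threshold of 0.5, the required overlap is 3, and the loop returns after inspecting only a few tokens instead of scanning both sets to the end.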

The state-of-the-art set similarity join algorithms still need to be improved in order to allow a faster join; otherwise, very large collections may not be processed in a reasonable time. Based on the recent findings on the overhead trade-off between filtering and verification, the authors of [7] claimed that future filtering techniques should invest in fast and simple candidate generation methods instead of sophisticated filter-based algorithms; otherwise, the filters' effectiveness will not pay off their overhead when compared to the verification procedure.

The main contribution of this paper is a new low-overhead filter called Bitmap Filter, which is able to efficiently prune candidate pairs and improve the performance of many state-of-the-art exact set similarity join algorithms such as AllPairs [13], PPJoin [5], AdaptJoin [14] and GroupJoin [15]. Using bitwise operations implemented by most modern processors, the Bitmap Filter is able to speed up the AllPairs, PPJoin, AdaptJoin and GroupJoin algorithms by up to 4.50$\times$ (1.43$\times$ on average) considering 9 different collections and Jaccard thresholds varying from 0.50 to 0.95. As far as we know, there is no other filter for the Exact Set Similarity Join problem that is based on efficient bitwise operations.

The Bitmap Filter is sufficiently flexible to be used in the candidate generation or verification stages. Furthermore, it can be efficiently implemented in devices such as Graphics Processing Units (GPUs). As a secondary contribution of this paper, we thus propose a GPU implementation of the Bitmap Filter that is able to speed up the join by up to 577$\times$ using an Nvidia GeForce GTX 980 Ti card, considering 6 collections of up to 606,770 sets.

The rest of this paper is structured as follows. Section 2 presents the background of the Set Similarity Join problem. Then, Section 3 proposes the Bitmap Filter and explains it in detail. Section 4 shows the design of the proposed CPU and GPU implementations. Section 5 presents the experimental results and Section 6 concludes the paper.

Section snippets

Background

In this section, we discuss the Set Similarity Join problem. First, the problem is formalized (Section 2.1). Then, the Filter-Verification Framework is presented (Section 2.2). After that, some commonly used filters employed in the Filter-Verification Framework are explained (Section 2.3). Finally, state-of-the-art algorithms are described (Section 2.4).

Bitmap filter

In this section we propose the Bitmap Filter, which uses a new bitwise technique capable of speeding up set similarity joins without compromising the exact result. Bitwise operations are widely supported by modern microprocessors, and most compilers are able to map them to fast hardware instructions.

First, some preliminaries are given in Section 3.1. Then, Section 3.2 presents how the bitmap is generated. Section 3.3 shows how the overlap upper bound can be calculated using bitmaps.
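The general idea of deriving an overlap upper bound from two bitmaps can be sketched as follows. The details are assumptions made for illustration, since this snippet does not reproduce Sections 3.2 and 3.3: tokens are taken to be integer ids, the hash function is simply the id modulo $b$, each token toggles one bit (an xor-based construction), and all function names are hypothetical. Under these assumptions the xor of two bitmaps equals the bitmap of the symmetric difference of the sets, so its population count never exceeds the number of differing tokens, which gives $|r \cap s| \le (|r| + |s| - \mathrm{hamming}) / 2$.

```python
import math


def make_bitmap(tokens, b: int = 64) -> int:
    """Build a b-bit bitmap in which each distinct token toggles one bit.

    Assumption: tokens are integer ids and the hash is `token % b`.
    """
    bitmap = 0
    for token in set(tokens):
        bitmap ^= 1 << (token % b)
    return bitmap


def overlap_upper_bound(len_r: int, len_s: int, bm_r: int, bm_s: int) -> int:
    """Upper bound on |r ∩ s| from the Hamming distance of the two bitmaps."""
    hamming = bin(bm_r ^ bm_s).count("1")  # xor followed by population count
    return (len_r + len_s - hamming) // 2


def bitmap_filter_passes(r, s, threshold: float, b: int = 64) -> bool:
    """Return False when the pair provably cannot reach the Jaccard threshold."""
    # Jaccard(r, s) >= t  <=>  |r ∩ s| >= t / (1 + t) * (|r| + |s|)
    need = math.ceil(threshold / (1 + threshold) * (len(r) + len(s)) - 1e-9)
    bound = overlap_upper_bound(len(r), len(s), make_bitmap(r, b), make_bitmap(s, b))
    return bound >= need
```

In a real join the bitmaps would be computed once per set and stored, so that each candidate pair costs only one xor and one population count. Hash collisions can only make the bound looser, never smaller than the true overlap, which is why the filter does not compromise the exact result.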

Implementation details

The Bitmap Filter will be evaluated using CPU and GPU implementations, which are detailed in this section.

Experimental results

In order to create a baseline, we replicated the experiments conducted in [7]. Table 4 presents the characteristics of the chosen collections. All collections were preprocessed: the sets in each collection were sorted by size and, in case of a tie, lexicographically by their token ids. We observed that the lexicographical ordering speeds up all the algorithms, as stated in [7].
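The ordering described above can be expressed as a sort on a composite key. The following is a minimal sketch with a hypothetical function name; sorting the token ids inside each set and removing duplicate tokens are assumptions, since the preprocessing details are not reproduced in this snippet.

```python
def preprocess(collection):
    """Order sets by size, breaking ties lexicographically by token ids.

    Assumption: each set becomes a sorted list of distinct integer token ids.
    """
    sets = [sorted(set(tokens)) for tokens in collection]
    sets.sort(key=lambda tokens: (len(tokens), tokens))
    return sets
```

For example, `preprocess([[3, 1, 2], [5, 4], [2, 1], [1, 2, 4]])` returns `[[1, 2], [4, 5], [1, 2, 3], [1, 2, 4]]`.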

Fig. 8 shows the set size distribution in the collections. Orkut, netflix, spotify, enron and kosarak

Conclusions

This paper presented a new filtering technique called Bitmap Filter, which is able to reduce the running time of state-of-the-art algorithms that solve the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create binary words of $b$ bits that represent the set of tokens with reduced dimension. The comparison of two bitmaps using population count and xor operations (Hamming distance) makes it possible to infer the number of different tokens in the original sets, and this difference

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.is.2019.101449.

Acknowledgments

We would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES, Brazil) for the Post-Doctoral Scholarship given to E. Sandes.

References (29)

Broder A.Z. et al., Syntactic clustering of the web, Comput. Netw. ISDN Syst. (1997)
Sohrabi M.K. et al., Parallel set similarity join on big data based on locality-sensitive hashing, Sci. Comput. Program. (2017)
Ribeiro L.A. et al., Generalizing prefix filtering to improve set similarity joins, Inf. Syst. (2011)
Chaudhuri S. et al., A primitive operator for similarity joins in data cleaning
Cohen W.W., Data integration using similarity joins and a word-based information representation language, ACM Trans. Inf. Syst. (2000)
Vernica R. et al., Efficient parallel set-similarity joins using MapReduce
Xiao C. et al., Efficient similarity joins for near-duplicate detection, ACM Trans. Database Syst. (2011)
Arasu A. et al., Efficient exact set-similarity joins
Mann W. et al., An empirical evaluation of set similarity join techniques, Proc. VLDB Endow. (2016)
Spertus E. et al., Evaluating similarity measures: A large-scale study in the orkut social network
Jiang Y. et al., String similarity joins: An experimental evaluation, Proc. VLDB Endow. (2014)
Augsten N. et al., Similarity Joins in Relational Database Systems (2013)
Theobald M. et al., Spotsigs: Robust and efficient near duplicate detection in large web collections
Gravano L. et al., Approximate string joins in a database (almost) for free

