A unified framework for string similarity search with edit-distance constraint

Yu, Minghe; Wang, Jin; Li, Guoliang; Zhang, Yong; Deng, Dong; Feng, Jianhua

doi:10.1007/s00778-016-0449-y

A unified framework for string similarity search with edit-distance constraint

Regular Paper
Published: 17 December 2016

Volume 26, pages 249–274, (2017)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Minghe Yu¹,
Jin Wang¹,
Guoliang Li¹,
Yong Zhang¹,
Dong Deng¹ &
…
Jianhua Feng¹

2082 Accesses
23 Citations
Explore all metrics

Abstract

String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-\(k\) string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both two variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (\({\textsf {HS}}{\text {-}}{\textsf {Tree}}\)) on top of the segments. Then, we utilize the \({\textsf {HS}}{\text {-}}{\textsf {Tree}}\) to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-\(k\) search, we identify promising strings with large possibility to be similar to the query, utilize these strings to estimate an upper bound which is used to prune dissimilar strings and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve the performance. To support large data sets, we extend our techniques to support the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on the two problems and outperforms state-of-the-art algorithms by 5–10 times.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Longest Common Substring with Approximately k Mismatches

Article Open access 16 February 2019

Tomasz Kociumaka, Jakub Radoszewski & Tatiana Starikovskaya

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Article 12 April 2024

Rashmin Gajera, Suresh Patel, … Ayush Solanki

Clustering graph data: the roadmap to spectral techniques

Article Open access 22 January 2024

Rahul Mondal, Evelina Ignatova, … Robert Heyer

Notes

References

Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., Xie, X.: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 40, e41 (2012)
Article Google Scholar
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)
Behm, A., Li, C., Carey, M.J.: Answering approximate string queries on large data sets using external memory. In: ICDE, pp. 888–899 (2011)
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)
Chaudhuri, S., Kaushik, R.: Extending autocompletion to tolerate errors. In: SIGMOD Conference, pp. 707–718 (2009)
Deng, D., Li, G., Feng, J.: A pivotal prefix based filtering algorithm for string similarity search. In: SIGMOD Conference, pp. 673–684 (2014)
Deng, D., Li, G., Feng, J., Duan, Y., Gong, Z.: A unified framework for approximate dictionary-based entity extraction. VLDB J. 24(1), 143–167 (2015)
Article Google Scholar
Deng, D., Li, G., Feng, J., Li, W.-S.: Top-\(k\) string similarity search with edit-distance constraints. In: ICDE, pp. 925–936 (2013)
Deng, D., Li, G., Hao, S., Wang, J., Feng, J.: Massjoin: a mapreduce-based method for scalable string similarity joins. In: ICDE, pp. 340–351 (2014)
Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. PVLDB 9(4), 360–371 (2015)
Google Scholar
Deng, D., Li, G., Wen, H., Jagadish, H.V., Feng, J.: META: an efficient matching-based method for error-tolerant autocompletion. PVLDB 9(10), 828–839 (2016)
Google Scholar
Feng, J., Wang, J., Li, G.: Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21(4), 437–461 (2012)
Article Google Scholar
Gerdjikov, S., Mihov, S., Mitankin, P., Schulz, K.U.: Wallbreaker: overcoming the wall effect in similarity search. In:EDBT/ICDT, pp. 366–369 (2013)
Guo, L., Shanmugasundaram, J., Beyer, K.S., Shekita, E.J.: Efficient inverted lists and query algorithms for structured value ranking in update-intensive relational databases. In: ICDE, pp. 298–309 (2005)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Book MATH Google Scholar
Hadjieleftheriou, M., Yu, X., Koudas, N., Srivastava, D.: Hashed samples: selectivity estimators for set similarity selection queries. PVLDB 1(1), 201–212 (2008)
Google Scholar
Ji, S., Li, G., Li, C., Feng, J.: Efficient interactive fuzzy keyword search. In: WWW (2009)
Jiang, Y., Li, G., Feng, J.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)
Google Scholar
Kim, Y., Shim, K.: Efficient top-k algorithms for approximate substring matching. In: SIGMOD Conference, pp. 385–396 (2013)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
Li, C., Wang, B., Yang, X.: Vgram: improving performance of approximate queries on string collections using variable-length grams. In: VLDB, pp. 303–314 (2007)
Li, G., Deng, D., Feng, J.: Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In: SIGMOD Conference, pp. 529–540 (2011)
Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)
Google Scholar
Li, G., Feng, J., Li, C.: Supporting search-as-you-type using SQL in databases. IEEE Trans. Knowl. Data Eng. 25(2), 461–475 (2013)
Article Google Scholar
Li, G., He, J., Deng, D., Li, J.: Efficient similarity join and search on multi-attribute data. In: SIGMOD, pp. 1137–1151 (2015)
Li, G., Ji, S., Li, C., Feng, J.: Efficient fuzzy full-text type-ahead search. VLDB J. 20(4), 617–640 (2011)
Article Google Scholar
Mansour, E., Allam, A., Skiadopoulos, S., Kalnis, P.: Era: Efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow. 5(1), 49–60 (2011)
Article Google Scholar
Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD Conference, pp. 1033–1044 (2011)
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD Conference, pp. 743–754 (2004)
Sellers, P.H.: The theory and computation of evolutionary distances: pattern recognition. J. Algorithms 1(4), 359–373 (1980)
Article MathSciNet MATH Google Scholar
Siragusai, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78 (2013)
Tomasic, A., Garcia-Molina, H., Shoens, K.A.: Incremental updates of inverted lists for text document retrieval. In: SIGMOD, pp. 289–300 (1994)
Wandelt, S., Deng, D., Gerdjikov, S., Mishra, S., Mitankin, P., Patil, M., Siragusa, E., Tiskin, A., Wang, W., Wang, J., Leser, U.: State-of-the-art in string similarity search and join. SIGMOD Rec. 43(1), 64–76 (2014)
Article Google Scholar
Wang, J., Li, G., Deng, D., Zhang, Y., Feng, J.: Two birds with one stone: an efficient hierarchical framework for top-\(k\) and threshold-based string similarity search. In: ICDE (2015)
Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)
Wang, J., Li, G., Feng, J.: Fast-join: An efficient method for fuzzy token matching based string similarity join. In: ICDE, pp. 458–469 (2011)
Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: SIGMOD Conference, pp. 85–96 (2012)
Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: SIGMOD Conference, (2009)
Wang, X., Ding, X., Tung, A.K.H., Zhang, Z.: Efficient and effective knn sequence search with approximate n-grams. PVLDB 7, 1–12 (2014)
Google Scholar
Xiao, C., Qin, J., Wang, W., Ishikawa, Y., Tsuda, K., Sadakane, K.: Efficient error-tolerant query autocompletion. PVLDB 6(6), 373–384 (2013)
Google Scholar
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)
MathSciNet Google Scholar
Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008)
Yang, Z., Yu, J., Kitsuregawa, M.: Fast algorithms for top-k approximate string matching. In: AAAI (2010)
Yu, M., Li, G., Deng, D., Feng, J.: String similarity search and join: a survey. Front. Comput. Sci. 10(3), 399–417 (2016)
Article Google Scholar
Zhang, Z., Hadjieleftheriou, M., Ooi, B.C., Srivastava, D.: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD Conference, pp. 915–926 (2010)

Download references

Acknowledgements

This work was supported by 973 Program of China (2015CB358700), NSF of China (61373024, 61632016, 61422205, 61472198, 61661166012), Shenzhou, Tencent, TNList, FDCT/116/2013/A3, and MYRG105 (Y1-L3)-FST13-GZ.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng & Jianhua Feng

Authors

Minghe Yu
View author publications
You can also search for this author in PubMed Google Scholar
Jin Wang
View author publications
You can also search for this author in PubMed Google Scholar
Guoliang Li
View author publications
You can also search for this author in PubMed Google Scholar
Yong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Dong Deng
View author publications
You can also search for this author in PubMed Google Scholar
Jianhua Feng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Guoliang Li.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yu, M., Wang, J., Li, G. et al. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal 26, 249–274 (2017). https://doi.org/10.1007/s00778-016-0449-y

Download citation

Received: 10 February 2015
Revised: 25 September 2016
Accepted: 24 November 2016
Published: 17 December 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s00778-016-0449-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

A unified framework for string similarity search with edit-distance constraint

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Clustering graph data: the roadmap to spectral techniques

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A unified framework for string similarity search with edit-distance constraint

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Clustering graph data: the roadmap to spectral techniques

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation