A unified framework for string similarity search with edit-distance constraint

  • Regular Paper
  • The VLDB Journal

Abstract

String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-\(k\) string similarity search. Existing algorithms are efficient for one variant or the other, but most cannot support both. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (\({\textsf {HS}}{\text {-}}{\textsf {Tree}}\)) on top of the segments. We then use the \({\textsf {HS}}{\text {-}}{\textsf {Tree}}\) to support similarity search. For threshold-based search, we identify the appropriate tree nodes for the given threshold to answer the query and devise an efficient algorithm (HS-Search). For top-\(k\) search, we identify promising strings that are likely to be similar to the query, use them to estimate an upper bound on the edit distance that prunes dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve performance. To support large data sets, we extend our techniques to the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on both problems and outperforms state-of-the-art algorithms by a factor of 5–10.
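To make the two query variants concrete, the following is a minimal Python sketch, not the authors' HS-Search or HS-Topk. It assumes a standard pigeonhole argument: if a string is cut into at least \(\tau + 1\) disjoint segments and its edit distance to the query is at most \(\tau\), some segment must occur unchanged in the query. The function names (`segments`, `threshold_search`, `top_k`) are hypothetical, the segments are recomputed per query instead of being stored in a tree index, and the top-\(k\) routine simply grows the threshold until \(k\) answers appear, which stands in for the upper-bound estimation described in the abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def segments(s: str, level: int) -> list[str]:
    """Recursively halve s, giving at most 2**level disjoint segments."""
    if level == 0 or len(s) < 2:
        return [s]
    mid = len(s) // 2
    return segments(s[:mid], level - 1) + segments(s[mid:], level - 1)


def threshold_search(strings, query, tau):
    """Return sorted (distance, string) pairs with distance <= tau."""
    level = tau.bit_length()  # smallest level with 2**level >= tau + 1
    results = []
    for s in strings:
        if abs(len(s) - len(query)) > tau:
            continue  # length filter: the distance is at least the length gap
        segs = segments(s, level)
        # Pigeonhole filter, applied only when s yields more than tau segments.
        if len(segs) > tau and not any(seg in query for seg in segs):
            continue
        d = edit_distance(s, query)  # verify the surviving candidate
        if d <= tau:
            results.append((d, s))
    return sorted(results)


def top_k(strings, query, k):
    """Return the k closest strings by growing the threshold step by step."""
    k = min(k, len(strings))
    tau = 0
    while True:
        results = threshold_search(strings, query, tau)
        if len(results) >= k:
            return results[:k]
        tau += 1


if __name__ == "__main__":
    data = ["vldb", "pvldb", "sigmod", "icde", "vldbj"]
    print(threshold_search(data, "vldb", 1))
    print(top_k(data, "vldb", 2))
```

The filter never discards a true answer, so `threshold_search` returns the same result as verifying every string. It only reduces how many edit-distance computations are needed.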

Notes

  1. http://www.informatik.uni-trier.de/~ley/db/.

  2. http://www.informatik.uni-trier.de/~ley/db/.

  3. http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/.

  4. http://www.ncbi.nlm.nih.gov/pubmed/.

Acknowledgements

This work was supported by 973 Program of China (2015CB358700), NSF of China (61373024, 61632016, 61422205, 61472198, 61661166012), Shenzhou, Tencent, TNList, FDCT/116/2013/A3, and MYRG105 (Y1-L3)-FST13-GZ.

Author information

Corresponding author

Correspondence to Guoliang Li.

About this article

Cite this article

Yu, M., Wang, J., Li, G. et al. A unified framework for string similarity search with edit-distance constraint. The VLDB Journal 26, 249–274 (2017). https://doi.org/10.1007/s00778-016-0449-y

  • DOI: https://doi.org/10.1007/s00778-016-0449-y
