
Cardinality estimation of approximate substring queries using deep learning

Published: 01 July 2022

Abstract

Cardinality estimation for approximate substring queries is an important problem in database systems. Traditional approaches build a summary of the text data and estimate cardinality from that summary under statistical assumptions. Because deep learning models can learn complex underlying data patterns effectively, they have been applied successfully to cardinality estimation for database queries and have been shown to outperform traditional methods. However, they have not yet been applied to approximate substring queries, so we investigate a deep learning approach for cardinality estimation of such queries. Although the accuracy of deep learning models tends to improve as the training data size increases, producing a large training dataset is computationally expensive for cardinality estimation of approximate substring queries. We therefore develop efficient training data generation algorithms that avoid unnecessary computations and share common computations. We also propose a deep learning model and a novel learning method for quickly obtaining an accurate deep learning-based estimator. Extensive experiments confirm the superiority of our data generation algorithms and of our deep learning model trained with the novel learning method.
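
To make the cost of producing training labels concrete, here is a minimal brute-force sketch of the quantity being estimated. It assumes the usual definition, which the abstract does not spell out: the cardinality of a query with edit-distance threshold `delta` is the number of database strings that contain some substring within edit distance `delta` of the query. The function names and the toy database are hypothetical, and this is a naive baseline for illustration, not the data generation algorithms proposed in the paper.

```python
from typing import Iterable


def min_substring_edit_distance(query: str, text: str) -> int:
    """Smallest edit distance between `query` and any substring of `text`.

    Standard dynamic program for approximate pattern matching: the row for
    the empty query prefix is all zeros, so a match may start at any
    position in `text` at no cost.
    """
    # prev[j]: best distance between the current query prefix and a
    # substring of `text` that ends at position j.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        curr = [i] + [0] * len(text)  # matching against an empty substring costs i
        for j, tc in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,               # drop a query character
                curr[j - 1] + 1,           # absorb an extra text character
                prev[j - 1] + (qc != tc),  # match or substitute
            )
        prev = curr
    # The match may end anywhere in `text`.
    return min(prev)


def cardinality(query: str, delta: int, database: Iterable[str]) -> int:
    """Count database strings with a substring within `delta` edits of `query`."""
    return sum(min_substring_edit_distance(query, s) <= delta for s in database)


if __name__ == "__main__":
    toy_db = ["database", "datum", "debate", "cat"]  # hypothetical data
    for delta in (0, 1):
        print(delta, cardinality("data", delta, toy_db))
```

Each label costs time proportional to the query length times the total length of the database, and a training set needs one label per (query, threshold) pair. That per-label cost is why avoiding and sharing computations across queries matters when generating large training sets.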

