research-article

Cardinality estimation of approximate substring queries using deep learning

Authors:
Suyong Kwon (Seoul National University), Woohwan Jung (Hanyang University), Kyuseok Shim (Seoul National University)

Proceedings of the VLDB Endowment, Volume 15, Issue 11, pp. 3145–3157. https://doi.org/10.14778/3551793.3551859

Published: 01 July 2022

Abstract

Cardinality estimation of approximate substring queries is an important problem in database systems. Traditional approaches build a summary from the text data and estimate the cardinality from that summary under certain statistical assumptions. Because deep learning models can effectively learn complex underlying data patterns, they have been applied successfully to cardinality estimation of queries in database systems and shown to outperform traditional methods. However, they have not yet been applied to approximate substring queries, so we investigate a deep learning approach for cardinality estimation of such queries. Although the accuracy of deep learning models tends to improve as the training data size increases, producing a large training set is computationally expensive for cardinality estimation of approximate substring queries. We therefore develop efficient training data generation algorithms that avoid unnecessary computations and share common computations. We also propose a deep learning model as well as a novel learning method to quickly obtain an accurate deep learning-based estimator. Extensive experiments confirm the superiority of our data generation algorithms and of our deep learning model with the novel learning method.
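
To make the quantity being estimated concrete, the following is a minimal brute-force sketch in Python. It is not the authors' training data generation algorithm. It assumes the usual definition, which the abstract does not spell out: the cardinality of an approximate substring query is the number of strings in the dataset that contain some substring within a given edit distance of the query. The function names, the threshold parameter, and the toy dataset are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def min_substring_edit_distance(query: str, text: str) -> int:
    """Smallest edit distance between `query` and any substring of `text`.

    Uses the classic approximate-matching dynamic program (Sellers' variant
    of the edit-distance recurrence): the row for the empty query prefix is
    all zeros, so a match may start at any position of `text`.
    """
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        # Column 0: matching query[:i] against an empty substring costs i.
        cur = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cur[j] = min(
                prev[j - 1] + (qc != tc),  # match or substitution
                prev[j] + 1,               # query character left unmatched
                cur[j - 1] + 1,            # extra text character
            )
        prev = cur
    # A match may also end at any position of `text`.
    return min(prev)


def exact_cardinality(query: str, threshold: int, dataset: Iterable[str]) -> int:
    """Count strings containing a substring within `threshold` edits of `query`."""
    return sum(min_substring_edit_distance(query, s) <= threshold for s in dataset)


# Hypothetical toy dataset, for illustration only.
dataset = ["database", "datbase", "dog", "big data"]
print(exact_cardinality("data", 0, dataset))  # 2
print(exact_cardinality("data", 1, dataset))  # 3
```

Labeling one query this way costs time proportional to the query length times the total length of the dataset, and a training set needs many such labels. That cost is what makes naive label generation expensive and what the abstract's data generation algorithms aim to reduce.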

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment Volume 15, Issue 11
July 2022
980 pages
ISSN: 2150-8097
Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)
Publisher
VLDB Endowment
Badges
- Artifacts Available / v1.1
Qualifiers
- research-article