Abstract
Cardinality estimation for approximate substring queries is an important problem in database systems. Traditional approaches build a summary of the text data and estimate the cardinality from that summary under statistical assumptions. Because deep learning models can effectively learn complex underlying data patterns, they have been applied successfully to cardinality estimation of queries in database systems and have been shown to outperform traditional methods. However, they have not yet been applied to approximate substring queries, so we investigate a deep learning approach to cardinality estimation for such queries. Although the accuracy of deep learning models tends to improve as the amount of training data increases, producing a large training set is computationally expensive for approximate substring queries. We therefore develop efficient training data generation algorithms that avoid unnecessary computations and share common ones. We also propose a deep learning model, together with a novel learning method, to quickly obtain an accurate deep learning-based estimator. Extensive experiments confirm the superiority of our data generation algorithms and of our deep learning model trained with the novel learning method.
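To make the cost of producing training labels concrete, the following is a minimal sketch of a brute-force label generator. It assumes that a string matches a query when some substring of it lies within edit distance `delta` of the query; this distance measure, along with the names `min_substring_edit_distance` and `true_cardinality`, are assumptions for illustration. It is a naive baseline, not the efficient generation algorithms the paper develops.

```python
from typing import Iterable


def min_substring_edit_distance(query: str, text: str) -> int:
    """Smallest edit distance between `query` and any substring of `text`."""
    # Sellers-style dynamic program: the row for the empty query prefix is
    # all zeros, so a match may start at any position in `text`.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        # Column 0: aligning the first i query characters to an empty substring.
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if qc == tc else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitute
                prev[j] + 1,         # drop a query character
                curr[j - 1] + 1,     # absorb an extra text character
            )
        prev = curr
    # The match may end at any position in `text`.
    return min(prev)


def true_cardinality(query: str, strings: Iterable[str], delta: int) -> int:
    """Count strings containing a substring within `delta` edits of `query`."""
    return sum(
        1 for s in strings if min_substring_edit_distance(query, s) <= delta
    )


# Hypothetical toy data, used only to illustrate the definition.
strings = ["database", "datbase system", "deep learning", "data"]
exact = true_cardinality("database", strings, 0)
approx = true_cardinality("database", strings, 1)
```

Each label costs time proportional to the query length times the total length of the data, and a training set needs one such label per (query, threshold) pair. That repeated work is the expense the abstract refers to.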