research-article

Open Access

Rethink Query Optimization in HTAP Databases

Authors:
Haoze Song, The University of Hong Kong, Hong Kong (ORCID: 0009-0000-5952-5168)
Wenchao Zhou, Alibaba Group, Hangzhou, China (ORCID: 0009-0002-2689-6020)
Feifei Li, Alibaba Group, Hangzhou, China (ORCID: 0009-0003-0770-5775)
Xiang Peng, Alibaba Group, Hangzhou, China (ORCID: 0009-0008-6355-4525)
Heming Cui, The University of Hong Kong, Hong Kong (ORCID: 0000-0001-7746-440X)

Proceedings of the ACM on Management of Data, Volume 1, Issue 4, Article No. 256, pp. 1–27. https://doi.org/10.1145/3626750

Published: 12 December 2023

Abstract

The advent of data-intensive applications has fueled the evolution of hybrid transactional and analytical processing (HTAP). To support mixed workloads, distributed HTAP databases typically maintain two data copies that are specially tailored for data freshness and performance isolation. In particular, a copy in a row-oriented format is well-suited for OLTP workloads, and a second copy in a column-oriented format is optimized for OLAP workloads. Such a hybrid design opens up a new design space for query optimization: plans can be optimized over different data formats and executed over isolated resources, which we term hybrid plans. In this paper, we demonstrate that hybrid plans can substantially benefit query execution (e.g., up to 11x speedups in our evaluation). However, we also find that these benefits may come at the cost of data freshness or performance isolation, since traditional optimizers may not precisely model and schedule the execution of hybrid plans on HTAP databases that are updated in real time. Therefore, we propose Metis, an HTAP-aware optimizer. We show, both theoretically and experimentally, that with the proposed optimizations a system can benefit substantially from hybrid plans while preserving isolated performance for OLTP and OLAP, and that these optimizations are robust to changes in workloads.

Index Terms

1. Computer systems organization
  1. Real-time systems
    1. Real-time system architecture
2. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
      2. Data layout
    2. Database management system engines
      1. Database query processing
        1. Query optimization

Published in
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4
December 2023, 1317 pages
EISSN: 2836-6573
DOI: 10.1145/3637468
Editor: Divyakant Agrawal, UC Santa Barbara, United States
Copyright © 2023 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
HTAP database
hybrid physical format
query optimization
