research-article
Open Access

Rethink Query Optimization in HTAP Databases

Published: 12 December 2023

Abstract

The advent of data-intensive applications has fueled the evolution of hybrid transactional and analytical processing (HTAP). To support mixed workloads, distributed HTAP databases typically maintain two data copies that are specially tailored for data freshness and performance isolation. In particular, a copy in a row-oriented format is well suited to OLTP workloads, and a second copy in a column-oriented format is optimized for OLAP workloads. Such a hybrid design opens up a new design space for query optimization: plans, which we term hybrid plans, can be optimized over different data formats and executed over isolated resources. In this paper, we demonstrate that hybrid plans can substantially speed up query execution (e.g., by up to 11x in our evaluation). However, we also find that these benefits may come at the cost of data freshness or performance isolation, since traditional optimizers may not precisely model and schedule the execution of hybrid plans on HTAP databases that are updated in real time. Therefore, we propose Metis, an HTAP-aware optimizer. We show, both theoretically and experimentally, that with the proposed optimizations a system can benefit substantially from hybrid plans while preserving isolated performance for OLTP and OLAP, and that these optimizations are robust to changes in workloads.
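
To make the notion of a hybrid plan concrete, the following is a minimal illustrative sketch and not the Metis optimizer described in the paper. It assumes a toy cost model in which each table can be read either from the row-oriented copy (cost proportional to the number of matching rows) or from the column-oriented copy (cost proportional to rows scanned times columns read). The names `TableStats`, `ROW_SEEK_COST`, `COL_SCAN_COST`, and `choose_access_paths`, as well as the cost constants and table statistics, are hypothetical. The sketch deliberately ignores the data freshness and performance isolation concerns that the abstract identifies as the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    name: str
    rows: int            # total rows in the table
    selectivity: float   # fraction of rows the query predicate keeps
    columns_read: int    # number of columns the query touches


# Hypothetical cost constants chosen for illustration only.
ROW_SEEK_COST = 4.0    # per matching row, index lookup on the row copy
COL_SCAN_COST = 0.05   # per row and per column, scan on the column copy


def row_store_cost(t: TableStats) -> float:
    """Toy cost of fetching only the matching rows from the row copy."""
    return t.rows * t.selectivity * ROW_SEEK_COST


def column_store_cost(t: TableStats) -> float:
    """Toy cost of scanning the needed columns in the column copy."""
    return t.rows * t.columns_read * COL_SCAN_COST


def choose_access_paths(tables: list[TableStats]) -> dict[str, str]:
    """Pick the cheaper copy for each table independently."""
    return {
        t.name: "row" if row_store_cost(t) <= column_store_cost(t) else "column"
        for t in tables
    }


def is_hybrid(plan: dict[str, str]) -> bool:
    """A plan is hybrid if it reads from both data formats."""
    return len(set(plan.values())) > 1


# A selective point lookup joined with a large scan (made-up statistics).
tables = [
    TableStats("orders", rows=1_000_000, selectivity=0.00001, columns_read=3),
    TableStats("lineitem", rows=6_000_000, selectivity=0.5, columns_read=2),
]
plan = choose_access_paths(tables)
print(plan, "hybrid:", is_hybrid(plan))
```

Under these assumed numbers the selective lookup is cheaper on the row copy and the large scan is cheaper on the column copy, so the resulting plan mixes both formats. A realistic optimizer would also have to account for where each operator runs and how fresh each copy is, which this sketch does not model.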

Published in

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4
December 2023
1317 pages
EISSN: 2836-6573
DOI: 10.1145/3637468
Editor: Divyakant Agrawal

Copyright © 2023 Owner/Author

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States