Abstract
Query optimization is essential for the efficient execution of queries. The necessary analysis, if we can and should apply optimizations and transform the query plan, is already challenging. Traditional techniques focus on the availability of columns at individual operators, which does not scale for analysis of data flow through the query. Tracking available columns per operator takes quadratic space, which can result in multi-second optimization time for deep algebra trees. Instead, we need to re-think the naïve algebra representation to efficiently support data flow analysis.
In this paper, we introduce Indexed Algebra, a novel representation of relational algebra that makes common optimization tasks efficient. Indexed Algebra enables efficient reasoning with an auxiliary index structure based on link/cut trees that support dynamic updates and queries in O(log n). This approach not only improves the asymptotic complexity, but also allows elegant and concise formulations for the data flow questions needed for query optimization. While large queries see theoretically unbounded improvements, Indexed Algebra also improves optimization time of the relatively harmless queries of TPC-H and TPC-DS by more than 1.8×.
- Yanif Ahmad, Oliver Kennedy, Christoph Koch, and Milos Nikolic. 2012. DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. Proc. VLDB Endow. 5, 10 (2012), 968--979.Google ScholarDigital Library
- Darshana Balakrishnan, Carl Nuessle, Oliver Kennedy, and Lukasz Ziarek. 2021. TreeToaster: Towards an IVM-Optimized Compiler. In SIGMOD Conference. ACM, 155--167.Google ScholarDigital Library
- Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In SIGMOD. ACM, 221--230.Google ScholarDigital Library
- Peter A. Boncz, Thomas Neumann, and Orri Erling. 2013. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In TPCTC (Lecture Notes in Computer Science), Vol. 8391. Springer, 61--76.Google Scholar
- Nicolas Dieu, Adrian Dragusanu, Françoise Fabret, François Llirbat, and Eric Simon. 2009. 1,000 Tables Under the From. Proc. VLDB Endow. 2, 2 (2009), 1450--1461.Google ScholarDigital Library
- Philipp Fent, Altan Birler, and Thomas Neumann. 2022. Practical planning and execution of groupjoin and nested aggregates. VLDB J. (2022).Google Scholar
- Michael J. Freitag and Thomas Neumann. 2019. Every Row Counts: Combining Sketches and Sampling for Accurate Group-By Result Estimates. In CIDR. www.cidrdb.org.Google Scholar
- Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6, 1 (1994), 120--135.Google ScholarDigital Library
- Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18, 3 (1995), 19--29.Google Scholar
- Goetz Graefe and David J. DeWitt. 1987. The EXODUS Optimizer Generator. In SIGMOD. ACM Press, 160--172.Google Scholar
- Torsten Grust. 2002. Accelerating XPath location steps. In SIGMOD Conference. ACM, 109--120.Google ScholarDigital Library
- Dov Harel and Robert Endre Tarjan. 1984. Fast Algorithms for Finding Nearest Common Ancestors. SIAM J. Comput. 13, 2 (1984), 338--355.Google ScholarDigital Library
- Michael Jungmair, André Kohn, and Jana Giceva. 2022. Designing an Open Framework for Query Optimization and Compilation. Proc. VLDB Endow. 15, 11 (2022), 2389--2401.Google ScholarDigital Library
- Timo Kersten, Viktor Leis, and Thomas Neumann. 2021. Tidy Tuples and Flying Start: fast compilation and fast execution of relational queries in Umbra. VLDB J. 30, 5 (2021), 883--905.Google ScholarDigital Library
- Philip N. Klein and Shay Mozes. 2021. Optimization Algorithms for Planar Graphs. https://planarity.org.Google Scholar
- Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In SIGMOD. ACM, 311--326.Google ScholarDigital Library
- Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? Proc. VLDB Endow. 9, 3 (2015), 204--215.Google ScholarDigital Library
- Zhenghua Lyu, Huan Hubert Zhang, Gang Xiong, Gang Guo, Haozhou Wang, Jinbao Chen, Asim Praveen, Yu Yang, Xiaoming Gao, Alexandra Wang, Wen Lin, Ashwin Agrawal, Junfeng Yang, Hao Wu, Xiaoliang Li, Feng Guo, Jiang Wu, Jesse Zhang, and Venkatesh Raghavan. 2021. Greenplum: A Hybrid Database for Transactional and Analytical Workloads. In SIGMOD. ACM, 2530--2542.Google Scholar
- Norman May, Alexander Böhm, and Wolfgang Lehner. 2017. SAP HANA - The Evolution of an In-Memory DBMS from Pure OLAP Processing Towards Mixed Workloads. In BTW (LNI), Vol. P-265. GI, 545--563.Google Scholar
- Guido Moerkotte. 2020. Building Query Compilers. http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf.Google Scholar
- Raghunath Othayoth Nambiar and Meikel Poess. 2006. The Making of TPC-DS. In VLDB. ACM, 1049--1058.Google Scholar
- Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. Proc. VLDB Endow. 4, 9 (2011), 539--550.Google ScholarDigital Library
- Thomas Neumann. 2018. Reasoning in the Presence of NULLs. In ICDE. IEEE Computer Society, 1682--1683.Google Scholar
- Thomas Neumann. 2020. Linear Time Liveness Analysis. https://databasearchitects.blogspot.com/2020/04/linear-time-liveness-analysis.html.Google Scholar
- Thomas Neumann. 2020. Taming Deep Recursion. https://databasearchitects.blogspot.com/2020/11/taming-deep-recursion.html.Google Scholar
- Thomas Neumann and Michael J. Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In CIDR. www.cidrdb.org.Google Scholar
- Thomas Neumann and Alfons Kemper. 2015. Unnesting Arbitrary Queries. In BTW (LNI), Vol. P-241. GI, 383--402.Google Scholar
- Thomas Neumann and Bernhard Radke. 2018. Adaptive Optimization of Very Large Join Queries. In SIGMOD. ACM, 677--692.Google Scholar
- Patrick E. O'Neil, Elizabeth J. O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel Westbury. 2004. ORDPATHs: Insert-Friendly XML Node Labels. In SIGMOD Conference. ACM, 903--908.Google ScholarDigital Library
- Ravindra Pindikura. 2018. Gandiva Initiative: Improving SQL Performance by 70x. https://www.dremio.com/gandiva-performance-improvements-production-query/.Google Scholar
- Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. 1979. Access Path Selection in a Relational Database Management System. In SIGMOD Conference. ACM, 23--34.Google Scholar
- Hesam Shahrokhi and Amir Shaikhha. 2023. Building a Compiled Query Engine in Python. In CC. ACM, 180--190.Google Scholar
- Amir Shaikhha, Yannis Klonatos, Lionel Parreaux, Lewis Brown, Mohammad Dashti, and Christoph Koch. 2016. How to Architect a Query Compiler. In SIGMOD. ACM, 1907--1922.Google ScholarDigital Library
- Daniel Dominic Sleator. 2011. Submission #860934 - Codeforces. https://codeforces.com/contest/117/submission/860934.Google Scholar
- Daniel Dominic Sleator and Robert Endre Tarjan. 1983. A Data Structure for Dynamic Trees. J. Comput. Syst. Sci. 26, 3 (1983), 362--391.Google ScholarDigital Library
- Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Self-Adjusting Binary Search Trees. J. ACM 32, 3 (1985), 652--686.Google ScholarDigital Library
- Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, and Rhonda Baldwin. 2014. Orca: a modular query optimizer architecture for big data. In SIGMOD. ACM, 337--348.Google Scholar
- Ruby Y. Tahboub, Grégory M. Essertel, and Tiark Rompf. 2018. How to Architect a Query Compiler, Revisited. In SIGMOD Conference. ACM, 307--322.Google Scholar
- Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor Leis, Tobias Mühlbauer, Thomas Neumann, and Manuel Then. 2018. Get Real: How Benchmarks Fail to Represent the Real World. In DBTest@SIGMOD. ACM, 1:1--1:6.Google Scholar
- Adrian Vogelsgesang, Tobias Mühlbauer, Viktor Leis, Thomas Neumann, and Alfons Kemper. 2019. Domain Query Optimization: Adapting the General-Purpose Database System Hyper for Tableau Workloads. In BTW (LNI), Vol. P-289. Gesellschaft für Informatik, Bonn, 313--333.Google Scholar
Recommendations
Distributed Query Plan Generation using Ant Colony Optimization
Query processing is a critical performance evaluation parameter and has received a considerable amount of attention especially in the context of distributed database systems. The aim of distributed query processing is to effectively and efficiently ...
Distributed Query Plan Generation using Particle Swarm Optimization
A large number of queries are posed on databases spread across the globe. In order to process these queries efficiently, optimal query processing strategies that generate efficient query processing plans are being devised. In distributed relational ...
Comments