Abstract
A package query returns a package---a multiset of tuples---that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex respectively, that are highly accurate and orders of magnitude faster.
- Abdurro'uf and et al. 2021. The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data. arXiv:2112.02026. Google ScholarCross Ref
- V. A. Alegana, P. M. Atkinson, C. Pezzulo, A. Sorichetta, D. Weiss, T. Bird, E. Erbach-Schoenberg, and A. J. Tatem. 2015. Fine resolution mapping of population age-structures for health and development applications. Journal of The Royal Society Interface 12, 105 (April 2015), 20150073. Google ScholarCross Ref
- Timo Berthold. 2009. RENS-relaxation enforced neighborhood search. Technical Report. Zuse Institute Berlin (ZIB).Google Scholar
- Robert Bixby and Alexander Martin. 2000. Parallelizing the Dual Simplex Method. INFORMS Journal on Computing 12 (02 2000), 45--56. Google ScholarCross Ref
- Matteo Brucato, Azza Abouzied, and Alexandra Meliou. 2018. Package queries: efficient and scalable computation of high-order constraints. The VLDB Journal 27, 5 (01 Oct 2018), 693--718. Google ScholarDigital Library
- Wei Cui, Qianxi Zhang, Spyros Blanas, Jesús Camacho-Rodriguez, Brandon Haynes, Yinan Li, Ravi Ramamurthy, Peng Cheng, Rathijit Sen, and Matteo Interlandi. 2023. Query Processing on Gaming Consoles. In Proceedings of the 19th International Workshop on Data Management on New Hardware (Seattle, WA, USA) (DaMoN '23). Association for Computing Machinery, New York, NY, USA, 86--88. Google ScholarDigital Library
- Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE 5, 1 (1998), 46--55.Google ScholarDigital Library
- Raphael Finkel and Jon Bentley. 1974. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Inf. 4 (03 1974), 1--9. Google ScholarDigital Library
- Matteo Fischetti and Andrea Lodi. 2011. Heuristics in Mixed Integer Programming. Google ScholarCross Ref
- Gurobi Optimization, LLC. 2022. Gurobi Optimizer Reference Manual. https://www.gurobi.comGoogle Scholar
- Boukthir Haddar, Mahdi Khemakhem, Saïd Hanafi, and Christophe Wilbaut. 2015. A hybrid heuristic for the 0--1 Knapsack Sharing Problem. Expert Systems with Applications 42, 10 (June 2015), 4653--4666. Google ScholarDigital Library
- J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100--108. http://www.jstor.org/stable/2346830Google ScholarCross Ref
- Dong He, Supun C Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes. Proc. VLDB Endow. 15, 11 (jul 2022), 2811--2825. Google ScholarDigital Library
- Frederick S. Hillier. 1967. Introduction to Operations Research. San Francisco, Holden-Day.Google Scholar
- Q. Huangfu and J. A. J. Hall. 2018. Parallelizing the dual revised simplex method. Mathematical Programming Computation 10, 1 (01 Mar 2018), 119--142. Google ScholarCross Ref
- Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2014. Interactive Data Exploration Using Semantic Windows. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 505--516. Google ScholarDigital Library
- Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2015. Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data. Proc. VLDB Endow. 8, 10 (jun 2015), 1094--1105. Google ScholarDigital Library
- Leonard Kaufman and Peter J. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley.Google Scholar
- Bernard Knueven, James Ostrowski, and Jean-Paul Watson. 2020. On Mixed-Integer Programming Formulations for the Unit Commitment Problem. INFORMS Journal on Computing (June 2020). Google ScholarDigital Library
- C. C. N. Kuhn, G. Calbert, I. Garanovich, and T. Weir. 2023. Integer linear programming supporting portfolio design. arXiv:2303.14364 [math.OC]Google Scholar
- Xiaoqian Li and Kwan L. Yeung. 2019. Traffic Engineering in Segment Routing using MILP. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE. Google ScholarCross Ref
- Anh Mai, Matteo Brucato, Azza Abouzied, Peter J. Haas, and Alexandra Meliou. 2023. Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization. https://github.com/alm818/PackageQuery. arXiv:2307.02860 [cs.DB] Google ScholarCross Ref
- Rajesh Matai, Surya Singh, and Murari Lal Mittal. 2010. Traveling Salesman Problem: an Overview of Applications, Formulations, and Solution Approaches. In Traveling Salesman Problem, Donald Davendra (Ed.). IntechOpen, Rijeka, Chapter 1. Google ScholarCross Ref
- Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi, Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol Vinyals, and Yori Zwols. 2020. Solving Mixed Integer Programs Using Neural Networks. Google ScholarCross Ref
- Deepak Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, and Matei Zaharia. 2021. Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. CoRR abs/2110.11927 (2021). arXiv:2110.11927 https://arxiv.org/abs/2110.11927Google Scholar
- J.C. Nash. 2000. The (Dantzig) simplex method for linear programming. Computing in Science & Engineering 2, 1 (2000), 29--31. Google ScholarDigital Library
- Ravi Netravali, Vikram Nathan, James Mickens, and Hari Balakrishnan. 2018. Vesper: Measuring Time-to-Interactivity for Web Pages. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 217--231. https://www.usenix.org/conference/nsdi18/presentation/netravali-vesperGoogle Scholar
- Michael J. Panik. 1996. The Dual Simplex, Primal-Dual, and Complementary Pivot Methods. Springer US, Boston, MA, 251--288. Google ScholarCross Ref
- Maurice Roux. 2015. A comparative study of divisive hierarchical clustering algorithms. CoRR abs/1506.08977 (2015). arXiv:1506.08977 http://arxiv.org/abs/1506.08977Google Scholar
- Georg Sander and Adrian Vasiliu. 2005. Visualization and ILOG CPLEX. In Graph Drawing, János Pach (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 510--511.Google Scholar
- TPCH [n.d.]. TPC-H Decision Support Benchmark. https://www.tpc.org/tpch/.Google Scholar
- Vijay V. Vazirani. 2003. Approximation Algorithms. Springer Berlin Heidelberg, 108. Google ScholarCross Ref
- Laurynas Šikśnys and Torben Bach Pedersen. 2016. SolveDB: Integrating Optimization Problem Solvers Into SQL Databases. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (Budapest, Hungary) (SSDBM '16). Association for Computing Machinery, New York, NY, USA, Article 14, 12 pages. Google ScholarDigital Library
Recommendations
Package queries: efficient and scalable computation of high-order constraints
Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However,...
Discovering queries based on example tuples
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataAn enterprise information worker is often aware of a few example tuples (but not the entire result) that should be present in the output of the query. We study the problem of discovering the minimal project join query that contains the given example ...
Scalable package queries in relational database systems
Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However,...
Comments