skip to main content
research-article
Artifacts Available / v1.1

Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization

Published:02 May 2024Publication History
Skip Abstract Section

Abstract

A package query returns a package---a multiset of tuples---that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex respectively, that are highly accurate and orders of magnitude faster.

References

  1. Abdurro'uf and et al. 2021. The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data. arXiv:2112.02026. Google ScholarGoogle ScholarCross RefCross Ref
  2. V. A. Alegana, P. M. Atkinson, C. Pezzulo, A. Sorichetta, D. Weiss, T. Bird, E. Erbach-Schoenberg, and A. J. Tatem. 2015. Fine resolution mapping of population age-structures for health and development applications. Journal of The Royal Society Interface 12, 105 (April 2015), 20150073. Google ScholarGoogle ScholarCross RefCross Ref
  3. Timo Berthold. 2009. RENS-relaxation enforced neighborhood search. Technical Report. Zuse Institute Berlin (ZIB).Google ScholarGoogle Scholar
  4. Robert Bixby and Alexander Martin. 2000. Parallelizing the Dual Simplex Method. INFORMS Journal on Computing 12 (02 2000), 45--56. Google ScholarGoogle ScholarCross RefCross Ref
  5. Matteo Brucato, Azza Abouzied, and Alexandra Meliou. 2018. Package queries: efficient and scalable computation of high-order constraints. The VLDB Journal 27, 5 (01 Oct 2018), 693--718. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Wei Cui, Qianxi Zhang, Spyros Blanas, Jesús Camacho-Rodriguez, Brandon Haynes, Yinan Li, Ravi Ramamurthy, Peng Cheng, Rathijit Sen, and Matteo Interlandi. 2023. Query Processing on Gaming Consoles. In Proceedings of the 19th International Workshop on Data Management on New Hardware (Seattle, WA, USA) (DaMoN '23). Association for Computing Machinery, New York, NY, USA, 86--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE 5, 1 (1998), 46--55.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Raphael Finkel and Jon Bentley. 1974. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Inf. 4 (03 1974), 1--9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Matteo Fischetti and Andrea Lodi. 2011. Heuristics in Mixed Integer Programming. Google ScholarGoogle ScholarCross RefCross Ref
  10. Gurobi Optimization, LLC. 2022. Gurobi Optimizer Reference Manual. https://www.gurobi.comGoogle ScholarGoogle Scholar
  11. Boukthir Haddar, Mahdi Khemakhem, Saïd Hanafi, and Christophe Wilbaut. 2015. A hybrid heuristic for the 0--1 Knapsack Sharing Problem. Expert Systems with Applications 42, 10 (June 2015), 4653--4666. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100--108. http://www.jstor.org/stable/2346830Google ScholarGoogle ScholarCross RefCross Ref
  13. Dong He, Supun C Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes. Proc. VLDB Endow. 15, 11 (jul 2022), 2811--2825. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Frederick S. Hillier. 1967. Introduction to Operations Research. San Francisco, Holden-Day.Google ScholarGoogle Scholar
  15. Q. Huangfu and J. A. J. Hall. 2018. Parallelizing the dual revised simplex method. Mathematical Programming Computation 10, 1 (01 Mar 2018), 119--142. Google ScholarGoogle ScholarCross RefCross Ref
  16. Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2014. Interactive Data Exploration Using Semantic Windows. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 505--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2015. Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data. Proc. VLDB Endow. 8, 10 (jun 2015), 1094--1105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Leonard Kaufman and Peter J. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley.Google ScholarGoogle Scholar
  19. Bernard Knueven, James Ostrowski, and Jean-Paul Watson. 2020. On Mixed-Integer Programming Formulations for the Unit Commitment Problem. INFORMS Journal on Computing (June 2020). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. C. N. Kuhn, G. Calbert, I. Garanovich, and T. Weir. 2023. Integer linear programming supporting portfolio design. arXiv:2303.14364 [math.OC]Google ScholarGoogle Scholar
  21. Xiaoqian Li and Kwan L. Yeung. 2019. Traffic Engineering in Segment Routing using MILP. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE. Google ScholarGoogle ScholarCross RefCross Ref
  22. Anh Mai, Matteo Brucato, Azza Abouzied, Peter J. Haas, and Alexandra Meliou. 2023. Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization. https://github.com/alm818/PackageQuery. arXiv:2307.02860 [cs.DB] Google ScholarGoogle ScholarCross RefCross Ref
  23. Rajesh Matai, Surya Singh, and Murari Lal Mittal. 2010. Traveling Salesman Problem: an Overview of Applications, Formulations, and Solution Approaches. In Traveling Salesman Problem, Donald Davendra (Ed.). IntechOpen, Rijeka, Chapter 1. Google ScholarGoogle ScholarCross RefCross Ref
  24. Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi, Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol Vinyals, and Yori Zwols. 2020. Solving Mixed Integer Programs Using Neural Networks. Google ScholarGoogle ScholarCross RefCross Ref
  25. Deepak Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, and Matei Zaharia. 2021. Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP. CoRR abs/2110.11927 (2021). arXiv:2110.11927 https://arxiv.org/abs/2110.11927Google ScholarGoogle Scholar
  26. J.C. Nash. 2000. The (Dantzig) simplex method for linear programming. Computing in Science & Engineering 2, 1 (2000), 29--31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Ravi Netravali, Vikram Nathan, James Mickens, and Hari Balakrishnan. 2018. Vesper: Measuring Time-to-Interactivity for Web Pages. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 217--231. https://www.usenix.org/conference/nsdi18/presentation/netravali-vesperGoogle ScholarGoogle Scholar
  28. Michael J. Panik. 1996. The Dual Simplex, Primal-Dual, and Complementary Pivot Methods. Springer US, Boston, MA, 251--288. Google ScholarGoogle ScholarCross RefCross Ref
  29. Maurice Roux. 2015. A comparative study of divisive hierarchical clustering algorithms. CoRR abs/1506.08977 (2015). arXiv:1506.08977 http://arxiv.org/abs/1506.08977Google ScholarGoogle Scholar
  30. Georg Sander and Adrian Vasiliu. 2005. Visualization and ILOG CPLEX. In Graph Drawing, János Pach (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 510--511.Google ScholarGoogle Scholar
  31. TPCH [n.d.]. TPC-H Decision Support Benchmark. https://www.tpc.org/tpch/.Google ScholarGoogle Scholar
  32. Vijay V. Vazirani. 2003. Approximation Algorithms. Springer Berlin Heidelberg, 108. Google ScholarGoogle ScholarCross RefCross Ref
  33. Laurynas Šikśnys and Torben Bach Pedersen. 2016. SolveDB: Integrating Optimization Problem Solvers Into SQL Databases. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (Budapest, Hungary) (SSDBM '16). Association for Computing Machinery, New York, NY, USA, Article 14, 12 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 17, Issue 5
    January 2024
    233 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 2 May 2024
    Published in pvldb Volume 17, Issue 5

    Check for updates

    Qualifiers

    • research-article
  • Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)3

    Other Metrics

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader