research-article

Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization

Authors:
Anh L. Mai (New York University Abu Dhabi)
Pengyu Wang (New York University Abu Dhabi)
Azza Abouzied (New York University Abu Dhabi)
Matteo Brucato (Microsoft Research)
Peter J. Haas (University of Massachusetts Amherst)
Alexandra Meliou (University of Massachusetts Amherst)

Proceedings of the VLDB Endowment, Volume 17, Issue 5, pp. 1146–1158. https://doi.org/10.14778/3641204.3641222

Published: 02 May 2024

Abstract

A package query returns a package (a multiset of tuples) that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex respectively, that are highly accurate and orders of magnitude faster.
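To make the link between package queries and ILPs concrete, the following is a minimal sketch that treats each tuple's multiplicity as an integer decision variable and enumerates all assignments by brute force. It is not Progressive Shading, SketchRefine, or any solver from the paper; it only illustrates the problem being solved, and it is feasible only for a handful of tuples. The function name `solve_package_query`, the constraint encoding, and the `meals` data are all hypothetical and chosen purely for illustration.

```python
from itertools import product


def solve_package_query(tuples, objective, constraints,
                        max_multiplicity=1, maximize=True):
    """Brute-force a tiny package query (illustrative only).

    tuples:       list of dicts, one per tuple in the relation
    objective:    attribute whose sum over the package is optimized
    constraints:  list of (attribute, lower, upper) bounds on the package-level
                  sum; attribute None means "number of tuples in the package"
    Returns (best_value, counts), where counts[i] is the multiplicity of
    tuples[i], or (None, None) if no package satisfies the constraints.
    """
    best_value, best_counts = None, None
    # Each candidate package is one assignment of integer multiplicities,
    # which is exactly the decision vector of the equivalent ILP.
    for counts in product(range(max_multiplicity + 1), repeat=len(tuples)):
        feasible = True
        for attr, lower, upper in constraints:
            total = sum(c * (1 if attr is None else t[attr])
                        for c, t in zip(counts, tuples))
            if not lower <= total <= upper:
                feasible = False
                break
        if not feasible:
            continue
        value = sum(c * t[objective] for c, t in zip(counts, tuples))
        better = (best_value is None
                  or (value > best_value if maximize else value < best_value))
        if better:
            best_value, best_counts = value, counts
    return best_value, best_counts


# Hypothetical relation: made-up meals with calories and protein.
meals = [
    {"name": "oatmeal", "kcal": 300, "protein": 10},
    {"name": "chicken", "kcal": 600, "protein": 50},
    {"name": "pasta", "kcal": 800, "protein": 25},
    {"name": "salad", "kcal": 200, "protein": 5},
    {"name": "steak", "kcal": 900, "protein": 60},
]

# "Pick exactly 3 meals with at most 2000 kcal in total, maximizing protein."
value, counts = solve_package_query(
    meals, "protein", [(None, 3, 3), ("kcal", 0, 2000)])
chosen = [m["name"] for m, c in zip(meals, counts) for _ in range(c)]
print(value, chosen)
```

On this toy input the best package without repetition is oatmeal, chicken, and steak, with 120 grams of protein. The enumeration visits (max_multiplicity + 1)^n candidate packages for n tuples, which is why an approach of this kind cannot scale and why partitioning and specialized solvers matter at the sizes the abstract describes.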

Published in
Proceedings of the VLDB Endowment, Volume 17, Issue 5
January 2024, 233 pages
ISSN: 2150-8097
Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)
Publisher: VLDB Endowment
Badges: Artifacts Available / v1.1