research-article

High-Dimensional Data Cubes

Authors:
Sachin Basil John, École Polytechnique Fédérale de Lausanne (EPFL)
Christoph Koch, École Polytechnique Fédérale de Lausanne (EPFL)

Proceedings of the VLDB Endowment, Volume 15, Issue 13, pp. 3828–3840
https://doi.org/10.14778/3565838.3565839

Published: 01 September 2022

Abstract

This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information can be quickly reconstructed using statistical or linear programming techniques. This enables new applications such as exploratory data analysis for feature engineering and other fields of data science. Moreover, it removes the need to compromise when building a data cube: all columns that we might ever wish to use can be included as dimensions. Compared to traditional data cubes, our approach also speeds up certain dice, roll-up, and drill-down operations on data cubes with hierarchical dimensions.
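
As a rough illustration of partial materialization followed by reconstruction, the sketch below builds a tiny binary data cube in Python, materializes only its three two-dimensional cuboids, and estimates the unmaterialized three-dimensional cuboid from them. The reconstruction step uses iterative proportional fitting, a standard statistical technique chosen here purely for illustration; it is an assumption and not necessarily the method used in the paper. The example counts and the function names `cuboid`, `project`, and `reconstruct` are likewise hypothetical.

```python
from collections import Counter
from itertools import product


def cuboid(rows, dims):
    """COUNT group-by of the rows over the binary dimensions in `dims`."""
    return Counter(tuple(row[d] for d in dims) for row in rows)


def project(table, dims):
    """Sum a table over all dimensions down to the dimensions in `dims`."""
    out = Counter()
    for cell, value in table.items():
        out[tuple(cell[d] for d in dims)] += value
    return out


def reconstruct(n_dims, materialized, sweeps=200):
    """Estimate the full cuboid from materialized projections.

    Assumption: iterative proportional fitting (IPF) stands in for the
    statistical solver; the paper's actual technique may differ.
    """
    total = sum(next(iter(materialized.values())).values())
    cells = list(product((0, 1), repeat=n_dims))
    est = {cell: total / len(cells) for cell in cells}  # uniform start
    for _ in range(sweeps):
        for dims, target in materialized.items():
            current = project(est, dims)
            for cell in cells:
                key = tuple(cell[d] for d in dims)
                if current[key] > 0:
                    # Rescale so this projection matches the stored cuboid.
                    est[cell] *= target[key] / current[key]
    return est


# Hypothetical fact table with three binary dimensions (16 rows in total).
counts = {
    (0, 0, 0): 3, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 2,
    (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 1, (1, 1, 1): 4,
}
rows = [cell for cell, n in counts.items() for _ in range(n)]

# Partial materialization: store only the three 2-dimensional cuboids.
materialized = {dims: cuboid(rows, dims) for dims in [(0, 1), (1, 2), (0, 2)]}

# Reconstruct an estimate of the 3-dimensional cuboid that was not stored.
estimate = reconstruct(3, materialized)
```

The estimate agrees with every materialized cuboid, but it is only one of many cubes consistent with them and will generally differ from the true counts.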

Published in
Proceedings of the VLDB Endowment, Volume 15, Issue 13
September 2022
278 pages
ISSN: 2150-8097
Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)
Publisher: VLDB Endowment
Badges: Artifacts Available / v1.1
