research-article
Artifacts Available / v1.1

High-Dimensional Data Cubes

Published: 01 September 2022

Abstract

This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information can be quickly reconstructed using statistical or linear programming techniques. This enables new applications such as exploratory data analysis for feature engineering and other fields of data science. Moreover, it removes the need to compromise when building a data cube: all columns that we might ever wish to use can be included as dimensions. Our approach also speeds up certain dice, roll-up, and drill-down operations on data cubes with hierarchical dimensions compared to traditional data cubes.
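To make the idea of partial materialization concrete, here is a minimal sketch and not the paper's implementation. It assumes a binary cuboid is stored as a dictionary from bit tuples to counts; the names `project` and `frechet_bounds` and the toy data are hypothetical. A query over a subset of a materialized cuboid's dimensions is answered exactly by summation. For a cuboid that was not materialized, the sketch uses the closed-form Fréchet bounds, which coincide with the optimum of the corresponding two-dimensional linear program, as a stand-in for a general linear-programming solver.

```python
from itertools import product


def project(cuboid, dims, keep):
    """Roll up a binary cuboid onto a subset of its dimensions by summing counts."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for bits, count in cuboid.items():
        sub = tuple(bits[i] for i in idx)
        out[sub] = out.get(sub, 0) + count
    return out


def frechet_bounds(total, count_a, count_b):
    """Bounds on count(A=1, B=1) when only the total and both 1-D marginals are known."""
    lower = max(0, count_a + count_b - total)
    upper = min(count_a, count_b)
    return lower, upper


# Hypothetical toy data: a materialized cuboid over binary dimensions (A, B, C).
dims = ("A", "B", "C")
counts = [5, 1, 2, 0, 3, 4, 0, 5]
abc = dict(zip(product((0, 1), repeat=3), counts))

# Exact answer for a query on (A, C), obtained from the materialized (A, B, C) cuboid.
ac = project(abc, dims, ("A", "C"))

# Pretend (A, B) was never materialized and only the 1-D marginals survive.
total = sum(abc.values())
count_a = project(abc, dims, ("A",))[(1,)]
count_b = project(abc, dims, ("B",))[(1,)]
bounds = frechet_bounds(total, count_a, count_b)

# The true value, computed here only to check that it falls inside the bounds.
true_ab = project(abc, dims, ("A", "B"))[(1, 1)]
```

In this toy case the bounds only bracket the true count. With more overlapping cuboids materialized, the feasible interval generally narrows, which is the situation where a statistical or linear-programming reconstruction becomes useful.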

