Abstract
This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information can be quickly reconstructed using statistical or linear programming techniques. This enables new applications such as exploratory data analysis for feature engineering and other fields of data science. Moreover, it removes the need to compromise when building a data cube: all columns that we might ever wish to use can be included as dimensions. Compared to traditional data cubes, our approach also speeds up certain dice, roll-up, and drill-down operations on data cubes with hierarchical dimensions.
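To make the reconstruction idea concrete, here is a minimal sketch of the smallest possible case: two binary dimensions A and B for which only the total row count and the two one-dimensional marginals are materialized. It is an illustration under stated assumptions, not the paper's algorithm. The function names and the example counts are hypothetical. The closed-form interval stands in for what a linear program over the four unknown cells would return in this two-dimensional case, and the brute-force enumeration is included only to check it.

```python
def joint_count_bounds(total, count_a, count_b):
    """Bounds on count(A=1, B=1) given only the total and the 1-D marginals.

    Assumption: all three inputs are exact counts from the same table,
    with 0 <= count_a <= total and 0 <= count_b <= total.
    """
    lower = max(0, count_a + count_b - total)
    upper = min(count_a, count_b)
    return lower, upper


def brute_force_bounds(total, count_a, count_b):
    """Enumerate every non-negative 2x2 table consistent with the marginals."""
    feasible = []
    for x11 in range(total + 1):
        x10 = count_a - x11                # rows with A=1, B=0
        x01 = count_b - x11                # rows with A=0, B=1
        x00 = total - x11 - x10 - x01      # rows with A=0, B=0
        if min(x10, x01, x00) >= 0:
            feasible.append(x11)
    return min(feasible), max(feasible)


if __name__ == "__main__":
    # Hypothetical counts: 100 rows, 70 with A=1, 60 with B=1.
    print(joint_count_bounds(100, 70, 60))   # (30, 60)
    print(brute_force_bounds(100, 70, 60))   # (30, 60)
```

When the interval is narrow, the unmaterialized cell is effectively known. When it is wide, materializing more cuboids or applying a statistical estimate would be needed to narrow it.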