Scalable distributed data cube computation for large-scale multidimensional data analysis on a Spark cluster

Abstract

A data cube is a powerful analytical tool that stores all aggregate values over a set of dimensions, providing users with a simple and efficient means of performing complex data analysis and supporting decision making. However, building a data cube is computationally expensive, so efficient methods for reducing its computation time are needed. Previous works have developed various algorithms for efficiently generating data cubes on MapReduce, a large-scale distributed parallel processing framework. MapReduce, however, incurs substantial disk I/O and network traffic overhead. To overcome these limitations, Spark was recently proposed as a memory-based distributed parallel processing framework and has attracted considerable research attention owing to its high performance. In this paper, we propose two algorithms that fully leverage Spark’s mechanisms and properties to build data cubes efficiently: Resilient Distributed Top-Down Computation (RDTDC) and Resilient Distributed Bottom-Up Computation (RDBUC). The former computes the components of a data cube (i.e., cuboids) in a top-down manner; the latter follows a bottom-up approach. The RDTDC algorithm has three key features. (1) During top-down computation, it estimates the size of each cuboid from dimension cardinalities without issuing additional Spark actions, so that each cuboid can be computed from a smaller parent cuboid. (2) It creates an execution plan optimized to take the smaller parent cuboid as input. (3) It reuses the results of already computed cuboids and computes cuboids of several dimension combinations simultaneously. In addition, we propose RDBUC, a Spark-based bottom-up algorithm of the kind widely used for computing iceberg cubes, which retain only the cells satisfying a minimum-support condition. This algorithm incorporates two primary strategies: (1) it reduces the input size for computing the aggregate values of a dimension combination (e.g., A, B, and C) by removing input records that do not satisfy the iceberg condition at a previously computed lower dimension combination (e.g., A and B), and (2) it uses a lazy materialization strategy that computes every dimension combination using only transformation operations and stores the results with a single action operation. To demonstrate the efficiency of the proposed algorithms, we conducted extensive experiments comparing them with the cube() function, the built-in cube computation operator of Spark SQL. The results show that the proposed RDTDC and RDBUC algorithms outperform Spark SQL cube().
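
The lazy materialization strategy described above can be illustrated with a short, self-contained sketch on Spark’s RDD API in Scala. This is an assumed illustration, not the paper’s RDTDC or RDBUC implementation: the object name LazyCubeSketch, the toy three-dimensional input, the '*' marker for aggregated-out dimensions, the minSupport threshold, and the output directory are all hypothetical. The sketch builds every cuboid using only transformations (map, reduceByKey, filter) and materializes all of them with a single action (saveAsTextFile); it omits the cardinality-based parent selection of RDTDC and the input-pruning step of RDBUC.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical, self-contained sketch: names, data, and paths are illustrative only.
object LazyCubeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-cube-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy input: three string dimensions (A, B, C) plus a count measure of 1 per record.
    val input: RDD[(Array[String], Long)] = sc.parallelize(Seq(
      (Array("a1", "b1", "c1"), 1L),
      (Array("a1", "b2", "c1"), 1L),
      (Array("a2", "b1", "c2"), 1L)
    ))

    val numDims = 3
    val minSupport = 1L // iceberg condition: keep only cells with count >= minSupport

    // Enumerate every grouping set (2^numDims dimension combinations).
    val groupingSets: Seq[Seq[Int]] =
      (0 to numDims).flatMap(k => (0 until numDims).combinations(k).map(_.toSeq))

    // Build each cuboid with transformations only (map, reduceByKey, filter).
    // '*' marks a dimension that is aggregated out. No action is triggered here.
    val cuboids: Seq[RDD[(String, Long)]] = groupingSets.map { dims =>
      input
        .map { case (values, measure) =>
          val key = (0 until numDims)
            .map(i => if (dims.contains(i)) values(i) else "*")
            .mkString(",")
          (key, measure)
        }
        .reduceByKey(_ + _)
        .filter { case (_, count) => count >= minSupport } // iceberg-style cell filter
    }

    // Lazy materialization: union all cuboids and store them with a single action.
    sc.union(cuboids).saveAsTextFile("cube-output") // hypothetical output directory

    spark.stop()
  }
}
```

For comparison, the Spark SQL baseline mentioned above corresponds to a DataFrame call such as df.cube("A", "B", "C").count(). Deferring all cuboid computation to one action lets Spark schedule the transformations of every cuboid within a single job, which is the behavior the lazy materialization strategy relies on.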

Acknowledgements

This research was supported by a grant (17AUDP-B070719-05) from the Architecture & Urban Development Research Program funded by the Ministry of Land, Infrastructure and Transport of the Korean government, and by the Industrial Technology Innovation Program (Project No. 10052797) through the Korea Evaluation Institute of Industrial Technology (KEIT) funded by the Ministry of Trade, Industry and Energy.

Author information

Corresponding author

Correspondence to Suan Lee.

Cite this article

Lee, S., Kang, S., Kim, J. et al. Scalable distributed data cube computation for large-scale multidimensional data analysis on a Spark cluster. Cluster Comput 22 (Suppl 1), 2063–2087 (2019). https://doi.org/10.1007/s10586-018-1811-1
