ABSTRACT
In this paper, we investigate the use of low-cost PC clusters to parallelize the computation of iceberg-cube queries. We concentrate on techniques directed towards online querying of large, high-dimensional datasets where it is assumed that the total cube has not been precomputed. The algorithmic space we explore considers trade-offs between parallelism, computation and I/O. Our main contribution is the development and comprehensive evaluation of various novel parallel algorithms. Specifically: (1) Algorithm RP is a straightforward parallel version of BUC [BR99]; (2) Algorithm BPP attempts to reduce I/O by outputting results in a more efficient way; (3) Algorithm ASL, which maintains the cells of a cuboid in a skiplist, is designed to put the utmost priority on load balancing; and (4) alternatively, Algorithm PT load-balances by using binary partitioning to divide the cube lattice as evenly as possible.
We present a thorough performance evaluation of all these algorithms on a variety of parameters, including the dimensionality of the cube, the sparseness of the cube, the selectivity of the constraints, the number of processors, and the size of the dataset. A key finding is that it is not a one-algorithm-fits-all situation. We recommend a “recipe” which uses PT as the default algorithm, but may also deploy ASL under specific circumstances.
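To make the notion of an iceberg-cube query concrete, the following is a minimal single-process sketch in the spirit of bottom-up computation as in BUC [BR99]. It keeps only those group-by cells whose count reaches a minimum support threshold, and it stops refining a partition as soon as that partition falls below the threshold. It does not reproduce RP, BPP, ASL, or PT. The function name `iceberg_cube`, its parameters, and the use of `None` to mark an aggregated dimension are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def iceberg_cube(rows, num_dims, min_support):
    """Return {cell: count} for every cell with count >= min_support.

    A cell is a tuple of length num_dims. None in a position means that
    dimension is aggregated away (an assumed encoding for "ALL").
    """
    results = {}

    def expand(partition, cell, start_dim):
        # `partition` holds exactly the rows that match `cell`.
        results[cell] = len(partition)
        # Refine the cell on each remaining dimension, in a fixed order,
        # so that every cell is reached along exactly one path.
        for d in range(start_dim, num_dims):
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                # Count is monotone: a group below the threshold cannot
                # contain any qualifying finer cell, so it is pruned.
                if len(group) >= min_support:
                    child = cell[:d] + (value,) + cell[d + 1:]
                    expand(group, child, d + 1)

    if len(rows) >= min_support:
        expand(rows, (None,) * num_dims, 0)
    return results


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
    for cell, count in sorted(iceberg_cube(data, 2, 2).items(), key=str):
        print(cell, count)
```

The pruning step is what distinguishes an iceberg-cube computation from materializing the full cube: with a threshold of 2 on the four sample rows, the cells for `b` and `y` are never refined further.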
- 1. R. Agrawal, S. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 VLDB, pp. 506-521.
- 2. E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 1997 VLDB, pp. 98-112.
- 3. K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACM SIGMOD, pp. 359-370.
- 4. M. Eberl, W. Karl, C. Trinitis, and A. Blaszczyk. Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications. In Proc. 6th European Parallel Virtual Machine/Message Passing Interface Conference, LNCS vol. 1697, pp. 493-498, 1999.
- 5. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani and J. Ullman. Computing iceberg queries efficiently. In Proc. 1998 VLDB, pp. 299-310.
- 6. S. Goil and A. Choudhary. High Performance OLAP and Data Mining on Parallel Computers. In The Journal of Data Mining and Knowledge Discovery, 1, 4, pp. 391-418, 1997.
- 7. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Datacube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. In Proc. 1996 ICDE, pp. 152-159.
- 8. H. Gupta, V. Harinarayan, A. Rajaraman and J. Ullman. Index selection for OLAP. In Proc. 1997 ICDE, pp. 208-219.
- 9. V. Harinarayan, A. Rajaraman and J. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM SIGMOD, pp. 205-216.
- 10. J. Hellerstein, J. Haas and H. Wang. Online Aggregation. In Proc. 1997 SIGMOD, pp. 171-182.
- 11. M. Kamber, J. Han and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 KDD, pp. 207-210.
- 12. R.T. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. 1998 SIGMOD, pp. 13-24.
- 13. K. Ross and D. Srivastava. Fast Computation of Sparse Datacubes. In Proc. 1997 VLDB, pp. 116-125.
- 14. S. Sarawagi. Explaining differences in multidimensional aggregates. In Proc. 1999 VLDB, pp. 42-53.
- 15. A. Shukla, P. Deshpande and J. Naughton. Materialized view selection for multidimensional datasets. In Proc. 1998 VLDB, pp. 488-499.
- 16. A. Srivastava, E. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithm. In The Journal of Data Mining and Knowledge Discovery, 3, 3, pp. 237-262, 1999.
- 17. M. Tamura and M. Kitsuregawa. Dynamic Load Balance for Parallel Association Rule Mining on Heterogeneous PC Cluster System. In Proc. 1999 VLDB, pp. 162-173.
- 18. M. Zaki. Parallel and distributed association mining: a survey. In IEEE Concurrency, 7, 4, pp. 14-25, 1999.
- 19. YiHong Zhao, Prasad Deshpande, and Jeffrey F. Naughton. An Array-based algorithm for simultaneous Multidimensional aggregates. SIGMOD Conference 1997, pp. 159-170.
Index Terms
- Iceberg-cube computation with PC clusters