Abstract
While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, it maintains concise CaRTs that predict these values within the prescribed error bounds. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach: SPARTAN consistently yields substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while using only small data samples for model inference.
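To make the core idea concrete, the following is a minimal, hypothetical sketch (not SPARTAN's actual code) of guaranteed-error, model-based column compression. A one-split "regression stump" stands in for a full CaRT: instead of storing a numeric column `y` explicitly, we store the stump's three parameters plus explicit values for the few rows the stump misses by more than the user's error tolerance, so decompression is always within the prescribed bound. All function names and the toy data are assumptions for illustration.

```python
def fit_stump(x, y):
    """Pick the split on predictor column x that minimizes total squared
    error on target column y; return (threshold, left_mean, right_mean).
    This is the depth-1 special case of a regression tree (CaRT)."""
    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue  # degenerate split: all rows on one side
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((yi - lm) ** 2 for yi in left)
               + sum((yi - rm) ** 2 for yi in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

def compress_column(x, y, tol):
    """Return (model, outliers): the stump parameters plus explicit values
    for every row where the stump's prediction misses y by more than tol.
    Storing outliers verbatim is what makes the error bound a guarantee."""
    t, lm, rm = fit_stump(x, y)
    outliers = {i: yi for i, (xi, yi) in enumerate(zip(x, y))
                if abs((lm if xi <= t else rm) - yi) > tol}
    return (t, lm, rm), outliers

def decompress_column(x, model, outliers):
    """Rebuild the column: outlier rows come back exactly, the rest are
    the stump's predictions (within tol of the originals)."""
    t, lm, rm = model
    return [outliers.get(i, lm if xi <= t else rm) for i, xi in enumerate(x)]

# Toy example: y is strongly correlated with x, so three stump parameters
# (plus any outliers) replace eight stored values.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [5.0, 5.1, 4.9, 5.0, 20.0, 19.9, 20.1, 20.0]
model, outliers = compress_column(x, y, tol=0.5)
restored = decompress_column(x, model, outliers)
assert all(abs(a - b) <= 0.5 for a, b in zip(restored, y))
```

SPARTAN's actual contribution lies in the parts this sketch elides: choosing *which* subset of columns to predict (a combinatorial optimization over candidate predictor sets) and building full multi-level CaRTs efficiently from small samples rather than a single stump.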