SPARTAN: a model-based semantic compression system for massive data tables

Published: 01 May 2001

Abstract

While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach — SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference.
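The core idea in the abstract, replacing a stored column with a compact tree that predicts it within a prescribed error tolerance, can be illustrated with a minimal sketch. This is not the SPARTAN implementation: it assumes a single numeric predictor column, uses a naive greedy split rule, and stores any rows the tree cannot predict within tolerance as explicit outliers (an assumption of this sketch). All names here (`build_tree`, `predict`, `compress`, `tol`, `load`, `latency`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    value: float  # constant prediction for every row routed to this leaf


@dataclass
class Split:
    threshold: float  # rows with x <= threshold go left, others go right
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def build_tree(xs: List[float], ys: List[float], tol: float) -> Node:
    """Greedy one-predictor regression tree; a leaf is final once it meets tol."""
    lo, hi = min(ys), max(ys)
    distinct = sorted(set(xs))
    # The midrange predicts every y in this node with error at most (hi - lo) / 2.
    # Stop when that is within tolerance, or when no further split is possible.
    if (hi - lo) / 2.0 <= tol or len(distinct) < 2:
        return Leaf((lo + hi) / 2.0)
    best_cost, best_t = None, None
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Cost of a split: the worse half-range of its two children.
        cost = max(max(left) - min(left), max(right) - min(right)) / 2.0
        if best_cost is None or cost < best_cost:
            best_cost, best_t = cost, t
    l_rows = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    r_rows = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    return Split(
        best_t,
        build_tree([x for x, _ in l_rows], [y for _, y in l_rows], tol),
        build_tree([x for x, _ in r_rows], [y for _, y in r_rows], tol),
    )


def predict(node: Node, x: float) -> float:
    """Route x down the tree and return the leaf's constant value."""
    while isinstance(node, Split):
        node = node.left if x <= node.threshold else node.right
    return node.value


def compress(xs: List[float], ys: List[float], tol: float) -> Tuple[Node, Dict[int, float]]:
    """Return the tree plus the rows it cannot predict within tol (kept verbatim)."""
    tree = build_tree(xs, ys, tol)
    outliers = {i: y for i, (x, y) in enumerate(zip(xs, ys))
                if abs(predict(tree, x) - y) > tol}
    return tree, outliers


def count_leaves(node: Node) -> int:
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


# Hypothetical table: the 'latency' column is dropped and predicted from 'load'.
load = [1, 2, 3, 4, 5, 6, 7, 8]
latency = [10.0, 10.4, 9.8, 10.1, 20.2, 19.9, 20.0, 20.3]
tree, outliers = compress(load, latency, tol=0.5)
print(count_leaves(tree), outliers)
```

On this toy table the eight `latency` values collapse to a two-leaf tree with no outliers, so the compressed form stores one threshold and two constants instead of the column. A real system would additionally have to decide which columns to predict and from which predictors, which is the search problem the abstract attributes to SPARTAN's optimization algorithms.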

Published in

• ACM SIGMOD Record, Volume 30, Issue 2, June 2001, 625 pages. ISSN: 0163-5808. DOI: 10.1145/376284
• SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, May 2001, 630 pages. ISBN: 1581133324. DOI: 10.1145/375663

Copyright © 2001 ACM

Publisher: Association for Computing Machinery, New York, NY, United States