SPARTAN: a model-based semantic compression system for massive data tables

Authors:
Shivnath Babu (Stanford University and Bell Laboratories), Minos Garofalakis (Bell Laboratories), Rajeev Rastogi (Bell Laboratories)

ACM SIGMOD Record, Volume 30, Issue 2, June 2001, pp. 283–294. https://doi.org/10.1145/376284.375693

Published: 01 May 2001

Abstract

While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach — SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference.
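
The core idea in the abstract, replacing a stored column with a small tree model that predicts it within a prescribed error tolerance, can be illustrated with a minimal Python sketch. This is not the SPARTAN implementation: the names `Node`, `build_tree`, `compress`, and `decompress` are hypothetical, the split rule (minimize the larger child target range) is a simplification chosen for brevity, and storing out-of-tolerance rows explicitly as outliers is an assumption made here so that the error bound holds for every row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float = 0.0                 # prediction if this node is a leaf
    threshold: Optional[float] = None  # split point on the predictor; None for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def spread(pairs):
    """Half the range of the target values: worst-case error of a midrange leaf."""
    ys = [y for _, y in pairs]
    return (max(ys) - min(ys)) / 2


def build_tree(pairs, tol, max_depth):
    """Build a small regression tree over (predictor, target) pairs."""
    ys = [y for _, y in pairs]
    node = Node(value=(min(ys) + max(ys)) / 2)
    xs = sorted({x for x, _ in pairs})
    # Stop when the leaf is already within tolerance, or no split is possible.
    if spread(pairs) <= tol or max_depth == 0 or len(xs) < 2:
        return node
    best = None
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [p for p in pairs if p[0] <= t]
        right = [p for p in pairs if p[0] > t]
        cost = max(spread(left), spread(right))
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    node.threshold = t
    node.left = build_tree(left, tol, max_depth - 1)
    node.right = build_tree(right, tol, max_depth - 1)
    return node


def predict(node, x):
    """Walk from the root to a leaf and return its stored value."""
    while node.threshold is not None:
        node = node.left if x <= node.threshold else node.right
    return node.value


def compress(xs, ys, tol, max_depth=3):
    """Return (tree, outliers); the target column itself is not stored."""
    if not xs:
        return Node(), {}
    tree = build_tree(list(zip(xs, ys)), tol, max_depth)
    # Assumption: rows the tree cannot predict within tol are kept verbatim.
    outliers = {i: y for i, (x, y) in enumerate(zip(xs, ys))
                if abs(predict(tree, x) - y) > tol}
    return tree, outliers


def decompress(xs, tree, outliers):
    """Rebuild the target column from the predictor column and the model."""
    return [outliers.get(i, predict(tree, x)) for i, x in enumerate(xs)]
```

With this sketch, `compress(xs, ys, tol)` followed by `decompress(xs, tree, outliers)` yields a column whose every value is within `tol` of the original. The compression benefit depends on the tree and the outlier set together being smaller than the column they replace.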

Index Terms

SPARTAN: a model-based semantic compression system for massive data tables
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
    2. Database design and models
2. Theory of computation
  1. Semantics and reasoning
    1. Program semantics

Published in
ACM SIGMOD Record, Volume 30, Issue 2
June 2001, 625 pages
ISSN: 0163-5808
DOI: 10.1145/376284
Editors: Timos Sellis (National Technical Univ. of Athens), Sharad Mehrotra (Univ. of California at Irvine)

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001, 630 pages
ISBN: 1581133324
DOI: 10.1145/375663
Editors: Timos Sellis, Sharad Mehrotra
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- article