New sampling-based summary statistics for improving approximate query answers

Authors:
Phillip B. Gibbons, Information Sciences Research Center, Bell Laboratories
Yossi Matias, Department of Computer Science, Tel-Aviv University

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, June 1998, Pages 331–342. https://doi.org/10.1145/276304.276334

Published: 01 June 1998

ABSTRACT

In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. Before DBMSs providing highly-accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
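
The advantage over a standard sample view can be illustrated with a minimal sketch. It assumes that a concise sample stores each value occurring more than once as a single (value, count) pair, and that a pair costs two words of footprint while a value seen once costs one. The function names and the cost model are illustrative assumptions, not the paper's implementation, and the sketch does not cover incremental maintenance.

```python
from collections import Counter


def concise_sample(sample):
    """Collapse a plain sample (a list of values) into value -> count form.

    Assumption: repeated values are stored once together with their count.
    """
    return dict(Counter(sample))


def footprint(concise):
    """Assumed cost model: 1 word for a value seen once, 2 words for a
    (value, count) pair."""
    return sum(1 if count == 1 else 2 for count in concise.values())


def sample_points(concise):
    """Number of sample points the concise representation stands for."""
    return sum(concise.values())
```

For the plain sample `[5, 5, 5, 7, 9, 9]`, a standard sample view needs six words, while the concise form represents the same six sample points in five words under this cost model. The more skewed the data, the more sample points fit in a fixed footprint.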

Published in
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998, 599 pages
ISBN: 0897919955
DOI: 10.1145/276304
Chairmen: Laura Haas (IBM Almaden Research Center, San Jose, CA), Pamela Drew (Boeing Co.)
Editors: Ashutosh Tiwary (Boeing Co.; and Univ. of Washington, Seattle), Michael Franklin (Univ. of Maryland, College Park)

ACM SIGMOD Record, Volume 27, Issue 2
June 1998, 595 pages
ISSN: 0163-5808
DOI: 10.1145/276305
Chairmen: Laura Haas (IBM Almaden Research Center, San Jose, CA), Pamela Drew (Boeing Co.)
Editor: Ashutosh Tiwary (Boeing Co.; and Univ. of Washington, Seattle)

Copyright © 1998 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Overall Acceptance Rate: 785 of 4,003 submissions, 20%