Random sampling for histogram construction: how much is enough?

Authors:
Surajit Chaudhuri

Microsoft Research

Microsoft Research
View Profile

,
Rajeev Motwani

Stanford University

Stanford University
View Profile

,
Vivek Narasayya

Microsoft Research

Microsoft Research
View Profile

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of dataJune 1998Pages 436–447https://doi.org/10.1145/276304.276343

Published:01 June 1998Publication History

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Pages 436–447

ABSTRACT

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.

References

1.J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88(1993): 364-373.Google ScholarCross Ref
2.K.P. Burnham and W.S. Overton. Estimation of the size of a closed population when capture possibilities vary among animals. Biometrika 65(1978): 625-633.Google ScholarCross Ref
3.K.P. Burnham and W.S. Overton. Robust estimation of population size when capture possibilities vary among animals. Ecology 60(1979): 927-936.Google ScholarCross Ref
4.A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal o/Statistical Theory and Applications 11(1984): 265-270.Google Scholar
5.S. Chaudhuri, R. Motwani, and V. Narasayya. Using Random Sampling for Histogram Construction. Microsoft Research Report, In preparation, 1997.Google Scholar
6.S. Chaudhuri and V. Narasayya. An Efficient, Cost- Driven Index Selection Tool for Microsoft SQL Server. In Proc. 23rd VLDB, 1997. Google ScholarDigital Library
7.S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical Database Design for Relational Databases. A CM TODS, 13(1988): 91-128. Google ScholarDigital Library
8.P.B. Gibbons, Y. Matias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. In Proc. 23rd VLDB, pages 466-475, 1997. Google ScholarDigital Library
9.L.A. Goodman. On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20( 1949): 572-579.Google ScholarCross Ref
10.P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. 21st VLDB, pages 311- 322, 1995. Google ScholarDigital Library
11.P.J. Haas and A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. In Proc. A CM SIG- MOD Conference, pages 341-350, 1992. Google ScholarDigital Library
12.W. Hou, G. Ozsoyoglu, and E. Dogdu. Error- Constrained COUNT Query Evaluation in Relational Databases. In Proc. A CM SIGMOD Conference, pages 278-287, 1991. Google ScholarDigital Library
13.W. Hou, G. Ozsoyoglu, and B. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th A CM Symposium on Principles of Database Systems, pages 276-287, 1988. Google ScholarDigital Library
14.W. Hou, G. Ozsoyoglu, and B. Taneja. Processing aggregate relational queries with hard time constraints. In Proc. A CM SIGMOD Conference, pages 68-77, 1989. Google ScholarDigital Library
15.Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In Proc. A CM SIGMOD Conference, pages 233-244, 1995. Google ScholarDigital Library
16.Y. Ioannidis and V. Poosala. Histogram-Based Solutions to Diverse Database Estimation Problems. IEEE Data Engineering Bulletin 18(1995): 10-18.Google Scholar
17.Y. Ling and W. Sun. An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems. In Proc. IEEE Conference on Data Engineering, pages 532-539, 1995. Google ScholarDigital Library
18.R.J. Lipton and J.F. Naughton. Query Size Estimation by Adaptive Sampling. In Proc. A CM PODS, pages 40-46, 1990. Google ScholarDigital Library
19.R.J. Lipton, J.F. Naughton, and D.A. Schneider. Practical Selectivity Estimation through Adaptive Sampiing. In Proc. A CM SIGMOD Conference, pages 1-11, 1990. Google ScholarDigital Library
20.R.J. Lipton, j.F. Naughton, D.A. Schneider, and S. Seshadri. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science 116(1993): 195-226. Google ScholarDigital Library
21.R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
22.J.F. Naughton and S. Seshadri. On Estimating the Size of Projections. In Proc. Third international Conference on Database Theory, pages 499-513, 1990. Google ScholarDigital Library
23.F. Olken. Random Sampling .from Databases. PhD Thesis, Computer Science, U.C. Berkeley, 1993.Google Scholar
24.F. Olken and D. Rotem. Random Sampling from Databases - A Survey. Manuscript, 1995.Google Scholar
25.G. Ozsoyoglu, K. Du, A. Tjahjana, W. Hou, and D.Y. Rowland. On estimating COUNT, SUM, and AV- ERAGE relational algebra queries. In Proc. Conference on Database and Expert Systems Applications, pages 406-412, 1991.Google ScholarCross Ref
26.V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. in Proc. A CM SIGMOD Conference, pages 294-305, 1996. Google ScholarDigital Library
27.G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proc. A CM SIGMOD Conference, pages 256-276, 1984. Google ScholarDigital Library
28.P.G. Selinger, D.D. Astrahan, R.A. Chamberlain, R.A. Lorie, and T.G. Price. Access path selection in a relational database management system. In Proc. A CM SIGMOD Conference, pages 23-34, 1979. Google ScholarDigital Library
29.G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.Google Scholar

Recommendations

Random sampling for histogram construction: how much is enough?

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address ...
Read More
AutoAdmin “what-if” index analysis utility
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

As databases get widely deployed, it becomes increasingly important to reduce the overhead of database administration. An important aspect of data administration that critically influences performance is the ability to select indexes for a database. In ...
Read More
Effective use of block-level sampling in statistics estimation
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Chairmen:
Laura Haas
IBM AlmadenResearch Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editors:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
,
Michael Franklin
Univ. of Maryland, College Park
ACM SIGMOD Record Volume 27, Issue 2
June 1998
595 pages
ISSN:0163-5808
DOI:10.1145/276305
Chairmen:
Laura Haas
IBM Almaden Research Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editor:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
Issue’s Table of Contents
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1998
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 186
  Total Citations
  View Citations
- 2,591
  Total Downloads
- Downloads (Last 12 months)549
- Downloads (Last 6 weeks)68
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Random sampling for histogram construction: how much is enough?

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Recommendations

Random sampling for histogram construction: how much is enough?

AutoAdmin “what-if” index analysis utility

Effective use of block-level sampling in statistics estimation