DOI: 10.1145/276304.276343

Random sampling for histogram construction: how much is enough?

Published: 01 June 1998

ABSTRACT

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.
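
As a rough illustration of the setting the abstract describes, the sketch below builds equi-height bucket separators from a uniform random sample and then checks how evenly those separators split the full data. The function names and the error measure (largest relative deviation of any bucket count from the ideal n/k) are assumptions made for this example. The measure follows the abstract's intuition that error should be small in every bucket, but it is not claimed to be the paper's exact metric, and the adaptive page sampling algorithm is not modeled here.

```python
import random
from bisect import bisect_right


def equi_height_separators(values, k):
    """Return k-1 separators that split `values` into k roughly equal-count buckets."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]


def bucket_counts(values, separators):
    """Count the values in each bucket; bucket i holds sep[i-1] <= v < sep[i]."""
    counts = [0] * (len(separators) + 1)
    for v in values:
        counts[bisect_right(separators, v)] += 1
    return counts


def max_bucket_error(values, separators):
    """Largest relative deviation of any bucket count from the ideal n/k.

    Illustrative stand-in for a conservative, worst-bucket error measure.
    """
    counts = bucket_counts(values, separators)
    ideal = len(values) / len(counts)
    return max(abs(c - ideal) for c in counts) / ideal


if __name__ == "__main__":
    rng = random.Random(0)                       # fixed seed for repeatability
    data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
    for sample_size in (100, 1_000, 10_000):
        sample = rng.sample(data, sample_size)   # uniform sample without replacement
        seps = equi_height_separators(sample, 10)
        print(sample_size, round(max_bucket_error(data, seps), 3))
```

Running the script prints the worst-bucket error for a few sample sizes, which gives an informal feel for the "how much sampling is enough?" question: larger samples tend to yield separators that divide the full data more evenly.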

  • Published in

    SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
    June 1998
    599 pages
    ISBN: 0897919955
    DOI: 10.1145/276304

    Copyright © 1998 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 785 of 4,003 submissions, 20%
