ABSTRACT
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.
- 1.J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88(1993): 364-373.Google ScholarCross Ref
- 2.K.P. Burnham and W.S. Overton. Estimation of the size of a closed population when capture possibilities vary among animals. Biometrika 65(1978): 625-633.Google ScholarCross Ref
- 3.K.P. Burnham and W.S. Overton. Robust estimation of population size when capture possibilities vary among animals. Ecology 60(1979): 927-936.Google ScholarCross Ref
- 4.A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal o/Statistical Theory and Applications 11(1984): 265-270.Google Scholar
- 5.S. Chaudhuri, R. Motwani, and V. Narasayya. Using Random Sampling for Histogram Construction. Microsoft Research Report, In preparation, 1997.Google Scholar
- 6.S. Chaudhuri and V. Narasayya. An Efficient, Cost- Driven Index Selection Tool for Microsoft SQL Server. In Proc. 23rd VLDB, 1997. Google ScholarDigital Library
- 7.S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical Database Design for Relational Databases. A CM TODS, 13(1988): 91-128. Google ScholarDigital Library
- 8.P.B. Gibbons, Y. Matias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. In Proc. 23rd VLDB, pages 466-475, 1997. Google ScholarDigital Library
- 9.L.A. Goodman. On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20( 1949): 572-579.Google ScholarCross Ref
- 10.P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. 21st VLDB, pages 311- 322, 1995. Google ScholarDigital Library
- 11.P.J. Haas and A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. In Proc. A CM SIG- MOD Conference, pages 341-350, 1992. Google ScholarDigital Library
- 12.W. Hou, G. Ozsoyoglu, and E. Dogdu. Error- Constrained COUNT Query Evaluation in Relational Databases. In Proc. A CM SIGMOD Conference, pages 278-287, 1991. Google ScholarDigital Library
- 13.W. Hou, G. Ozsoyoglu, and B. Taneja. Statistical estimators for relational algebra expressions. In Proc. 7th A CM Symposium on Principles of Database Systems, pages 276-287, 1988. Google ScholarDigital Library
- 14.W. Hou, G. Ozsoyoglu, and B. Taneja. Processing aggregate relational queries with hard time constraints. In Proc. A CM SIGMOD Conference, pages 68-77, 1989. Google ScholarDigital Library
- 15.Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In Proc. A CM SIGMOD Conference, pages 233-244, 1995. Google ScholarDigital Library
- 16.Y. Ioannidis and V. Poosala. Histogram-Based Solutions to Diverse Database Estimation Problems. IEEE Data Engineering Bulletin 18(1995): 10-18.Google Scholar
- 17.Y. Ling and W. Sun. An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems. In Proc. IEEE Conference on Data Engineering, pages 532-539, 1995. Google ScholarDigital Library
- 18.R.J. Lipton and J.F. Naughton. Query Size Estimation by Adaptive Sampling. In Proc. A CM PODS, pages 40-46, 1990. Google ScholarDigital Library
- 19.R.J. Lipton, J.F. Naughton, and D.A. Schneider. Practical Selectivity Estimation through Adaptive Sampiing. In Proc. A CM SIGMOD Conference, pages 1-11, 1990. Google ScholarDigital Library
- 20.R.J. Lipton, j.F. Naughton, D.A. Schneider, and S. Seshadri. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science 116(1993): 195-226. Google ScholarDigital Library
- 21.R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
- 22.J.F. Naughton and S. Seshadri. On Estimating the Size of Projections. In Proc. Third international Conference on Database Theory, pages 499-513, 1990. Google ScholarDigital Library
- 23.F. Olken. Random Sampling .from Databases. PhD Thesis, Computer Science, U.C. Berkeley, 1993.Google Scholar
- 24.F. Olken and D. Rotem. Random Sampling from Databases - A Survey. Manuscript, 1995.Google Scholar
- 25.G. Ozsoyoglu, K. Du, A. Tjahjana, W. Hou, and D.Y. Rowland. On estimating COUNT, SUM, and AV- ERAGE relational algebra queries. In Proc. Conference on Database and Expert Systems Applications, pages 406-412, 1991.Google ScholarCross Ref
- 26.V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. in Proc. A CM SIGMOD Conference, pages 294-305, 1996. Google ScholarDigital Library
- 27.G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proc. A CM SIGMOD Conference, pages 256-276, 1984. Google ScholarDigital Library
- 28.P.G. Selinger, D.D. Astrahan, R.A. Chamberlain, R.A. Lorie, and T.G. Price. Access path selection in a relational database management system. In Proc. A CM SIGMOD Conference, pages 23-34, 1979. Google ScholarDigital Library
- 29.G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.Google Scholar
Recommendations
Random sampling for histogram construction: how much is enough?
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address ...
AutoAdmin “what-if” index analysis utility
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of dataAs databases get widely deployed, it becomes increasingly important to reduce the overhead of database administration. An important aspect of data administration that critically influences performance is the ability to select indexes for a database. In ...
Effective use of block-level sampling in statistics estimation
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of dataBlock-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of ...
Comments