Query Size Estimation for Joins Using Systematic Sampling

Distributed and Parallel Databases

Abstract

We propose a new approach to estimating result sizes for join queries. The technique, which we call "systematic sampling" (SYSSMP), is a novel variant of the sampling-based approach. Its key novelty is that it exploits the sortedness of the data, so that the sample relation obtained closely represents the underlying frequency distribution of the join attribute in the original relation.
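As a rough illustration of why sortedness helps, the sketch below draws a systematic sample by sorting the join-attribute values and taking every k-th value from a random start. It is a minimal sketch of textbook systematic sampling, not the SYSSMP procedure itself; the function name `systematic_sample`, the step `k`, and the toy relation are assumptions made for illustration.

```python
import random
from collections import Counter


def systematic_sample(values, k, rng=random):
    """Sort the join-attribute values, then keep every k-th one.

    `k` is the sampling step (hypothetical parameter name); the start
    offset is drawn uniformly from 0..k-1.
    """
    ordered = sorted(values)
    start = rng.randrange(k)
    return ordered[start::k]


# Toy relation: join attribute with frequencies 60 / 30 / 10.
relation = [1] * 60 + [2] * 30 + [3] * 10
random.shuffle(relation)

sample = systematic_sample(relation, k=10)
print(Counter(sample))
```

Because equal values are adjacent after sorting, each value appears in the sample roughly in proportion to its frequency in the relation. In this toy case the proportions 6 : 3 : 1 are reproduced exactly for every start offset, whereas a simple random sample of the same size would vary from draw to draw.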

We first develop a theoretical foundation for systematic sampling, which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that sample relations yielded by systematic sampling are of higher quality than those produced by simple random sampling.

To verify that sample relations produced by systematic sampling actually lead to more accurate query result size estimates, we compare systematic sampling with t_cross, the most efficient simple random sampling technique, across a variety of relation configurations. The results show that, for the same amount of sampling, systematic sampling provides more accurate query result sizes than t_cross. Furthermore, the extra sampling cost incurred by systematic sampling pays off in a cheaper query execution cost at run-time.
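One way to picture how sample relations feed a join size estimate is to join the two samples and scale the result by the product of the sampling steps. The sketch below uses that generic scale-up estimator; it is an assumption for illustration and is neither the estimator used in the article nor an implementation of t_cross. The names `join_size`, `estimate_join_size`, `k_r`, and `k_s` are hypothetical.

```python
from collections import Counter


def join_size(r_values, s_values):
    """Exact equi-join size computed from join-attribute frequencies."""
    r_freq = Counter(r_values)
    s_freq = Counter(s_values)
    return sum(count * s_freq[value] for value, count in r_freq.items())


def estimate_join_size(r_values, s_values, k_r, k_s, start_r=0, start_s=0):
    """Join two systematic samples and scale by the sampling steps.

    Each relation is sorted on the join attribute and every k-th value
    is kept; the sample join size is multiplied by k_r * k_s.
    """
    r_sample = sorted(r_values)[start_r::k_r]
    s_sample = sorted(s_values)[start_s::k_s]
    return join_size(r_sample, s_sample) * k_r * k_s


# Toy relations with skewed join-attribute frequencies.
R = [1] * 60 + [2] * 30 + [3] * 10
S = [1] * 20 + [2] * 50 + [3] * 30

print(join_size(R, S))                            # exact size
print(estimate_join_size(R, S, k_r=10, k_s=10))   # estimate from samples
```

On these toy relations the estimate coincides with the exact size because every frequency is a multiple of the step; real data would not be this regular, and the estimate would carry sampling error.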

Cite this article

Ngu, A., Harangsri, B. & Shepherd, J. Query Size Estimation for Joins Using Systematic Sampling. Distributed and Parallel Databases 15, 237–275 (2004). https://doi.org/10.1023/B:DAPD.0000018573.35050.25

  • DOI: https://doi.org/10.1023/B:DAPD.0000018573.35050.25
