Abstract
We propose a new approach to estimating query result sizes for join queries. The technique, which we call systematic sampling (SYSSMP), is a novel variant of the sampling-based approach. Its key novelty is that it exploits the sortedness of the data, so that the sample relation obtained closely represents the underlying frequency distribution of the join attribute in the original relation.
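As a rough illustration of why sortedness helps, the following minimal sketch draws a classical 1-in-k systematic sample from a sorted column of join-attribute values. It is the textbook procedure, not necessarily the SYSSMP algorithm itself; the function name `systematic_sample` and the parameters `step` and `rng` are assumptions introduced here for illustration.

```python
import random
from collections import Counter


def systematic_sample(values, step, rng=None):
    """Return a classical 1-in-`step` systematic sample of `values`.

    The values are sorted first, a random start offset in [0, step) is
    chosen, and every `step`-th value from that offset is kept.
    (Hypothetical helper, not taken from the paper.)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    rng = rng or random.Random()
    ordered = sorted(values)        # exploit (or establish) sortedness
    start = rng.randrange(step)     # random start within the first interval
    return ordered[start::step]


if __name__ == "__main__":
    # A join attribute with a skewed frequency distribution: 50/30/20.
    column = [1] * 50 + [2] * 30 + [3] * 20
    sample = systematic_sample(column, step=10)
    # Each value's share of the sample mirrors its share of the column.
    print(Counter(sample))
```

Because equal values are adjacent in sorted order, each run of duplicates is hit roughly in proportion to its length, which is the intuition behind the claim that the sample tracks the frequency distribution of the join attribute.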
We first develop a theoretical foundation for systematic sampling, which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the sample relations yielded by systematic sampling are of higher quality than those produced by simple random sampling.
To verify that sample relations produced by systematic sampling indeed help compute more accurate query result sizes, we compare systematic sampling with the most efficient simple random sampling method, t_cross, using a variety of relation configurations. The results show that systematic sampling uses the same amount of sampling yet provides more accurate query result sizes than t_cross. Furthermore, the extra sampling cost incurred by the use of systematic sampling pays off in a cheaper query execution cost at run-time.
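To make concrete how sample relations can feed a query result size estimate, here is a minimal sketch of a generic scale-up estimator: join the two samples on the join attribute and multiply the sample join size by the inverse sampling fractions. This is an assumed, simplified estimator shown for illustration only; it is neither the estimator evaluated in the paper nor t_cross, and the names `join_size`, `estimate_join_size`, `scale_r`, and `scale_s` are hypothetical.

```python
from collections import Counter


def join_size(r_keys, s_keys):
    """Exact equi-join size of two columns of join-attribute values."""
    r_freq, s_freq = Counter(r_keys), Counter(s_keys)
    return sum(count * s_freq[key] for key, count in r_freq.items())


def estimate_join_size(sample_r, sample_s, scale_r, scale_s):
    """Scale the sample join size up to the full relations.

    `scale_r` and `scale_s` are the inverse sampling fractions, for
    example the step k of a 1-in-k systematic sample (an assumption of
    this sketch).
    """
    return join_size(sample_r, sample_s) * scale_r * scale_s


if __name__ == "__main__":
    r = [1] * 50 + [2] * 30 + [3] * 20
    s = [1] * 20 + [2] * 40 + [3] * 40
    # 1-in-10 systematic samples of the sorted columns, start offset 0.
    sample_r = sorted(r)[0::10]
    sample_s = sorted(s)[0::10]
    print("true size:", join_size(r, s))
    print("estimate: ", estimate_join_size(sample_r, sample_s, 10, 10))
```

When both samples preserve the frequency distribution of the join attribute, the scaled sample join size stays close to the true result size, which is why sample quality matters for estimation accuracy.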
Cite this article
Ngu, A., Harangsri, B. & Shepherd, J. Query Size Estimation for Joins Using Systematic Sampling. Distributed and Parallel Databases 15, 237–275 (2004). https://doi.org/10.1023/B:DAPD.0000018573.35050.25
DOI: https://doi.org/10.1023/B:DAPD.0000018573.35050.25