Query Size Estimation for Joins Using Systematic Sampling

Ngu, A.H.H.; Harangsri, B.; Shepherd, J.

doi:10.1023/B:DAPD.0000018573.35050.25


Published: May 2004

Volume 15, pages 237–275 (2004)

Distributed and Parallel Databases



Abstract

We propose a new approach to estimating query result sizes for join queries. The technique, which we call “systematic sampling” (SYSSMP), is a novel variant of the sampling-based approach. Its key novelty is that it exploits the sortedness of data; as a result, the sample relation obtained closely represents the underlying frequency distribution of the join attribute in the original relation.
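The idea of sampling from sorted data can be illustrated with textbook 1-in-k systematic sampling. The sketch below is not the paper's SYSSMP procedure, whose details are not given in this abstract; the function name `systematic_sample`, its parameters, and the toy join column are assumptions made for illustration only.

```python
import random
from collections import Counter


def systematic_sample(values, k, start=None, rng=random):
    """Sort `values`, then keep every k-th element from offset `start`.

    Hypothetical helper: classic 1-in-k systematic sampling, assumed here
    as a stand-in for the sampling step described in the abstract.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ordered = sorted(values)          # the method relies on sorted data
    if start is None:
        start = rng.randrange(k)      # random offset in [0, k)
    if not 0 <= start < k:
        raise ValueError("start must satisfy 0 <= start < k")
    return ordered[start::k]


# Toy join column (assumed data): value 1 occurs 60 times, 2 occurs 30, 3 occurs 10.
join_column = [1] * 60 + [2] * 30 + [3] * 10
sample = systematic_sample(join_column, k=10, start=4)
sample_counts = Counter(sample)       # frequencies observed in the sample
```

On this toy column the 10-element sample holds the values 1, 2, and 3 in the ratio 6:3:1, which mirrors the 60:30:10 ratio of the full column. That exact match depends on the frequencies being multiples of the step `k`; in general the sample frequencies only approximate the original ones.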

We first develop a theoretical foundation for systematic sampling, which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the sample relations yielded by systematic sampling are of higher quality than those produced by traditional simple random sampling.

To verify that sample relations produced by systematic sampling indeed help compute more accurate query result sizes, we compare systematic sampling with the most efficient simple random sampling method, called t_cross, using a variety of relation configurations. The results show that systematic sampling uses the same amount of sampling but still provides more accurate query result sizes than t_cross. Furthermore, the extra sampling cost incurred by the use of systematic sampling pays off in a cheaper query execution cost at run-time.
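One simple way to turn two sample relations into a join size estimate is to join the samples and scale the count by the product of the sampling steps. The sketch below assumes that scale-up estimator; it is neither the paper's estimator nor t_cross, neither of which is specified in this abstract. The helper names `every_kth`, `join_size`, and `estimate_join_size` and the toy relations are hypothetical.

```python
from collections import Counter


def every_kth(values, k, start=0):
    """Systematic sample: every k-th element of the sorted values."""
    return sorted(values)[start::k]


def join_size(r_values, s_values):
    """Exact equi-join size: sum over join values of freq_R * freq_S."""
    r_counts = Counter(r_values)
    s_counts = Counter(s_values)
    return sum(count * s_counts[value] for value, count in r_counts.items())


def estimate_join_size(r_values, s_values, k_r, k_s):
    """Join the two samples, then scale by k_r * k_s (assumed estimator)."""
    sample_join = join_size(every_kth(r_values, k_r), every_kth(s_values, k_s))
    return sample_join * k_r * k_s


# Toy join columns of two relations R and S (assumed data).
r = [1] * 60 + [2] * 30 + [3] * 10
s = [1] * 20 + [2] * 40 + [3] * 40

exact = join_size(r, s)
estimate = estimate_join_size(r, s, k_r=10, k_s=10)
```

Here the estimate equals the exact join size because every value frequency is a multiple of the sampling step, so each sample reproduces the frequency ratios exactly. With skewed or irregular frequencies the two numbers would differ, and measuring that difference is what an accuracy comparison between sampling methods would involve.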



Author information

Authors and Affiliations

Department of Computer Science, Texas State University, San Marcos, Texas, USA
A.H.H. Ngu
National Electronics and Computer Technology, 112 Thailand Science Park, Paholyothin Rd., Pathumthani, 12120, Thailand
B. Harangsri
School of Computer Science and Engineering, University of New South Wales, 2052, Sydney, Australia
J. Shepherd


About this article

Cite this article

Ngu, A., Harangsri, B. & Shepherd, J. Query Size Estimation for Joins Using Systematic Sampling. Distributed and Parallel Databases 15, 237–275 (2004). https://doi.org/10.1023/B:DAPD.0000018573.35050.25


