Abstract
Sampling-based methods for estimating relation sizes after relational operators such as selections, joins, and projections have been intensively studied in recent years. Methods of this type can achieve high estimation accuracy and efficiency. Since the dominant overhead in a sampling-based method is the sampling cost, different variants of sampling methods have been proposed to minimize the sampling percentage (thus reducing the sampling cost) while maintaining estimation accuracy in terms of the confidence level and relative error (to be precisely defined later in Section 2). Determining the minimal sampling percentage requires overall characteristics of the data, such as the mean and variance. Currently, the representative sampling-based methods in the literature assume that these overall characteristics are unavailable, and thus devote a significant amount of effort to estimating them so as to approach the optimal (minimal) sampling percentage. Estimating these characteristics incurs cost and is itself subject to estimation error. In this short essay, we point out that the exact values of these characteristics can be tracked in a database system at negligible overhead. As a result, the minimal sampling percentage that ensures the specified relative error and confidence level can be precisely determined.
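The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's own procedure: it assumes simple random sampling without replacement, a normal approximation, and the textbook sample-size formula with finite population correction, n0 = (z S / (e mean))^2 and n = n0 / (1 + n0 / N). The names `RunningMoments` and `min_sample_size` are hypothetical, and the tracked value stands for whatever per-tuple quantity the estimator averages.

```python
import math
from statistics import NormalDist


class RunningMoments:
    """Hypothetical helper: keeps count, sum, and sum of squares so the
    exact mean and variance are available in O(1) per insert or delete."""

    def __init__(self):
        self.n = 0
        self.s1 = 0.0  # running sum of values
        self.s2 = 0.0  # running sum of squared values

    def insert(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def delete(self, x):
        self.n -= 1
        self.s1 -= x
        self.s2 -= x * x

    def mean(self):
        return self.s1 / self.n

    def variance(self):
        # Population variance S^2 with divisor N - 1 (survey-sampling convention).
        return (self.s2 - self.s1 ** 2 / self.n) / (self.n - 1)


def min_sample_size(N, mean, variance, rel_error, confidence):
    """Smallest sample size meeting the relative error at the given
    confidence level, under the assumptions stated above."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    n0 = (z * z * variance) / (rel_error * mean) ** 2
    n = n0 / (1 + n0 / N)  # finite population correction
    return min(N, math.ceil(n))


moments = RunningMoments()
for v in range(1, 101):  # toy relation with values 1..100
    moments.insert(float(v))

n = min_sample_size(moments.n, moments.mean(), moments.variance(),
                    rel_error=0.1, confidence=0.95)
print(f"sample {n} of {moments.n} tuples ({100 * n / moments.n:.0f}%)")
```

Because the mean and variance here are exact rather than estimated from a pilot sample, the sample size is computed once, with no extra sampling cost and no estimation error in the inputs. A plain sum-of-squares accumulator can lose precision on large or poorly scaled data, so a production version would likely need a numerically stabler update.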