ABSTRACT
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. Before DBMSs that provide highly accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
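The abstract does not define the two structures, so the following is only an illustrative sketch. It assumes that a concise sample stores each sampled value that occurs more than once as a (value, count) pair, and that a singleton costs one unit of space while a pair costs two. The function names and the space accounting are hypothetical, and the sketch omits the incremental maintenance algorithm entirely. Under those assumptions it shows how the same footprint can hold more sample points than a standard sample, and how a hot list might be read off the counts.

```python
from collections import Counter

def to_concise(sample_points):
    """Collapse sampled values into {value: count} (assumed representation)."""
    return Counter(sample_points)

def footprint(concise):
    """Assumed space cost: 1 unit for a singleton, 2 for a (value, count) pair."""
    return sum(1 if count == 1 else 2 for count in concise.values())

def sample_size(concise):
    """Number of sample points the concise form represents."""
    return sum(concise.values())

def hot_list(concise, k):
    """The k most frequent sampled values, with their sample counts."""
    return concise.most_common(k)

# A standard sample of 8 points would occupy 8 units of space.
points = [7, 7, 7, 3, 7, 9, 3, 7]
concise = to_concise(points)

print(footprint(concise))    # 5: pairs for 7 and 3, a singleton for 9
print(sample_size(concise))  # 8
print(hot_list(concise, 2))  # [(7, 5), (3, 2)]
```

In this toy input the concise form represents 8 sample points in 5 units of space. The gain depends on how skewed the data is: with no repeated values, the footprint equals the sample size.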