DOI: 10.1145/276304.276334

New sampling-based summary statistics for improving approximate query answers

Published: 01 June 1998

ABSTRACT

In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. Before DBMSs providing highly accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
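
The abstract does not spell out how a concise sample is maintained, so the following is only a minimal sketch of one plausible scheme, not the paper's exact algorithm. It assumes that repeated values are stored as (value, count) pairs, that a singleton costs one unit of footprint and a pair costs two, and that an entry threshold is raised whenever the footprint bound is exceeded. The class name `ConciseSample`, the `growth` factor, and all method names are hypothetical.

```python
import random
from collections import Counter


class ConciseSample:
    """Illustrative concise sample: repeated values share one (value, count) entry."""

    def __init__(self, max_footprint, growth=1.1, rng=None):
        self.max_footprint = max_footprint  # space bound, in stored words (assumed unit)
        self.growth = growth                # assumed factor for raising the threshold
        self.threshold = 1.0                # each insert is sampled with prob. 1/threshold
        self.counts = Counter()             # value -> number of sample points
        self.rng = rng if rng is not None else random.Random()

    def footprint(self):
        # A singleton is stored as a bare value (1 word); a pair needs value + count (2 words).
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def sample_size(self):
        # Number of sample points represented, counting repeats.
        return sum(self.counts.values())

    def insert(self, value):
        # Admit the new item with probability 1/threshold.
        if self.rng.random() < 1.0 / self.threshold:
            self.counts[value] += 1
            # Evict until the sample fits in the space bound again.
            while self.footprint() > self.max_footprint:
                self._raise_threshold()

    def _raise_threshold(self):
        new_threshold = self.threshold * self.growth
        keep_prob = self.threshold / new_threshold
        for value in list(self.counts):
            # Each existing sample point survives independently with prob. keep_prob.
            kept = sum(
                1 for _ in range(self.counts[value]) if self.rng.random() < keep_prob
            )
            if kept:
                self.counts[value] = kept
            else:
                del self.counts[value]
        self.threshold = new_threshold
```

Under these assumptions, skewed data with few distinct values fits entirely within a small footprint, so the sample can hold far more points than a plain sample of the same size. That is the kind of advantage the abstract quantifies.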

Published in: SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, June 1998, 599 pages. ISBN: 0897919955. DOI: 10.1145/276304

Copyright © 1998 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 785 of 4,003 submissions, 20%
