ABSTRACT
We present new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit, and apply for arbitrary value distributions and arrival distributions of the dataset. The main memory requirements are smaller than those reported earlier by an order of magnitude.
We also discuss methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are explicit but probabilistic; that is, they hold with respect to a user-controlled confidence parameter.
We present the algorithms, their theoretical analysis and simulation results on different datasets.
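The sampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic combination of one-pass reservoir sampling with a sample size taken from Hoeffding's inequality, under the assumption that the stream is non-empty. The function names (`hoeffding_sample_size`, `reservoir_sample`, `approx_quantile`) and the parameters `eps` (rank error as a fraction of the dataset size) and `delta` (allowed failure probability) are hypothetical and chosen only for illustration.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    """Smallest n satisfying 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


def reservoir_sample(stream, k, rng):
    """Uniform random sample of up to k items in one pass (Algorithm R)."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            # randint is inclusive on both ends, so j is uniform in [0, i].
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = x
    return reservoir


def approx_quantile(stream, phi, eps, delta, seed=0):
    """Estimate the phi-quantile of a non-empty stream from a random sample.

    Illustrative assumption: with probability roughly 1 - delta, the rank of
    the returned value is within about eps * N of phi * N.
    """
    rng = random.Random(seed)
    k = hoeffding_sample_size(eps, delta)
    sample = sorted(reservoir_sample(stream, k, rng))
    idx = min(len(sample) - 1, max(0, math.ceil(phi * len(sample)) - 1))
    return sample[idx]


if __name__ == "__main__":
    n = 100_000
    median = approx_quantile(range(n), phi=0.5, eps=0.05, delta=0.01)
    print("approximate median:", median)
```

The memory used here depends only on `eps` and `delta`, not on the dataset size, which is the qualitative point of coupling quantile estimation with sampling. The guarantee is probabilistic: for a given confidence level `1 - delta` the estimate may still fall outside the error band on an unlucky sample.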
Index Terms
- Approximate medians and other quantiles in one pass and with limited memory