Weighted sampling without replacement from data streams

https://doi.org/10.1016/j.ipl.2015.07.007

Highlights

  • New results for sampling in the streaming model.

  • A new method of performing weighted random sampling without replacement using weighted random sampling with replacement.

  • The new sampling algorithm avoids losing error when using finite precision.

Abstract

Weighted sampling without replacement has proved to be a very important tool in designing new algorithms. Efraimidis and Spirakis [5] presented an algorithm for weighted sampling without replacement from data streams. Their algorithm works under the assumption of precise computations over the interval [0,1]. Cohen and Kaplan [3] used similar methods for their bottom-k sketches.

Efraimidis and Spirakis ask as an open question whether using finite precision arithmetic impacts the accuracy of their algorithm. In this paper we show a method to avoid this problem by providing a precise reduction from k-sampling without replacement to k-sampling with replacement. We call the resulting method Cascade Sampling.

Introduction

Random sampling is a fundamental tool with many applications in computer science (see, e.g., Motwani and Raghavan [12], Knuth [9], Tille [15], and Olken [13]). Random sampling methods are widely used in data stream processing because of their simplicity and efficiency [14], [8], [7], [6], [10], [11]. In a stream, the size of the domain and the probability of sampling an element both change constantly; this makes the process of sampling non-trivial. We distinguish between sampling with replacement, where all samples are independent (and thus can be repeated), and sampling without replacement, where repetitions are prohibited.

In particular, weighted sampling without replacement has proven to be a very important tool. In weighted sampling, each element is given a weight, where the probability of an element being selected is based on its weight. In their work Efraimidis and Spirakis [5] presented an algorithm for weighted sampling without replacement. Cohen and Kaplan [3] use similar methods for their bottom-k sketches. While their preliminary implementation yielded promising results, Efraimidis and Spirakis [5] state, as the main open problem of the paper, “However, the question if, and to what extent, the finite precision arithmetic affects the algorithms remains an open problem.”

In this paper we continue this work and provide a new algorithm to avoid the issue of relying on finite precision arithmetic. With this result we show that precision loss is not required in order to sample without replacement. We accomplish this by providing a precise reduction from k-sampling without replacement to k-sampling with replacement, using a special case of k-sampling with replacement, unit sampling (where k=1). Additionally, we believe that in the future our method of expressing different random samples via reduction will provide a tool that allows further translation of other sampling methods into a more effective form for streams.

Due to its fundamental nature, the problem of random sampling has received considerable attention in the last few decades.

In 1985, Vitter [16] presented uniform sampling using a reservoir (with and without replacement) over streams. Further, the question of reductions between sampling methods has been addressed before. For instance, Chaudhuri, Motwani and Narasayya [2] briefly discuss reductions for various sampling methods. Cohen and Kaplan [3] use a “mimicking process” in their paper, which is essentially a reduction from sampling without replacement to sampling with replacement.

Chaudhuri, Motwani and Narasayya [2] use the well-known method of “over-sampling”: elements are sampled independently, with replacement, until k distinct elements are obtained. Clearly, this schema does not introduce any precision loss, since unit sampling is used as a black box.
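As a concrete illustration, the over-sampling schema can be sketched in a few lines. The function below is our own non-streaming illustration (the name `oversample_without_replacement` is ours, not from [2]), treating Python's `random.choices` as the black-box weighted unit sampler:

```python
import random

def oversample_without_replacement(population, weights, k, rng=random):
    """Weighted sampling without replacement via over-sampling:
    draw independent weighted unit samples (with replacement) and
    keep drawing until k distinct elements have appeared."""
    assert k <= len(population)
    seen = []        # distinct elements, in order of first appearance
    seen_set = set()
    while len(seen) < k:
        # one weighted unit sample, used strictly as a black box
        x = rng.choices(population, weights=weights, k=1)[0]
        if x not in seen_set:
            seen_set.add(x)
            seen.append(x)
    return seen
```

Since the unit sampler is only invoked, never modified, no extra precision is lost; the cost is that the number of draws is unbounded in the worst case, which is exactly the weakness discussed next.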

Unfortunately, the number of draws required by over-sampling depends on the weight distribution of the data set, and thus can be arbitrarily large. In particular, consider the case when one element has a weight that is overwhelmingly larger than that of the rest of the population. In this case, the number of repetitions encountered while sampling with replacement is significantly larger than k.

Probably the first effective non-streaming solution for the weighted sampling without replacement problem was the algorithm of Wong and Easton [17]. It is used by many other algorithms (see Olken [13] for a discussion). For data streams, Efraimidis and Spirakis [5] proposed an algorithm that is based on the “exponent method”. The algorithm requires precise computation of random keys r^(1/w(p)), where r is drawn uniformly from [0,1]. The sample generated is composed of the k elements with maximal keys. Cohen and Kaplan [3] used similar methods as a building block for their bottom-k sketches. The bottom-k sketch is an effective construction that has been extensively used for various applications, including approximation of aggregate queries over data streams. As Cohen and Kaplan [3] show, these methods are very effective in practical applications and are superior to sketches based on sampling with replacement.
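The exponent method admits a compact streaming implementation; the following sketch is our illustration of the scheme (not the authors' pseudocode), assigning each element the floating-point key r^(1/w) and keeping the current top-k keys in a min-heap:

```python
import heapq
import random

def exponent_method_sample(stream, k, rng=random):
    """Weighted sampling without replacement via the exponent method:
    each (item, weight) pair gets key r ** (1 / weight) with r ~ U[0,1];
    the sample is the k items with the largest keys.  A min-heap of size
    k tracks the current top-k while the stream is consumed."""
    heap = []  # (key, item) pairs; smallest key at the top
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

The `r ** (1.0 / weight)` step is precisely where finite-precision arithmetic enters: for very large or very skewed weights the keys crowd toward 1.0, which is the accuracy concern raised in [5].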

While effective in practice, the algorithms of Efraimidis and Spirakis and Cohen and Kaplan introduce a loss of accuracy, since their techniques require additional floating point arithmetic operations.

In this paper we show that the tradeoff between precision and performance is not a necessary property of sampling without replacement from data streams and construct a precise streaming reduction from k-sampling without replacement to k-sampling with replacement. This result provides a practical improvement to the algorithms of Efraimidis and Spirakis in cases where high accuracy is required.

Our method yields a surprisingly simple algorithm, given the importance of sampling without replacement and the existence of many previous methods. We call this algorithm Cascade Sampling. In particular, when used with the algorithm from [2], Cascade Sampling requires O(k) memory, constant time per element, and the same precision as in [2].

Let Λ be any algorithm that maintains a unit weighted sample from a stream D. Similarly to the over-sampling method, we maintain k instances Λ_1, …, Λ_k. However, we introduce the idea of stream modification: instead of applying Λ independently and symmetrically to D, we apply Λ_i to a modified stream D_i that does not contain the samples of Λ_j for j < i. In particular, Λ_i may process its input elements in an order different from the order of their arrival in D. This simple but novel idea is sufficient to solve the problem: the input of Λ_i is a random set that precisely matches the definition of weighted sampling without replacement. Since we use Λ as a black box with only a constant number of auxiliary variables, specifically pointers, the resulting schema is a precise reduction.
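The chaining idea can be sketched as follows. This is a simplified illustration of the reduction, not the paper's pseudocode: the function name and the inlined unit-sampling rule (replace the held sample with probability w/W, where W is the running total weight seen by that instance) are our assumptions. An element rejected or evicted by instance i is handed immediately to instance i+1:

```python
import random

def cascade_sample(stream, k, rng=random):
    """Sketch of Cascade Sampling: k chained unit weighted samplers.
    Sampler i sees every (item, weight) element of the stream except
    the elements currently held by samplers 1..i-1: whenever a sampler
    rejects an arrival, or evicts its old sample in favor of a new one,
    the rejected/evicted element cascades down to the next sampler."""
    totals = [0.0] * k    # total weight seen so far by sampler i
    samples = [None] * k  # (item, weight) currently held by sampler i

    for element in stream:          # element = (item, weight)
        for i in range(k):
            if element is None:     # absorbed by an empty sampler above
                break
            item, weight = element
            totals[i] += weight
            if samples[i] is None or rng.random() < weight / totals[i]:
                # accept: evict the old sample, which cascades onward
                element, samples[i] = samples[i], element
            # on rejection, the same element falls through to sampler i+1
    return [s[0] for s in samples if s is not None]
```

Because each physical element is held by at most one instance at a time, the input of instance i never contains the samples of instances 1..i-1, which is the stream-modification property described above.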

Section snippets

Definitions

An important building block of our algorithm is the concept of a unit sample, that is, the ability to sample a single element from a set.

Definition 1

Let S be a finite set of elements and let w be a non-negative function w : S → R. A random element X_S with values from S is a unit weighted random sample if, for any a ∈ S, P(X_S = a) = w(a)/w(S), where w(S) = Σ_{a ∈ S} w(a).

For an algorithm instantiating weighted unit sampling we use Black-Box WR2 from [2]; Black-Box WR2 yields a unit sample when r = 1 (Algorithm 1).
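For intuition, a standard streaming rule realizing the guarantee of Definition 1 keeps a running total W of the weights seen so far and replaces the current sample by each arriving element with probability w/W. The class below is our illustrative sketch of such a unit sampler, not the Black-Box WR2 pseudocode from [2]:

```python
import random

class UnitWeightedSampler:
    """Streaming unit weighted sample: after processing p_1..p_n,
    P(sample == p_i) = w(p_i) / (w(p_1) + ... + w(p_n)).
    Correctness follows by induction: the new element is kept with
    probability w/W, and every previously sampled element survives
    with probability 1 - w/W, preserving its marginal."""

    def __init__(self, rng=random):
        self.rng = rng
        self.total = 0.0    # running total weight W
        self.sample = None  # current unit sample

    def process(self, item, weight):
        self.total += weight
        if self.rng.random() < weight / self.total:
            self.sample = item
        return self.sample
```

Note that the acceptance test uses a single comparison per element, so only constant work and memory are needed per stream element.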

Definition 2

A data stream

Cascade sampling

Let S be a finite set with |S| ≥ k and let a ∉ S. Denote T = S ∪ {a}, and let w : T → R+ be a weight function. Let {X_1, …, X_k} be a k-sample without replacement from S with respect to w. Define an ordered sequence {Y_1, …, Y_k} as follows: Y_1 = a with probability w(a)/w(T), and Y_1 = X_1 otherwise. For i ≥ 1 define L_i = ({X_1, …, X_i} ∪ {a}) \ {Y_1, …, Y_i}. We will show that |L_i| = 1; assuming that, let Z_i denote the single element of L_i, i.e., L_i = {Z_i}. Put U_i = T

Precise reduction and resulting algorithm

Let Λ be an algorithm that maintains a unit weighted sample from D. The algorithm from [2] is an example of Λ, but our reduction works with any algorithm for unit weighted sampling. We construct an algorithm ϒ = ϒ(Λ) such that ϒ maintains a k-sample without replacement. Specifically, we maintain k instances of Λ, namely Λ_1, …, Λ_k, such that the input of Λ_i is a random substream of D that is selected in a special way. We denote the input stream of Λ_i by D_i, and let X_i be the sample produced by Λ_i. The critical

Acknowledgements

We thank our anonymous reviewers for their helpful suggestions, particularly for suggesting interesting open problems for discussion.

References (17)

  • Pavlos S. Efraimidis et al., Weighted random sampling with a reservoir, Inf. Process. Lett. (March 2006)

  • Vladimir Braverman et al., Smooth histograms for sliding windows

  • Surajit Chaudhuri et al., On random sampling over joins

  • Edith Cohen et al., Tighter estimation using bottom k sketches, Proc. VLDB Endow. (August 2008)

  • Pavlos S. Efraimidis, Weighted random sampling over data streams

  • Gereon Frahling et al., Sampling in dynamic data streams and applications

  • Phillip B. Gibbons et al., New sampling-based summary statistics for improving approximate query answers

  • Theodore Johnson et al., Sampling algorithms in a stream operator


1. This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639, the Google Faculty Award and DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.

2. This material is based upon work supported in part by Raytheon BBN Technologies.
