Weighted sampling without replacement from data streams
Introduction
Random sampling is a fundamental tool that has many applications in computer science (see e.g., Motwani and Raghavan [12], Knuth [9], Tillé [15], and Olken [13]). Random sampling methods are widely used in data stream processing because of their simplicity and efficiency [14], [8], [7], [6], [10], [11]. In a stream, the size of the domain and the probability of sampling an element both change constantly; this makes the process of sampling non-trivial. We distinguish between sampling with replacement, where all samples are independent (and thus can be repeated), and sampling without replacement, where repetitions are prohibited.
In particular, weighted sampling without replacement has proven to be a very important tool. In weighted sampling, each element is given a weight, and the probability that an element is selected depends on its weight. In their work, Efraimidis and Spirakis [5] presented an algorithm for weighted sampling without replacement. Cohen and Kaplan [3] use similar methods for their bottom-k sketches. While their preliminary implementation yielded promising results, Efraimidis and Spirakis [5] state, as the main open problem of the paper, “However, the question if, and to what extent, the finite precision arithmetic affects the algorithms remains an open problem.”
In this paper we continue this work and provide a new algorithm that avoids relying on finite precision arithmetic. With this result we show that precision loss is not required in order to sample without replacement. We accomplish this by providing a precise reduction from k-sampling without replacement to k-sampling with replacement, using a special case of k-sampling with replacement, unit sampling (where k = 1). Additionally, we believe that in the future our method of expressing different random samples via reduction will provide a tool that allows further translation of other sampling methods into a more effective form for streams.
Due to its fundamental nature, the problem of random sampling has received considerable attention in the last few decades.
In 1985, Vitter [16] presented uniform sampling using a reservoir (with and without replacement) over streams. Further, the question of reductions between sampling methods has been addressed before. For instance, Chaudhuri, Motwani and Narasayya [2] briefly discuss reductions for various sampling methods. Cohen and Kaplan [3] use a “mimicking process” in their paper, which is essentially a reduction from sampling without replacement to sampling with replacement.
Chaudhuri, Motwani and Narasayya [2] use the well-known method of “over-sampling”, i.e., unit samples are drawn from the set independently until k distinct elements are obtained. Clearly, this schema does not introduce any precision loss, since unit sampling is used as a black box.
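The over-sampling schema can be sketched as follows. This is an illustrative sketch, not the pseudocode from [2]; the function names are ours, and `unit_sample` stands in for any black-box weighted unit sampler.

```python
import random

def unit_sample(population, weights, rng):
    # Hypothetical black-box unit sampler: returns one element with
    # probability proportional to its weight.
    return rng.choices(population, weights=weights, k=1)[0]

def over_sample(population, weights, k, rng=None):
    # Over-sampling reduction: draw unit samples with replacement
    # until k distinct elements have been collected.
    rng = rng or random.Random()
    seen = []
    while len(seen) < k:
        x = unit_sample(population, weights, rng)
        if x not in seen:
            seen.append(x)
    return seen
```

Because the unit sampler is used only as a black box, the schema inherits its precision exactly; the cost is the unbounded number of repeated draws discussed below.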
Unfortunately, the number of samples this schema requires is a function of the weight distribution of the data set, and thus can be arbitrarily large. In particular, consider the case when there is an element with weight that is overwhelmingly larger than that of the rest of the population. In this case, the number of draws needed while sampling with replacement is significantly larger than k.
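A quick simulation illustrates the blow-up (the weights below are hypothetical and the helper name is ours):

```python
import random

def draws_until_k_distinct(weights, k, rng):
    # Count how many with-replacement unit samples are needed before
    # k distinct indices have been observed.
    seen = set()
    draws = 0
    while len(seen) < k:
        seen.add(rng.choices(range(len(weights)), weights=weights, k=1)[0])
        draws += 1
    return draws
```

With weights such as [1000, 1, 1, 1] and k = 3, the dominant element is drawn over and over, and the average number of draws is far larger than k.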
Probably the first effective non-streaming solution for the weighted sampling without replacement problem was the algorithm of Wong and Easton [17]. It is used by many other algorithms (see Olken [13] for a discussion). For data streams, Efraimidis and Spirakis [5] proposed an algorithm that is based on the “exponent method”. The algorithm requires precise computations of random keys k_i = u_i^(1/w_i), where u_i is chosen independently and uniformly at random from (0, 1) and w_i is the weight of the i-th element. The sample generated is composed of the k elements with maximal keys. Cohen and Kaplan [3] used similar methods as a building block for their bottom-k sketches. The bottom-k sketch is an effective construction that has been extensively used for various applications including approximations of aggregative queries over data streams. As Cohen and Kaplan [3] show, these methods are very effective in practical applications and are superior to the sketches that are based on sampling with replacement.
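A minimal one-pass sketch of the exponent method follows. The function name is ours, the sketch assumes strictly positive weights, and a min-heap keeps the k largest keys seen so far:

```python
import heapq
import random

def es_sample(stream, k, rng=None):
    # Exponent method: assign each element (value, weight) the key
    # u ** (1 / weight) with u uniform in [0, 1), and retain the k
    # elements with the largest keys.
    rng = rng or random.Random()
    heap = []  # (key, value) pairs; the smallest key sits at the root
    for value, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, value))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, value))
    return [value for _, value in heap]
```

Note that each key is a floating-point power of a random number; this is exactly the extra floating-point arithmetic that causes the precision loss discussed below.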
While effective in practice, the algorithms of Efraimidis and Spirakis and Cohen and Kaplan introduce a loss of accuracy, since their techniques require additional floating point arithmetic operations.
In this paper we show that the tradeoff between precision and performance is not a necessary property of sampling without replacement from data streams and construct a precise streaming reduction from k-sampling without replacement to k-sampling with replacement. This result provides a practical improvement to the algorithms of Efraimidis and Spirakis in cases where high accuracy is required.
Our method yields a surprisingly simple algorithm, given the importance of sampling without replacement and the existence of many previous methods. We call this algorithm Cascade Sampling. In particular, when used with the algorithm from [2], Cascade Sampling requires O(k) memory, constant time per element and the same precision as in [2].
Let Λ be any algorithm that maintains a unit weighted sample from stream D. Similarly to the over-sampling method, we maintain several instances of Λ; namely, we maintain k instances Λ_1, …, Λ_k. However, we introduce the idea of stream modification. That is, instead of applying Λ independently and symmetrically on D, we apply Λ_i on a modified stream D_i that does not contain the samples of Λ_j for j < i. In particular, Λ_i may process its input elements in an order different from the order of their arrival in D. This simple but novel idea is sufficient to solve the problem. In particular, we can claim that the set of samples maintained by Λ_1, …, Λ_k is a random set that precisely matches the definition of weighted sampling without replacement. Since we use Λ as a black box with only a constant number of auxiliary variables, specifically pointers, the resulting schema is a precise reduction.
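The cascade idea can be sketched as follows. This is an illustrative sketch under our own naming, not the paper's pseudocode: k chained unit samplers, where each arriving element is offered to sampler 1, and whichever element a sampler does not end up holding (the newcomer or its evicted old sample) flows on to the next sampler, so sampler i never sees the elements currently held by samplers 1..i-1.

```python
import random

def cascade_sample(stream, k, rng=None):
    # Sketch of cascading: samples[i] is the (value, weight) pair held
    # by unit sampler i; totals[i] is the total weight offered to it.
    rng = rng or random.Random()
    samples = [None] * k
    totals = [0.0] * k
    for item in stream:
        for i in range(k):
            if item is None:
                break  # absorbed by an empty sampler upstream
            _, weight = item
            totals[i] += weight
            if rng.random() < weight / totals[i]:
                # Sampler i adopts the element; its previous sample
                # (possibly None) is evicted and flows downstream.
                item, samples[i] = samples[i], item
        # An element emerging past sampler k is simply discarded.
    return [s[0] for s in samples if s is not None]
```

This sketch spends O(k) time per element in the worst case; the pointer-based bookkeeping mentioned above is what lets the actual reduction achieve constant time per element.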
Section snippets
Definitions
An important building block of our algorithm is the concept of a unit sample, that is, the ability to sample a single element from a set.
Definition 1 Let S be a finite set of elements and let w be a non-negative function w : S → ℝ≥0. A random element X with values from S is a unit weighted random sample if, for any a ∈ S, P(X = a) = w(a)/w(S). Here w(S) = Σ_{s∈S} w(s).
As an algorithm instantiating weighted unit sampling we use Black-Box WR2 from [2], which yields a unit weighted sample (Algorithm 1).
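For intuition, a standard one-pass weighted unit sampler in this spirit can be sketched as follows. This is an illustrative sketch, not the verbatim Black-Box WR2 pseudocode; the name is ours.

```python
import random

def unit_stream_sample(stream, rng=None):
    # One-pass weighted unit sample: after a prefix with total weight
    # W has been processed, the held element equals e with probability
    # w(e)/W (provable by induction over the stream).
    rng = rng or random.Random()
    total = 0.0
    sample = None
    for value, weight in stream:
        total += weight
        # Replace the held sample with probability weight / total.
        if rng.random() < weight / total:
            sample = value
    return sample
```

The first element is always adopted (its weight equals the running total), and each later element displaces the current sample with probability proportional to its weight.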
Definition 2 A data stream
Cascade sampling
Let S be a finite set such that |S| ≥ k and let n = |S|. Denote [k] = {1, …, k}, and let w : S → ℝ≥0 be a function. Let X = {X_1, …, X_k} be a k-sample without replacement from S with respect to w. Define an ordered sequence of sets {Y_i, i ∈ [k]} as follows: for i ∈ [k] define Y_i = {X_1, …, X_i} \ {X_1, …, X_{i−1}}. We will show that |Y_i| = 1; assuming that, let y_i be the single element from Y_i, i.e., Y_i = {y_i}. Put
Precise reduction and resulting algorithm
Let Λ be an algorithm that maintains a unit weighted sample from D. The algorithm from [2] is an example of Λ, but our reduction works with any algorithm for unit weighted sampling. We construct an algorithm ϒ such that ϒ maintains a k-sample without replacement. Specifically, we maintain k instances of Λ: Λ_1, …, Λ_k, such that the input of Λ_i is a random substream of D that is selected in a special way. We denote the input stream for Λ_i as D_i. Let s_i be the sample produced by Λ_i. The critical
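The sequential view underlying the reduction can be sketched for a static set: draw a unit sample, remove it, and repeat on the shrunken set. This is an illustrative sketch under our own naming, not the paper's algorithm ϒ.

```python
import random

def k_sample_without_replacement(population, weights, k, rng=None):
    # Draw k elements sequentially; each draw is a weighted unit
    # sample over the elements not yet chosen.
    rng = rng or random.Random()
    pool = list(zip(population, weights))
    out = []
    for _ in range(k):
        total = sum(w for _, w in pool)
        r = rng.random() * total
        acc = 0.0
        for j, (value, weight) in enumerate(pool):
            acc += weight
            if r < acc:
                out.append(value)
                pool.pop(j)  # remove so it cannot repeat
                break
    return out
```

The reduction shows that the streaming cascade produces samples with exactly this distribution while reading D only once.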
Acknowledgements
We thank our anonymous reviewers for their helpful suggestions, particularly for suggesting interesting open problems for discussion.
References (17)
- P.S. Efraimidis, P.B. Spirakis, Weighted random sampling with a reservoir, Inf. Process. Lett. (March 2006)
- V. Braverman, R. Ostrovsky, Smooth histograms for sliding windows
- S. Chaudhuri, R. Motwani, V. Narasayya, On random sampling over joins
- E. Cohen, H. Kaplan, Tighter estimation using bottom k sketches, Proc. VLDB Endow. (August 2008)
- P.S. Efraimidis, Weighted random sampling over data streams
- G. Frahling, P. Indyk, C. Sohler, Sampling in dynamic data streams and applications
- P.B. Gibbons, Y. Matias, New sampling-based summary statistics for improving approximate query answers
- T. Johnson, S. Muthukrishnan, I. Rozenbaum, Sampling algorithms in a stream operator
1. This material is based upon work supported in part by the National Science Foundation under Grant No. 1447639, the Google Faculty Award and DARPA grant N660001-1-2-4014. Its contents are solely the responsibility of the authors and do not represent the official view of DARPA or the Department of Defense.
2. This material is based upon work supported in part by Raytheon BBN Technologies.