A randomized online quantile summary in O ( 1 ε log 1 ε ) words

A quantile summary is a data structure that approximates to ε -relative error the order statistics of a much larger underlying dataset. In this paper we develop a randomized online quantile summary for the cash register data input model and comparison data domain model that uses O ( 1 ε log 1 ε ) words of memory. This improves upon the previous best upper bound of O ( 1 ε log 3 / 2 1 ε ) by Agarwal et. al. (PODS 2012). Further, by a lower bound of Hung and Ting (FAW 2010) no deterministic summary for the comparison model can outperform our randomized summary in terms of space complexity. Lastly, our summary has the nice property that O ( 1 ε log 1 ε ) words suﬃce to ensure that the success probability is 1 − e − poly(1 /ε ) .


Introduction
A quantile summary S is a fundamental data structure that summarizes an underlying dataset X of size n, in space much less than n.Given a query φ, S returns a sample y of X such that the rank of y in X is (probably) approximately φn.Quantile summaries are used in sensor networks to aggregate data in an energy-efficient manner and in database query optimizers to generate query execution plans.
Quantile summaries have been developed for a variety of different models and metrics.The data input model we consider is the standard online cash register streaming model, in which a new item is added to the dataset at each new timestep, and the total number of items is not known until the end.The data domain model we consider is the comparison model, in which stream items come from an arbitrary ordered domain (and specifically, not necessarily from the integers).
Formally, our quantile summary problem is defined over a totally ordered domain D and by an error parameter ε ≤ 1/2.There is a dataset X that is initially empty.Time occurs in discrete steps.In timestep t, stream item x t arrives and is then processed, and then any quantile queries φ in that step are received and processed.To be definite, we pick the first timestep to be 1.We write X t or X(t) for the t-item prefix stream x 1 . . .x t of X.The goal is to maintain at all times t a summary S t of the dataset X t that, given any query φ in (0, 1], can return a sample y = y(φ) so that |R(y, X t ) − φt| ≤ εt, where R(a, Z) is the rank of item a in set Z, defined as |{z ∈ Z : a ≤ z}|.For randomized summaries, we only require that ∀t∀φ, P (|R(y, X t ) − φt| ≤ εt) ≥ 2/3; that is, y's rank is only probably close to φt, not definitely close.In fact, it will be easier to deal with the rank directly, so we define ρ = φt and use that in what follows.

Previous work
The two most directly relevant pieces of prior work are randomized online quantile summaries for the cash register/comparison model.Aside from oblivious sampling algorithms (which require storing Ω(1/ε 2 ) samples) we are unaware of any other randomized online quantile summaries that work in the comparison model.
The newer of the two is that of Agarwal, Cormode, Huang, Phillips, Wei, and Yi [1] [2].Among other results, Agarwal et.al. develop a randomized online quantile summary for the cash register/comparison model that uses O( 1 ε log 3/2 1 ε ) words of memory.This summary has the nice property that any two such summaries can be combined to form a summary of the combined underlying dataset without loss of accuracy or increase in size.
The earlier such summary is that of Manku, Rajagopalan, and Lindsay [6], which uses O( 1 ε log 2 1 ε ) space.At a high level, their algorithm downsamples the input stream in a non-uniform way and feeds the downsampled stream into a deterministic summary, which periodically adjusts the downsampling rate.
We note here that our algorithm is inspired by the algorithm of Manku et.al. but has important differences.We defer a discussion of the similarities and differences to section 4 after the presentation of our algorithm in section 3.
For the comparison model, the best deterministic online summary to date is the (GK) summary of Greenwald and Khanna [3], which uses O( 1 ε log εn) space.This improved upon a deterministic (MRL) summary of Manku, Rajagopalan, and Lindsay [5] and a summary implied by Munro and Paterson [7], which use O( 1 ε log 2 εn) space.A more restrictive domain model than the comparison model is the bounded universe model, in which elements are drawn from the integers {1, . . ., u}.For this model there is a deterministic online summary by Shrivastava, Buragohain, Agrawal, and Suri [8] that uses O( log u ε ) space.Not much exists in the way of lower bounds for this problem.There is a simple lower bound of Ω(1/ε) which intuitively comes from the fact that no one sample can satisfy more than 2εn different rank queries.For the comparison model, Hung and Ting [4] developed a deterministic Ω( 1 ε log 1 ε ) lower bound.Whether this bound can be extended to hold for our weaker probabilistic guarantee, and whether our algorithm can be modified to satisfy the stronger deterministic guarantee, are both open questions.

Our results
In the next section we describe a simple O( 1 ε log 1 ε ) streaming summary that is online except that it requires n to be given up front and that it is unable to process queries until it has seen a constant fraction of the input stream.In section 3 we develop this simple summary into a fully online summary that can answer queries at any point in time.We close in section 4 by examining the similarities and differences between our summary and previous work and discuss a design approach for similar streaming problems.

A simple streaming summary
Before we describe our algorithm we must first describe its two main components in a bit more detail than was used in the introduction.The two components are Bernoulli sampling and the GK summary [3].

Bernoulli sampling
Bernoulli sampling downsamples a stream X of size n to a sample stream S by choosing to include each next item into S with independent probability m/n.(As stated this requires knowing the size of X in advance.)At the end of processing X, the expected size of S is m, and the expected rank of any sample y in S is E(R(y, S)) = m n R(y, X).In fact, for any times t ≤ n and partial streams X t and S t , where S t is the sample stream of X t , we have To generate an estimate for R(y, X t ) from S t we use R(y, X t ) = n m R(y, S t ).The following theorem bounds the probability that S is very large or that R(y, X t ) is very far from R(y, X t ) (for any given time t ≥ n/64, but not for all times t = n/64 . . .n combined).The proof is folklore, a simple application of Chernoff bounds.
Theorem 2.1.For all times t ≥ n/64, P Further, for all times t ≥ n/64 and items y, Proof.For the first part, This means that, given any 1 ≤ ρ ≤ t, if we return the sample y ∈ S t with R(y, S t ) = ρm/n, then R(y, X t ) is likely to be close to ρ.

GK summary
The GK summary is a deterministic summary that can answer queries to relative error, over any portion of the received stream.If G t is the summary after inserting the first t items X t from stream X into G then, given any 1 ≤ ρ ≤ t, G t can return a sample y ∈ X t so that |R(y, X t ) − ρ| ≤ εt/8.Greenwald and Khanna guarantee in [3] that G t uses O( 1ε log(εt)) words.We call this the GK guarantee.

Our summary
We combine Bernoulli sampling with the GK summary by downsampling the input data stream X to a sample stream S and then feeding S into a GK summary G.It looks like this: The key reason this gives us a small summary is that we never need to store S; each time we sample an item into S we immediately feed it into G.Therefore, we only use as much space as G(S(X t )) uses.In particular, as long as m = O(poly(1/ε)), we use only O( 1ε log 1 ε ) words.To answer a query ρ for X t we ask G t the query ρm/n and return the resulting sample y.There is a slight issue in that ρm/n may be larger than |S|; but if the approximation guarantee holds for the largest item in X t then ρm/n < (t+εt/8)m/n, so using min{ρm/n, |S|} instead will not cause more than ε/8 relative error in the approximation.
The probability that our sample stream S t is not too big (uses more than 2tm/n samples) is at least 1 − exp(−m/192).If this happens to be the case then the probability that all of its samples y are good (have |R(y, S t ) − E(R(y, S t ))| ≤ εtm/8n) is at least 1 − 4m exp(−ε 2 m/12288) by theorem 2.1 and the union bound.Choosing m ≥ 300000 ln 1/ε ε 2 suffices to guarantee that both events occur with total probability at least 2/3.Further, if both S t events occur then the total error introduced by both S t and G t is at most εt/2.Suppose that G t returns y when given ρm/n.This means that |R(y, S t ) − ρm/n| ≤ ε|S t | ≤ ε(2tm/n)/8 by the GK guarantee.Since both events for S t occur, we also have |R(y, S t ) − m n R(y, X t )| ≤ εtm/4n (and only εtm/8n in the case that we don't truncate ρm/n to |S|).Thus, | m n R(y, X t )−ρm/n| ≤ εtm/2n.Equivalently, |R(y, X t ) − ρ| ≤ εt/2.

Caveats
There are two serious issues with this summary.The first is that it requires us to know the value of n in advance to perform the sampling.Also, as a byproduct of the sampling, we can only obtain approximation guarantees after we have seen at least 1/64 (or at least some constant fraction) of the items.This means that while the algorithm is sufficient for approximating order statistics over streams stored on disk, more is needed to get it to work for online streaming applications, in which (1) the stream size n is not known in advance, and (2) queries can be answered approximately at all times t ≤ n and not just when t ≥ n/64.
Adapting the idea of our basic streaming summary to work online constitutes the next section and the bulk of our contribution.We start with a high-level overview of our online summary algorithm.In section 3.1 we formally define an initial version of our algorithm whose expected size at any given time is O( 1 ε log 1 ε ) words.In section 3.2 we show that our algorithm gurantees that ∀n∀ρ, P (|R(y, X n ) − ρ| ≤ εn) ≥ 1 − exp(−1/ε).In section 3.3 we discuss the slight modifications necessary to get a deterministic O( 1 ε log 1 ε ) space complexity, and also perform a time complexity analysis.

An online summary
Our algorithm works in rows, which are illustrated in appendix A. Row r is a summary of the first 2 r 32m stream items.Since we don't know how many items will actually be in the stream, we can't start all of these rows running at the outset.Therefore, we start each row r ≥ 1 once we have seen 1/64 of its total items.However, since we can't save these items for every row we start, we need to construct an approximation of this fraction of the stream, which we do by using the summary of the previous row, and join this approximating stream with the new items that arrive while the row is live.We then wait until the row has seen a full half of its items before we permit it to start answering queries; this dilutes the influence of approximating the 1/64 of its input that we couldn't store.
Operation within a row is very much like the operation of our fixed-n streaming summary.We feed the joint approximate prefix + new item stream through a Bernoulli sampler to get a sample stream, which is then fed into a GK summary (which is stored).After row r has seen half of its items, its GK summary becomes the one used to answer quantile queries.When row r + 1 has seen 1/64 of its total items, row r generates an approximation of those items from its GK summary and feeds them as a stream into row r + 1.
Row 0 is slightly different in order to bootstrap the algorithm.There is no join step since there is no previous row to join.Also, row 0 is active from the start.Lastly, we get rid of the sampling step so that we can answer queries over timesteps 1 . . .m/2.
After the first 32m items, row 0 is no longer needed, so we can clean up the space used by its GK summary.Similarly, after the first 2 r 32m items, row r is no longer needed.The upshot of this is that we never need storage for more than six rows at a time.Since each GK summary uses O( 1 ε log 1 ε ) words, the six live GK summaries use only a constant factor more.
Our error analysis, on the other hand, will require us to look back as many as Ω(log 1/ε) rows to ensure our approximation guarantee.We stress that we will not need to actually store these Ω(log 1/ε) rows for our guarantee to hold; we will only need that they didn't have any bad events (as will be defined) when they were alive.

Algorithm description
Our algorithm works in rows.Each row r has its own copy G r of the GK algorithm that approximates its input to ε/8 relative error.For each row r we define several streams: A r is the prefix stream of row r, B r is its suffix stream, R r is its prefix stream replacement (generated by the previous row), J r is the joint stream R r followed by B r , S r is its sample stream, and Q r is a one-time stream generated from G r by querying it with ranks ρ 1 . . .ρ 8/ε , where ρ q = q(ε/8)(m/64).
The prefix stream A r = X(2 r−1 m) for row r ≥ 1, importantly, is not directly received by row r.Instead, at the end of timestep 2 r−1 m, row r −1 generates Q r−1 and duplicates each of those 8/ε items 2 r−1 εm/8 times to get the replacement prefix R r , which is then immediately fed into row r before timestep 2 r−1 m+1 begins.
Each row can be live or not and active or not.Row 0 is live in timesteps 1 . . .32m and row r ≥ 1 is live in timesteps 2 r−1 m+1 . . . 2 r 32m.Live rows require space; once a row is no longer live we can free up the space it used.Row 0 is active in timesteps 1 . . .32m and row r ≥ 1 is active in timesteps 2 r 16m+1 . . . 2 r 32m.This definition means that exactly one row r(t) is active in any given timestep t.Any queries that are asked in timestep t are answered by G r(t) .Given query ρ, we ask G r(t) for ρ/2 r(t) 32 and return the result.
At each timestep t, when item x t arrives, it is fed as the next item in the suffix stream B r for each live row r.B r joined with R r defines the joined input stream J r .For r ≥ 1, J r is downsampled to the sample stream S r by sampling each item independently with probability 1/2 r 32.For row 0, no downsampling is performed, so S 0 = J 0 .Lastly, S r is fed into G r .
Appendix A shows the operation of and the communication between the first six rows.Solid arrows indicate continuous streams and dashed arrows indicate one-time messages.Appendix B is a pseudocode listing of the algorithm.

Error analysis
Define C r = x(2 r 32m+1), x(2 r 32m+2), . . .and Y r to be R r followed by B r and then C r .That is, Y r is just the continuation of J r for the entire length of the input stream.
Fix some time t.All of our claims will be relative to time t; that is, if we write S r we mean S r (t).Our error analysis proceeds as follows.We start by proving that R(y, Y r ) is a good approximation of R(y, Y r−1 ) when certain conditions hold for S r−1 .By induction, this means that R(y, Y r ) is a good approximation of R(y, X = Y 0 ) when the conditions hold for all of S 0 . . .S r−1 , and actually it's enough for the conditions to hold for just S r−log 1/ε . . .S r−1 to get a good approximation.Having proven this claim, we then prove that the result y = y(ρ) of a query to our summary has R(y, X) close to ρ. Lastly, we show that m = O(poly(1/ε)) suffices to ensure that the conditions hold for S r−log 1/ε . . .S r−1 with very high probability (1 − e −1/ε ).Lemma 3.1.Let α r be the event that |S r | > 2m and let β r be the event that any of the first ≤ 2m samples For all r ≥ 1 such that t ≥ t r = 2 r−1 m, and for all items y, if S r−1 is good then Proof.At the end of time t r we have Y r (t r ) = R r (t r ), which is each item y(ρ q ) in Q r−1 duplicated εt r /8 times.If S r−1 (t r ) is good then by theorem 2.1 and the GK guarantee we have that |R(y Fix q so that y(ρ q ) ≤ y < y(ρ q+1 ), where y(ρ 0 ) and y(ρ 1+8/ε ) are defined to be min X t and sup D for completeness.Fixing q this way implies that R(y, Y r (t r )) = 2 r−1 32ρ q .By the above bound on R(y(ρ q ), Y r−1 (t r )) we also have that 2 Putting these two bounds together, and recalling that ρ q = qεm/512, we find that By applying this lemma inductively we can bound the difference between Y r and X = Y 0 : Corollary 3.2.For all r ≥ 1 such that t ≥ t r = 2 r−1 m, if all of S 0 (t 1 ), S 1 (t 2 ), . . ., S r−1 (t r ) are good, then |R(y, Y r ) − R(y, X)| ≤ 2 r+1 εm.
To ensure that all of these S i are good would require m to grow with n, which would be bad.Happily, it is enough to require only the last log 2 1/ε sample summaries to be good, since the other items we disregard constitute only a small fraction of the total stream.

Space and time complexity
A minor issue with the algorithm is that, as written in section 3.1, we do not actually have a bound on the worst-case space complexity of the algorithm; we only have a bound on the space needed at any given point in time.This issue is due to the fact that there are low probability events in which |S r | can get arbitrarily large and the fact that over n items there are a total of Ω(log n) sample streams.The space complexity of the algorithm is O(max |S r |), and to bound this value with constant probability using the Chernoff bound appears to require that max |S r | = Ω(log log n), which is too big.Fortunately, fixing this problem is simple.Instead of feeding every sample of S r into the GK summary G r , we only feed each next sample if G r has seen < 2m samples so far.That is, we deterministically restrict G r to receiving only 2m samples.Lemmas 3.1 through 3.4 condition on the goodness of the sample streams S r , which ensures that the G r receive at most 2m samples each, and the claim of lemma 3.5 is independent of the operation of G r .Therefore, by restricting each G r to receive at most 2m inputs we can ensure that the space complexity is deterministically O( 1 ε log 1 ε ) without breaking our error guarantees.From a practical perspective, the assumption in the streaming setting is that new items arrive over the input stream X at a high rate, so both the worst-case per-item processing time as well as the amortized time to process n items are important.For our per-item time complexity, the limiting factor is the duplication step that occurs at the end of each time t r = 2 r−1 m, which makes the worst-case per-item processing time as large as Ω(n).Instead, at time t r we could generate Q r−1 and store it in O(1/ε) words, and then on each arrival t = 2 r−1 m+1 . . . 2 r m we could insert both x t and also the next item in R r .By the time t r+1 = 2t r that we generate Q r , all items in R r will have been inserted into J r .Thus the worst-case per-item time complexity is O( 1 ε T max GK ), where T max GK is the worst-case per-item time to query or insert into one of our GK summaries.Over 2 r 32m items there are at most 2m insertions into any one GK summary, so the amortized time over n items in either case is O( m log n/32m n T GK ), where T GK is the amortized per-item time to query or insert into one of our GK summaries.
The pseudocode listing in appendix B includes the changes of this section.

Discussion
Our starting point is a very natural idea of Manku et.al. [6] that due to subtle technical difficulties saw no further application to the quantiles problem for sixteen years.This key idea is to downsample the input stream and feed the resulting sample stream into a deterministic summary data structure (compare our figure 1 with figure 1 on page 254 of [6]).At a very high level, we are simply replacing their deterministic O( 1 ε log 2 εn) MRL summary [5] with the deterministic O( 1 ε log εn) GK summary [3].However, as evidenced by the fact that fourteen years after the GK summary was published the state of the art was the randomized O( 1 ε log 3/2 1 ε ) summary of Agarwal et.al. [1] [2], adapting this idea to the GK summary without superconstant overhead is nontrivial.
For each time t after t r , the new item x t changes the rank of y in both streams Y r and Y r−1 by the same additive offset, so |R(y, Y r ) − R(y, Y r−1 )| = |R(y, Y r (t r )) − R(y, Y r−1 (t r ))| ≤ 2 r εm.