Circumventing superefficiency: An effective strategy for distributed computing in non-standard problems

We propose a strategy for computing estimators in some nonstandard M-estimation problems, where the data are distributed across different servers and the observations across servers, though independent, can come from heterogeneous sub-populations, thereby violating the identically distributed assumption. Our strategy fixes the super-efficiency phenomenon observed in prior work on distributed computing in (i) the isotonic regression framework, where averaging several isotonic estimates (each computed at a local server) on a central server produces super-efficient estimates that do not replicate the properties of the global isotonic estimator, i.e. the isotonic estimate that would be constructed by transferring all the data to a single server, and (ii) certain types of M-estimation problems involving optimization of discontinuous criterion functions where M-estimates converge at the cube-root rate. The new estimators proposed in this paper work by smoothing the data on each local server, communicating the smoothed summaries to the central server, and then solving a non-linear optimization problem at the central server. They are shown to replicate the asymptotic properties of the corresponding global estimators, and also overcome the super-efficiency phenomenon exhibited by existing estimators. AMS 2000 subject classifications: Primary 62G05, 62G20, 62G08; secondary 62E20.


Background
Distributed computing has now become significant in the practice of statistics as well as other branches of data science. Large volumes of data, often relating to the same or closely related studies or experiments, are no longer stored on one single computer; rather, they are distributed across a number of platforms in some structured manner, owing partly to natural memory constraints on individual machines, and partly for convenience. This, typically, poses problems for computing optimal estimates of parameters of interest from the data at hand. Conventional statistical estimates are generally obtained under the premise that the totality of the data is accessible to a single computing device and can be processed at one stroke, yielding estimates that are optimal in some quantitatively defined sense. However, this is not automatically the case in a distributed environment. The calculation of global estimates that require simultaneous processing of all available data then entails transferring the entire bulk of data from different computers to a central machine, which in itself can be both time and resource consuming, followed by a potentially complex computation on the aggregated data (of massive volume), which may be infeasible under many circumstances.
Divide and conquer algorithms are a standard approach to addressing these issues in a distributed computing environment. The idea is as follows: suppose the entire data set is stored across a number of machines. On each machine, calculate a natural estimate of the parameter of interest from the data on it and transfer this estimate to a central machine. Next, combine the estimators thus obtained at the central machine in a judicious way to produce a final estimate, the so-called pooled estimate, which replicates the properties of the natural global estimate, i.e. the one we could have computed were it feasible to store and analyze all available data on one machine. The term 'replicates the properties' can be understood in various ways and is often specific to the problem at hand: one might be able to show that the pooled and the global estimates have the same rate of convergence, or are comparable, up to constants, in terms of a certain measure of risk, or it might even be possible to demonstrate that the pooled estimate and the global estimate have the same limit distributions under appropriate conditions. The other important factor is computational burden: one would expect that the divide and conquer algorithm is not substantially more computationally onerous than the global estimator. As the literature on distributed computing is enormous, here we provide a selection of instances of research on distributed computing problems in a variety of statistical/machine-learning contexts: see, e.g. [10], [12], [26], [27], [6], [19], [24]. The above papers illustrate that the sample splitting approach buys computational dividends, yet statistical optimality [in the sense that the resulting estimator is as efficient (or minimax rate optimal) as the global estimate based on applying the estimation algorithm to the entire data set] is retained.
In the nonparametric function estimation context, most results of the divide and conquer (DC) type focus on estimation under smoothness constraints, where the essence of the strategy is to compute a smoothed estimator of the unknown function at each server and combine the estimators at the central server by averaging; this strategy is employed, for example, in [12], [26], [27]. However, the averaging strategy leads to highly problematic pooled estimators in non-regular function estimation problems, e.g. function estimation under a monotonicity constraint, where the least squares estimates under the monotonicity constraint are non-standard/non-regular in the sense that they are non-linear in the data and have non-Gaussian limit distributions. This is the core content of the recent work by [3] [henceforth BDS], where it is demonstrated that in monotone function estimation, the 'pooled-by-averaging' estimator [henceforth, generally referred to as BDSE] becomes super-efficient: its ARE (asymptotic relative efficiency in terms of MSE) with respect to the global monotone least squares estimator computed at any single model goes to infinity, whereas, in the uniform sense, the ARE goes to 0, i.e. the maximal MSE of BDSE over a collection of models relative to that of the global least squares estimator goes to ∞. Furthermore, BDSE has a normal limit distribution different from that of the global estimator, which converges to a Chernoff limit, discussed in detail below. In related work, [20] study M-estimation in non-standard cube-root problems of the type considered originally by [11] and show that the pooled-by-averaging estimator in a distributed computing framework has a different (in fact, normal) asymptotic distribution, as compared to the global M-estimator, which converges to the unique maximizer of an appropriate mean-zero Gaussian process minus a quadratic drift.
Our goal in this paper is to propose new estimators under the DC framework in both the monotone function estimation problem as well as in certain versions of the M-estimation setting of [20] which do not suffer from the super-efficiency problem of the pooled-by-averaging estimators and which also recover the limiting properties of the corresponding global estimators. To this end, we first provide some details of the problem considered in BDS and the results obtained therein, as well as those dealt with in [20], as they are crucial to understanding the goal and the approach of the current work.
Consider a sample of size N (very large) from the model Y_i = μ(X_i) + ε_i which is distributed across m different servers, each server containing a subsample of size n, and m = o(N). The function μ is known to be monotone and the X_i come from a density on [0, 1]. The residual ε_i is assumed to satisfy E(ε_i | X_i) = 0. Computing the global isotonic estimate at a point t_0 ∈ (0, 1) involves moving all the data to a central server and performing the isotonization on all N data points on the central server. This can be time-consuming when N is really large. The construction of BDSE involves computing the isotonic estimate of μ, say μ_j, on the j'th server, and then obtaining the average of these isotonic estimates: the pooled estimate at the point t_0 is m^{-1} Σ_{j=1}^m μ_j(t_0). BDS showed that their pooled-by-averaging estimator (BDSE) of the inverse function has dichotomous behavior. We briefly revisit this important result. For convenience and the sake of completeness, we state these results essentially in their entirety.
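The pooled-by-averaging construction can be sketched in a few lines. Below is a schematic Python illustration (the names pava_decreasing and bdse are ours, not from BDS): each server fits a nonincreasing least squares estimate via the pool-adjacent-violators algorithm, and the central server simply averages the fitted values at t_0.

```python
def pava_decreasing(y, w=None):
    """Least-squares nonincreasing fit via Pool Adjacent Violators."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge adjacent blocks while the nonincreasing order is violated
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

def bdse(x_by_server, y_by_server, t0):
    """Pooled-by-averaging: isotonic fit on each server, averaged at t0."""
    estimates = []
    for xs, ys in zip(x_by_server, y_by_server):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        xs_s = [xs[i] for i in order]
        fit = pava_decreasing([ys[i] for i in order])
        # evaluate the left-continuous piecewise-constant fit at t0
        j = max((i for i, x in enumerate(xs_s) if x <= t0), default=0)
        estimates.append(fit[j])
    return sum(estimates) / len(estimates)
```

It is precisely the final averaging step of this construction that is responsible for the dichotomous behavior revisited next.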

Consider a nonincreasing and continuously differentiable function μ_0 on [0, 1]. For an x_0 ∈ (0, 1), define a neighborhood M_0 of μ_0 as the class of all continuous nonincreasing functions μ on [0, 1] that are continuously differentiable on [0, 1] and coincide with μ_0 outside a neighborhood of x_0. The global inverse isotonic estimator of μ^{-1}(a) converges in distribution, after N^{1/3} scaling, to G as N → ∞, where G = κ Z in distribution, with Z following the Chernoff distribution and κ > 0 being a constant. Writing N = m × n, where m and n are as defined above, the BDSE of μ^{-1}(a), say θ_m, satisfies √m n^{1/3}(θ_m − μ^{-1}(a)) → H in distribution as N → ∞, where H has the same variance as G but is distributed differently from G. Furthermore, the pointwise MSE of BDSE relative to that of the global estimator goes to 0 as N → ∞. Hence, BDSE outperforms the global inverse isotonic regression estimator in terms of pointwise MSE. This phenomenon is reversed when one looks at the maximal MSEs of the two estimators over the class of models defined by M_0, as described in Theorem 5.1 of BDS.
Let E denote the asymptotic maximal risk of the global estimator over M_0 and E_m that of the pooled estimator, where the subscript m indicates that the maximal risk of the m-fold pooled estimator (m fixed) is being considered. Then E < ∞ while E_m ≥ m^{2/3} c_0 for some c_0 > 0, and the lower bound continues to hold when m = m_n diverges to infinity. Therefore, from Theorem 1.1 it follows that the asymptotic maximal risk of BDSE diverges to ∞ at rate (at least) m^{2/3}. Thus, the better off we are with BDSE for a fixed function, the worse off we are in the uniform sense over the class of functions M_0. Hence, unfortunately, while maintaining a computational burden that is better than the global estimator, BDSE has the undesirable statistical properties seen above.
As discussed above, similar phenomena arise in the cube-root M-estimation problems of the Kim and Pollard type [11], studied in [20]. In [11] the M-estimator is defined as the location of the maximum of an empirical process P_N g(·, θ) = N^{-1} Σ_{i=1}^N g(X_i, θ), where θ ∈ Θ ⊂ R^d, X_1, . . . , X_N are i.i.d. random variables and g(·, θ) is the criterion function that is optimized. The maximizer θ̂, say, estimates θ_0, the unique maximizer of P g(X_1, θ), where P is the common distribution of the X_i's. The cube-root convergence rate is an outcome of the so-called 'sharp-edge effect' - the fact that the g's are discontinuous in θ, coupled with a decline of order ‖θ − θ_0‖² in P g(X_1, θ) around θ_0. In the distributed computing setting of [20], the N observations are stored across m servers with the j'th server containing n_j observations, and the M-estimator θ̂^(j) on the j'th server maximizes the analogous criterion over the data on the j'th server. The pooled estimator θ̄ := Σ_{j=1}^m (n_j/N) θ̂^(j) reduces to the simple average when all subsamples have the same size. Theorem 2.1 of [20] provides the asymptotic distribution of the pooled estimator: it is seen that the estimator converges at rate m^{-1/2} n^{-1/3} to a normal distribution, faster than the N^{-1/3} rate of the global estimator. This parallels the results established in Sections 3 and 4 of BDS in the isotonic regression context.
As in BDS, [20] also encounter the super-efficiency phenomenon, which is discussed in Section B of their supplement for the location estimator (Section B.1) and the value-search estimator (Section B.2), two of the examples treated in their paper. They demonstrate in both problems that the maximal MSE of the pooled-by-averaging estimator over a collection of models in a neighborhood of a fixed model diverges to ∞ with N , while the maximal MSE of the global estimator remains bounded.
In both BDS and [20], super-efficiency results from computing the nonstandard estimator at each local machine and then averaging these estimators at the central server. To avoid this undesirable phenomenon, the key idea is to reverse these steps, i.e., first average the data on each local server in an appropriate manner (which will typically depend on the structure and the dimension of the problem) to obtain essentially sufficient summary statistics, which are then transferred to the central server. The summary statistics are now used to compute a non-standard 'pooled estimator' (via an adaptation of the M-estimation procedure used to solve for the global estimator) that replicates the properties of the global estimator and manages to avoid super-efficiency. The term essentially sufficient is used in the sense that these summary statistics are enough to compute an estimator that matches the performance of the global estimator. We will illustrate the idea in detail for the isotonic regression problem studied in BDS and the location search problem considered by [20], but the prescription itself can be expected to work in a broader class of problems, subject to appropriate fine-tuning. Furthermore, for our analysis, we address a broader scenario beyond i.i.d. data. Since we are thinking of large N problems, with the data being stored separately in different servers, it is natural to allow heterogeneity across servers. Thus, while our N observations will be assumed to be independent, we will no longer consider them to be identically distributed; rather, they will be assumed to come from a number of different (m) sub-populations with the data within each sub-population being i.i.d. The different sub-populations will be linked by a common parameter of interest.
Furthermore, the N pairs will be scrambled across a number of different servers (say L), with the same server hosting data from different sub-populations, as well as data from the same sub-population potentially stored on multiple servers.

The new estimator for the regression function
Assume that we have m samples of respective sizes n_1, . . . , n_m and that for all j = 1, . . . , m, the j-th sample is composed of i.i.d. pairs of real-valued random variables (X_ji, Y_ji), i = 1, . . . , n_j, such that E(Y_ji | X_ji) = μ(X_ji) for all i, j, where μ is an unknown regression function defined on [0, 1]. We denote by F_Xj the common distribution function of the covariates X_ji, i = 1, . . . , n_j in the j-th sample. The data are stored on several servers numbered 1, . . . , L for some integer L ≥ 1. The allocation of data to the different servers is arbitrary in the sense that a sample can be spread over several servers, a server can host data from several different samples, and the number of stored observations can vary across the different servers. The number L of different servers can even grow as N → ∞. The total sample size is N = Σ_{j=1}^m n_j. For ease of exposition, when considering all the samples simultaneously, we relabel the observations from the m samples to obtain independent pairs (X_i, Y_i), i = 1, . . . , N such that E(Y_i | X_i) = μ(X_i), where the distribution function of X_i is one of F_X1, . . . , F_Xm. Let K be a positive integer that grows to infinity as N → ∞, and for all k ∈ {1, . . . , K}, let I_k = ((k − 1)/K, k/K]. Let S_ℓ denote the set of indices i such that (X_i, Y_i) is stored on the ℓ-th server. Now, for each server ℓ (1 ≤ ℓ ≤ L), record the within-bin summaries T_{ℓk} = Σ_{i ∈ S_ℓ: X_i ∈ I_k} Y_i and C_{ℓk} = #{i ∈ S_ℓ: X_i ∈ I_k}, for k ∈ {1, . . . , K}. Next, for each ℓ, transfer {(T_{ℓk}, C_{ℓk})}_{k=1}^K to a central server. Compute a regressogram estimate on the central server in the following manner: for each k ∈ {1, . . . , K}, set ȳ_k = Σ_ℓ T_{ℓk} / Σ_ℓ C_{ℓk}, the average of the responses whose covariates fall in I_k. Our final estimator of (μ(x_1), . . . , μ(x_K)), where x_k = k/K, is the minimizer of the isotonic least squares criterion (2.1) fitted to the ȳ_k with weights Σ_ℓ C_{ℓk}. The estimator of the regression function μ is obtained by piecewise-constant interpolation.
Note that the estimator does not depend on the way the observations were stored across different servers.
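For concreteness, here is a minimal Python sketch of the pipeline just described (the names server_summaries and pooled_isotonic are ours; giving empty bins negligible weight is our choice, not addressed in the text): each server communicates only the K pairs (T_{ℓk}, C_{ℓk}), and the central server forms the regressogram and isotonizes it by weighted pool-adjacent-violators.

```python
import math

def pava_decreasing(y, w):
    """Weighted least-squares nonincreasing fit via Pool Adjacent Violators."""
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

def server_summaries(xs, ys, K):
    """Local step: binned sums T and counts C for bins I_k = ((k-1)/K, k/K]."""
    T, C = [0.0] * K, [0] * K
    for x, y in zip(xs, ys):
        k = min(K - 1, max(0, math.ceil(x * K) - 1))
        T[k] += y
        C[k] += 1
    return T, C

def pooled_isotonic(summaries, K):
    """Central step: aggregate summaries, form the regressogram, isotonize."""
    T, C = [0.0] * K, [0] * K
    for Tl, Cl in summaries:
        for k in range(K):
            T[k] += Tl[k]
            C[k] += Cl[k]
    # bin means; empty bins get negligible weight so they do not affect the fit
    ybar = [T[k] / C[k] if C[k] > 0 else 0.0 for k in range(K)]
    return pava_decreasing(ybar, [c if c > 0 else 1e-12 for c in C])
```

Note that the aggregation over servers is a plain sum, which is why the result is invariant to how the observations were scrambled across servers.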

Computational considerations
Consider the computational burden for the new estimator. Assume, for now, that K ∼ N^ζ for some 0 ≤ ζ < 1. First, focus on the computational time it takes to calculate (T_{ℓk}, C_{ℓk}) for all ℓ and 1 ≤ k ≤ K. For each X_i, one has to determine in which interval I_k it falls, and then assign the pair (X_i, Y_i) to the interval I_k. This can be accomplished in O(log N^ζ) = O(log N) time. Since there are N such points (scrambled across the different servers), the total time taken is O(N log N). Next, computing (C_{ℓk}, T_{ℓk}) for a fixed ℓ involves fewer than 2 n_{ℓk} additions, where n_{ℓk} is the number of (X_i, Y_i) pairs assigned to I_k on server ℓ.
Hence the total computing time remains of order N log N, provided L (which could grow with N) and ζ are not too large. In addition to the total computing time, the burden also involves transferring about 2LK ∼ L N^ζ numbers between machines, which is larger than the amount of data transferred in the construction of BDSE. As shall be seen below, with K slightly larger than N^{1/3} - say K ∼ N^{1/3 + η_1} (η_1 small) - and m of a smaller order than N^{1/3}, the new estimator is able to recover the properties of the global estimator: hence, so long as the number of machines is not too large - say L = N^{1/3 − η_2} - the total amount of data required to be transferred is of order N^{2/3 + η_1 − η_2} = o(N^{2/3}) when η_2 > η_1.
Note that the computation of the global isotonic estimator in this situation would require transferring all data points to the central server, a transfer of order exactly N, and the isotonic algorithm at the central server would take O(N log N) time. Note also that the minimum amount of data transfer needed for the new estimator above is of order K (this happens when the number of servers L is held fixed) and therefore of larger order than N^{1/3}. On the other hand, in the scenario of BDS, where L = m, the BDSE is constructed using m sub-samples where m is of order at most N^{1/4}: this corresponds to a data transfer of order at most N^{1/4} numbers to construct the super-efficient estimator at any given point. The additional amount of data that needs to be transferred to construct the new estimator can be viewed as the cost of alleviating the super-efficiency phenomenon exhibited by BDSE.

Characterization of the new estimators
It is a standard result in isotonic regression that the minimum in (2.1) is achieved at a unique vector (ŷ_1, . . . , ŷ_K)^T. We give below a characterization of the minimizer. In the sequel, we consider the piecewise-constant left-continuous estimator μ_N that is constant on the intervals [0, x_1] and (x_{k−1}, x_k] for all k = 2, . . . , K, and such that μ_N(x_k) = ŷ_k for all k = 1, . . . , K. Let F_N be the empirical distribution function corresponding to X_1, . . . , X_N and let Λ_N be the piecewise-constant right-continuous process on [0, 1] that is constant on the intervals [0, x_1) and [x_{k−1}, x_k) for all k = 2, . . . , K, such that Λ_N(x_j) = N^{-1} Σ_{i=1}^N Y_i 1{X_i ≤ x_j} for all j = 1, . . . , K, and Λ_N(0) = 0. Then μ_N is the left-hand slope of the least concave majorant of the cumulative sum diagram defined by the set of points {(F_N(x_k), Λ_N(x_k)), k = 0, . . . , K}, where x_0 = 0. We define the corresponding inverse estimator as U_N(a) = argmax over {x_0, x_1, . . . , x_K} of the process Λ_N − a F_N, where x_0 = 0, argmax denotes the greatest location of the maximum, and where we recall that for every nonincreasing left-continuous function h : [0, 1] → R, the generalized inverse of h is defined as: for every a ∈ R, h^{-1}(a) is the greatest t ∈ [0, 1] that satisfies h(t) ≥ a, with the convention that the supremum of an empty set is zero. To see that U_N = μ_N^{-1}, note from the characterization above of μ_N as the slope of a least concave majorant that for all a ∈ R and t ∈ (0, 1] we have the equivalence μ_N(t) ≥ a if and only if U_N(a) ≥ t, with an analogous equivalence for t = 0. We study below the asymptotic properties of U_N(a) for arbitrary a and use these to deduce the asymptotic properties of μ_N(t) for a fixed t ∈ (0, 1) using the switch relation, which holds for all t ∈ (0, 1] and a ∈ R. It will be useful to also record similar characterizations of the global estimator μ_{N,G} of μ, for the sake of completeness.
Recall that the global estimator is the isotonic estimator that we would compute if all the data {X_i, Y_i}_{i=1}^N could have been brought over (or were already there) on a central server. Letting Λ_{N,G} denote the analogue of Λ_N based on the unbinned data, the global estimator μ_{N,G} is the left-hand slope of the least concave majorant of the corresponding cumulative sum diagram, with inverse estimator U_{N,G} defined analogously. Then U_{N,G}(a) = μ_{N,G}^{-1}(a) and, similar to the pooled estimator, we have the switch-relation characterization μ_{N,G}(t) ≥ a if and only if U_{N,G}(a) ≥ t, which holds for all t ∈ (0, 1] and a ∈ R.
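The generalized inverse and the switch relation are easy to make concrete. The following is a small sketch (our naming), given the fitted nonincreasing step-function values on the grid x_k = k/K:

```python
def generalized_inverse(fit, a, K):
    """U_N(a): the greatest grid point x_k = k/K whose fitted value is >= a.

    `fit` is the nonincreasing vector of fitted values at x_1, ..., x_K;
    by convention the supremum of an empty set is 0.
    """
    best = 0.0
    for k, yk in enumerate(fit, start=1):
        if yk >= a:
            best = k / K
    return best
```

The switch relation μ_N(t) ≥ a ⇔ U_N(a) ≥ t can then be verified directly on the grid, as in the checks below.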

Notation and assumptions
In the sequel, we denote by g the generalized inverse of μ and by E_X the conditional expectation given X_1, . . . , X_N. Being the inverse of μ, g is only defined on the interval [μ(1), μ(0)]; we extend g to the whole real line by setting g(a) = 0 for all a > μ(0) and g(a) = 1 for all a < μ(1). Furthermore, for all x ≥ 0, [x] denotes the integer part of x. We denote by F_X the mixing distribution function F_X = Σ_{j=1}^m (n_j/N) F_Xj. Note that this function depends on N, but for notational convenience this is not made explicit in the notation.
To develop the asymptotic properties of the proposed estimator, we will impose some further conditions on the model. These are: A1. F_X has a density function f_X on [0, 1] that satisfies C_1 ≤ f_X(t) ≤ C_2 for all t ∈ [0, 1], for some positive numbers C_1 and C_2 that do not depend on N.
for all i, with probability one. A3. The regression function μ satisfies: Remarks on the assumptions: Assumption (A1) is fulfilled, for instance, if each F_Xj, j = 1, . . . , m has a density function f_Xj that is bounded above and bounded away from zero on [0, 1]. Note that Assumption (A3) is weaker than differentiability; it implies that μ is both Lipschitz and, so to speak, inverse-Lipschitz. It also implies that the inverse function g defined above is continuous. Assumption (A4) is critical to recovering the Chernoff-type asymptotics for the pooled estimator: that K grows faster than N^{1/3} ensures that the data are averaged over bins of length smaller than N^{-1/3}, so that the isotonic algorithm operating on these averages at the central machine can still recover the N^{-1/3} convergence rate. If K were to grow exactly at the rate N^{1/3} or slower, the pooled estimator would no longer exhibit Chernoff-type cube-root asymptotics. Furthermore, in (A4), we assume that the proportion n_j/N of observations from the j-th sample is at least of order N^{-1/3} (log N)^3. This also plays a critical role in the subsequent analysis. The conditions in (3.4) imply that the number m of different sub-samples cannot grow too fast: since the proportions sum to one, we must have m ≤ N^{1/3} (log N)^{-3}.

Uniformly bounded MSE property of the new estimators
The Inverse Problem: We first demonstrate that the new estimator in the inverse problem exhibits uniformly bounded maximal risk (MSE) over an appropriate class of models as N grows to ∞. This is an analogue of the first result in Theorem 4.1 of BDS for the global isotonic estimator of the inverse function, though it is established here under weaker conditions. For this task, we denote by F_1 the class of nonincreasing functions μ on [0, 1] that satisfy (3.3) and sup_t |μ(t)| ≤ C_5, where C_5 > 0. The proof of the following theorem is given in Section 6.
The Direct Problem: An analogue of the second result in Theorem 4.1 of BDS, demonstrating that the new estimator fixes the super-efficiency phenomenon in the direct problem as well, i.e. that μ_N has bounded uniform MSE as N → ∞ over the class F_1, can also be established. As it involves some additional technical fine-tuning, we relegate its proof to the Appendix.
We show next that under a fixed μ, the new estimator recovers the asymptotic distribution of the global estimator with the same convergence rate.

Asymptotic distributions
To establish asymptotic distributions for our new estimators, we make additional assumptions in the case that the number m of different samples goes to infinity, and we clarify the asymptotic setting further.
When considering the case where m is allowed to grow to infinity as N → ∞, we assume that there is a sequence of unknown distinct distributions {P_j}_{j≥1} such that our set of observations is part of an infinite sequence of pairs (X_i, Y_i), i ≥ 1. To fix ideas, possibly rearranging the probabilities in the sequence {P_j}_{j≥1}, we assume without loss of generality in the sequel that for all N, the m = m_N distributions that appear across the first N observations are P_1, . . . , P_m. Note that the setting does not exclude that m_N = 1 for all N, i.e. that all observations are drawn from the same distribution P_1. In the case where m_N > 1 for sufficiently large N, it is not excluded that m_N remains bounded. In the sequel, for all j ≥ 1, we denote by σ_j the conditional standard deviation function of Y given X = u, for all u ∈ [0, 1], and by f_j the density function of X, which is assumed to exist, where (X, Y) has distribution P_j. Then the distribution function F_X in (3.1) has a density function f_X on [0, 1] given by f_X(u) = Σ_{j=1}^m (n_j/N) f_j(u) for all u ∈ [0, 1]. We next make the following technical assumptions.
Ã_2. The density function f_X converges pointwise [and hence, uniformly] on [0, 1] as N → ∞ to a continuous function f_∞ that is bounded away from zero. This implies that (3.2) holds for some positive numbers C_1, C_2 that do not depend on N, provided that N is sufficiently large. Ã_3. The function σ²_X, defined for all u ∈ [0, 1] as the conditional variance of Y given X = u in the combined sample, converges pointwise [and hence, uniformly] to a continuous function σ²_∞, bounded away from 0, as N → ∞.
Ã_5. The function μ is decreasing and has a continuous first derivative. For notational convenience, we do not make it explicit in the notation that F_X, f_X, σ_X, and m may depend on N.

Remark:
The pointwise convergence of f_X to f_∞ implies uniform convergence because, by assumption Ã_1, the class of functions {f_j} is uniformly equicontinuous, which in turn implies that the class {f_X} is also uniformly equicontinuous.
Also, the pointwise convergence of σ²_X to σ²_∞ guarantees uniform convergence, because the class of functions {σ²_X} is uniformly equicontinuous: this follows from the uniform boundedness of the class {f_j} assumed in Ã_0, the uniform boundedness of {σ²_j}, which is a consequence of Ã_4, and the uniform equicontinuity of the classes {f_j} and {σ²_j} assumed in Ã_1.
Theorem 3.3. Under Assumptions Ã_1 through Ã_4 and A4, N^{1/3}(U_N(a) − g(a)) converges in distribution to a scaled version of Z, where Z := argmax_{u∈R} {W(u) − u²}, W being a standard two-sided Brownian motion starting at 0, has the so-called Chernoff distribution.
An interesting feature of the estimator U_N is that its asymptotic behavior does not depend on the way the N data points are allocated across the different servers. The direct estimator μ_N shares this feature, as is shown in the next result.

Theorem 3.4. Under the same assumptions as in Theorem 3.3, N^{1/3}(μ_N(t) − μ(t)) converges in distribution to a scaled version of Z, where Z is as defined in Theorem 3.3.
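Chernoff's distribution has no closed form, but draws of Z = argmax_u {W(u) − u²} can be approximated by simulating a two-sided Brownian motion on a fine grid and locating the maximizer. A rough Monte Carlo sketch (the grid width and truncation point are our choices, not from the text):

```python
import math
import random

def sample_chernoff(rng, grid_max=3.0, step=0.01):
    """One approximate draw of Z = argmax_u {W(u) - u^2}.

    Simulates the two-sided Brownian motion on the grid {±step, ±2*step, ...}
    truncated at ±grid_max, and returns the grid point maximizing W(u) - u^2.
    """
    n = int(grid_max / step)
    sd = math.sqrt(step)          # increment standard deviation
    best_u, best_v = 0.0, 0.0     # value at u = 0 is W(0) - 0 = 0
    for sign in (1.0, -1.0):      # the two sides of the Brownian motion
        w = 0.0
        for i in range(1, n + 1):
            u = sign * i * step
            w += rng.gauss(0.0, sd)
            v = w - u * u
            if v > best_v:
                best_u, best_v = u, v
    return best_u
```

The quadratic drift confines the maximizer near the origin, so the truncation at ±grid_max is harmless for moderate grid_max; the draws are symmetric about 0.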

Remark:
The estimators μ_N(t) and U_N(a) have the same asymptotic distributions (when centered around their respective estimands and scaled by the factor N^{1/3}) as the corresponding global isotonic estimators μ_{N,G} and U_{N,G} defined in (2.6) and (2.5) respectively. In other words, the asymptotic distributions of N^{1/3}(U_{N,G}(a) − g(a)) and N^{1/3}(μ_{N,G}(t) − μ(t)) are those arising in Theorems 3.3 and 3.4 respectively. The limit distributions of the global estimators can be established by the same set of techniques as used in the proofs of Theorems 3.3 and 3.4. Thus, the new estimators proposed in this paper not only circumvent the super-efficiency phenomenon but also recover the asymptotic properties of their corresponding global versions. We note that the global isotonic estimators μ_{N,G}(t) and U_{N,G}(a) also possess the uniformly bounded maximal MSE property for their respective estimands, i.e. exact analogues of the results in Theorems 3.2 and 3.1 hold for μ_{N,G} and U_{N,G} respectively, and can be established by techniques similar to those used in the proofs of these two theorems.

Remark:
The setting of the theorems in this section, with a growing sequence of sub-populations such that conditions Ã_1 through Ã_5 hold, is not difficult to satisfy. Consider, for example, m = N^{1/4} and P_j with covariate density f_j = (1 − ε_j) f_0 + ε_j f_1, where f_0 and f_1 are Lipschitz continuous densities bounded away from 0 and ∞ on [0, 1], 0 < ε_j < 1 for all j, the sequence {ε_j} is decreasing to 0, and Σ_{j=1}^m ε_j = o(m), which is easy to arrange. Let the errors be independent of the X_i's, which are themselves mutually independent, and let μ satisfy all the desired conditions in this manuscript, in particular Ã_5. Then it is easy to check that all five conditions at the beginning of this section hold, with f_∞ = f_0. The proof of Theorem 3.3 is in the Appendix. The proof of Theorem 3.4 follows.
Proof of Theorem 3.4. It follows from the switch relation (2.4) that, for every fixed x ∈ R, the probability P(N^{1/3}(μ_N(t) − μ(t)) ≤ x) can be rewritten as a probability statement about U_N, so that the result follows from Theorem 3.3 using that the Chernoff distribution of Z is continuous (see e.g. [9]).

The location parameter problem
The location parameter problem is one of the examples studied by [20] (Section 3.1) and falls in the genre of general cube-root M-estimation problems introduced by [11]. As discussed in Section 1, in the framework of [11], a generic estimator maximizes an empirical process P_N g(·, θ) = N^{-1} Σ_{i=1}^N g(ξ_i, θ). Our strategy is, for each grid point x_k = k/K, k ∈ {1, . . . , K}, to compute the partial sums Σ_{i ∈ S_ℓ} g(ξ_i, x_k) on each server ℓ ∈ {1, . . . , L}, transfer each summary to the central server, and sum up to obtain Σ_{i ≤ N} g(ξ_i, x_k) for each k. The final estimator is computed as the argmax of Σ_{i ≤ N} g(ξ_i, θ) over θ ∈ {x_1, . . . , x_K}. For this 'pooled estimator' to recover the properties of the global estimator, we would expect that N^{1/3} = o(K), as in the isotonic regression problem. We present the detailed analysis below for location estimation.

The set-up, the estimator and assumptions
Assume that we have m samples of respective sizes n_1, . . . , n_m and that for all j = 1, . . . , m, the j-th sample is composed of i.i.d. random variables X_ji, i = 1, . . . , n_j, with common distribution P_j, such that for a fixed bandwidth r_0 there exists a unique θ_0, with both θ_0 − r_0 and θ_0 + r_0 in (0, 1), that maximizes the probability of the window [θ − r_0, θ + r_0] under each P_j. One special case of the above is the situation where each P_j is unimodal with a common mode (across j). We denote by F_Xj the common distribution function of the variables X_ji, i = 1, . . . , n_j in the j-th sample. The data are stored on several servers numbered 1, . . . , L for some integer L ≥ 1, and we allow the possibility that L grows with N ≡ Σ_{j=1}^m n_j, the total sample size. The allocation of data to the different servers is arbitrary in the sense that a sample can be spread over several servers, a server can host data from several different samples, and the number of stored observations can vary across the different servers.
For ease of exposition, when considering all the samples simultaneously, we relabel the observations from the m samples to obtain independent variables X_i, i = 1, . . . , N, whose distribution function is one of F_X1, . . . , F_Xm. Let K be a positive integer that grows to infinity as N → ∞ and, for all k ∈ {1, . . . , K}, let I_k = ((k − 1)/K, k/K] and x_k = k/K. On each server, record the counts of observations falling in each bin I_k and transfer these K counts to a central server. Compute an empirical distribution function on the central server in the following manner: Λ_N is the piecewise-constant right-continuous process on [0, 1] that is constant on the intervals [x_{j−1}, x_j), satisfies Λ_N(x_j) = N^{-1} Σ_{i=1}^N 1{X_i ≤ x_j} for all j = 1, . . . , K, and Λ_N(0) = 0. We define the estimator θ_N of θ_0 as the greatest location of the maximum of θ ↦ Λ_N(θ + r_0K) − Λ_N(θ − r_0K), where r_0K = [K r_0]/K, argmax denotes the greatest location of the maximum, and the maximum is taken over grid points θ ∈ {x_1, . . . , x_K} for which both θ − r_0K and θ + r_0K lie in [0, 1]. In the sequel, we use the same notation F_X as in (3.1) and we make the following assumption: S. The number of bins K satisfies K^{-1} = o(N^{-1/3}) and there exists λ ∈ (0, 1], possibly depending on N, that satisfies a suitable lower-bound condition on the sampling proportions n_j/N.
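A minimal Python sketch of the estimator just described follows (the names server_counts and location_estimator are ours; the window arithmetic on bin indices is only an approximation of the interval of half-width r_0K around the candidate grid point):

```python
import math

def server_counts(xs, K):
    """Local step: counts of observations per bin I_k = ((k-1)/K, k/K]."""
    C = [0] * K
    for x in xs:
        C[min(K - 1, max(0, math.ceil(x * K) - 1))] += 1
    return C

def location_estimator(counts_by_server, K, r0):
    """Central step: aggregate the bin counts, then return the (greatest)
    grid point whose window of half-width ~r0 captures the most mass."""
    C = [0] * K
    for Cl in counts_by_server:
        for k in range(K):
            C[k] += Cl[k]
    rk = int(K * r0)  # discretized bandwidth r_0K = [K r0]/K, in bins
    best_k, best_count = None, -1
    for k in range(rk, K - rk):            # keep the window inside (0, 1)
        count = sum(C[k - rk:k + rk + 1])  # mass near grid point (k+1)/K
        if count >= best_count:            # ">=": greatest argmax convention
            best_k, best_count = k, count
    return (best_k + 1) / K
```

As in the regression case, only K numbers leave each server, and the result does not depend on how the observations were scrambled across servers.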

Theoretical properties of the pooled estimator
We demonstrate below that the new estimator in the location parameter problem exhibits uniformly bounded maximal risk (MSE) over an appropriate class of models as N grows to ∞. An asymptotic distributional result under some further assumptions, along the lines of the results in Section 3.3, can also be established, but is skipped. For fixed positive numbers C_1, C_2, C_3, a, ε, δ that do not depend on N, we denote by F_1 the class of densities f_X on [0, 1] that satisfy (4.3) and (4.4), and for which the primitive F_X of f_X satisfies (4.5). It follows from the definition of θ_0 and F_X that θ_0 is the unique location of the maximum of θ ↦ F_X(θ + r_0) − F_X(θ − r_0). Hence, the supremum in (4.5) is strictly negative for all a > 0, and we consider a class F_1 of functions where it is uniformly negative. The first condition in (4.3) is satisfied with some (small) ε, for instance, if f_X is increasing on [0, θ_0] and decreasing on [θ_0, 1], whereas (4.4) holds if f_X is continuous in neighborhoods of both θ_0 − r_0 and θ_0 + r_0, and a is chosen sufficiently small.
We denote by E_f the underlying expectation when the distributions of the observations are such that the true density of the X_i's is f.

Discussion
We have proposed new estimators for distributed computing in the isotonic regression problem and a prototypical cube-root estimation problem of the genre considered in [11] that preserve the convergence rates of the corresponding global estimators and do not suffer from the super-efficiency phenomenon. The key change from the BDS procedure or the procedure in [20] lies in smoothing the data on local servers followed by solving a non-linear optimization problem on the central server. In the isotonic setting, this is referred to as an 'SI' (smoothing-isotonization) procedure. We note here that such 'SI' procedures and their converse ('IS') procedures have been studied in monotone function problems, though not in distributed computing environments and not under the heterogeneity setting of our paper. See, for example, [16], [14], [22], [1] and [8].
An interesting topic for future work is to understand distributed computing and inference for non-standard problems in higher dimensions, e.g. the maximum score estimator treated in both [11] and [20] where the parameter is a p-dimensional vector (p > 1). For example, the partitioning into bins strategy used above that works well for 1-dimensional problems has a downside in larger dimensions, since the bins become hyper-cubes whose number increases quickly with p. This entails increased levels of communication among the different machines that an effective distributed computing strategy would seek to avoid.
Reverting to the isotonic regression problem, the ideas in this paper also have connections to other work in the monotone function literature that are worth mentioning. [25] study isotonic estimation of a decreasing density from histogram-type summaries of i.i.d. data, under a once-differentiable assumption on the density. The domain of the density is split into bins, and the counts in each bin are available. When the number of bins grows at a rate faster than n 1/3 , Theorem 4.6 of that paper shows that the isotonic estimate based on binned data recovers the Chernoff-type asymptotic distribution of the classical Grenander estimator. A similar phenomenon transpires in our problem. The (C k , T k ) pair records the number of observations in the bin I k and the sum of the responses in that bin respectively, for the ℓ-th server. Once these are transferred to the central server, we sum across servers to find the total number of observations in I k and the sum of the responses corresponding to all those observations, and construct our isotonic estimator using these statistics. In our problem, K grows faster than N 1/3 and we obtain a Chernoff limit for the pooled estimator.
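As an illustration of this pooling step, the following sketch (hypothetical function names; a decreasing regression function is assumed, matching the monotone setting of the proofs) sums the per-bin counts and response totals across servers and isotonizes the resulting bin means with a weighted pool-adjacent-violators pass:

```python
import numpy as np

def pava_decreasing(y, w):
    """Weighted pool-adjacent-violators algorithm for a non-increasing fit."""
    vals, wts, lens = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); lens.append(1)
        # merge adjacent blocks while they violate the decreasing order
        while len(vals) > 1 and vals[-2] < vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / wtot
            wts[-2] = wtot
            lens[-2] += lens[-1]
            vals.pop(); wts.pop(); lens.pop()
    return np.repeat(vals, lens)

def pooled_isotonic_regressogram(local_counts, local_sums):
    """Central server: local_counts[l][k] and local_sums[l][k] play the roles
    of C_k and T_k for server l; sum across servers, form bin means, and
    isotonize with the pooled counts as weights."""
    C = np.sum(local_counts, axis=0)
    T = np.sum(local_sums, axis=0)
    means = np.divide(T, C, out=np.zeros(len(C)), where=C > 0)
    return pava_decreasing(means, np.maximum(C, 1e-12))
```

Each server communicates only 2K numbers, and the isotonization happens once, at the central server, which is what avoids the averaging-induced super-efficiency.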
This naturally raises the question as to how the number of bins K for the smoothing step on the local servers would influence the distribution of the estimators developed in this paper. When N 1/3 = o(K), the grid is sufficiently dense and the corresponding bins sufficiently small, so that our isotonized regressogram estimator recovers the asymptotics of the classical, i.e. global isotonic regression estimator, but this will no longer be the case when K ∼ N 1/3 or K = o(N 1/3 ). When K ∼ N 1/3 , the results of [25] (Theorem 3.3 and Corollary 4.4) and [21] (Theorem 3.7) who study monotone function estimation with covariates supported on a grid indicate that the limit distribution of the isotonized regressogram estimator at a point will neither be normal, nor will it be given by Chernoff's distribution. When K = o(N 1/3 ), the grid is sparse enough and therefore, the regressogram estimates are ordered with probability increasing to one, so that the isotonized regressogram estimator agrees with the original estimator with increasing probability, and the results in [25] (Theorem 4.1) and [21] (Theorem 3.1) suggest an asymptotic normal distribution for our proposed estimator. We do not go into a full investigation of the details of these asymptotics in the distributed setting, since this is not relevant to the goal of the current work: produce a pooled estimator whose properties mimic the global estimator.
Some limited simulation results illustrating the role of K are presented in the Appendix. As noted above, for the pooled estimator to recover the properties of the global isotonic estimator, we need K to be of larger order than N 1/3 , but to keep data-transfer costs low we would also like K to be not much larger. Since issues with distributed computing are only important for substantially large N , we investigated how our proposed estimator behaves in terms of K when N is on the order of millions or larger. It turns out that even a logarithmic adjustment, i.e. K ∼ N 1/3 log N , performs very well: the resolution of the bins is good enough that the pooled estimator replicates the behavior of the global estimator to a high level of precision. In sum, the choice of K does not appear to be a critical issue in a really 'big data' setting. This is fortunate, as a heavy-duty tuning algorithm to determine K would increase the computational costs that one is trying to avoid in the first place. We also noted that changing to K ∼ N 1/3 induces significant bias in our estimators (which is compatible with our observations in the previous paragraph).
As far as inference on the parameters of interest is concerned, the limit distributions, especially in the heterogeneous data setting, contain several nuisance parameters which need to be estimated. In particular, the estimation of μ in the isotonic regression problem is known to be difficult. One possibility in the isotonic regression problem is to use the likelihood ratio test for testing H 0 : μ(t 0 ) = θ 0 using the data at the central server, along the lines of the ideas developed in [5] and [4]. We believe that at least in the homogeneous setting, i.e. when the data across the different servers are i.i.d., this likelihood ratio statistic will be asymptotically pivotal. It is possible that pivotality also holds under the general heterogeneous framework of this paper, but this would require further investigation. A comprehensive treatment of effective inference strategies would be an exciting topic for future research. We note that likelihood ratio statistics in heterogeneous massive data settings, albeit in a different genre of problems, have been studied elsewhere in the literature; see e.g. [13].
We believe that similar estimators can be proposed for distributed convex regression. For convex regression, a BDS-type estimator is expected to fail completely, since the global convex least squares estimator is itself asymptotically biased, as suggested by the extensive simulation experiments in [2]. However, a convexified regressogram estimator in the spirit of the one considered in this paper ought to be able to recover the properties of the global convex LS estimator provided K is selected appropriately: we conjecture that in the convex case K should satisfy N 1/5 = o(K). This could provide a possible avenue for future research.

Proof of Theorem 3.1
The proof of the above theorem relies on a number of preliminary results which are presented next. In the remainder of this section, we assume that assumptions (A1) to (A4) are always satisfied (though some results may require only a subset of these assumptions). Additional assumptions will be imposed when required.
The proof of this lemma is long and technical and is available in the Appendix. The next result gives a polynomial tail bound on the estimation error U N (a) − g(a) over a high-probability set that is eventually used to bound the MSE.
Since x ≥ N −1/3 , combining this with Lemma 6.1 shows that there exists c > 0 that depends only on C 1 , C 3 such that with X = {x 0 , . . . , x K }, for N ≥ N 0 . The above probability is less than or equal to Let P X denote the conditional probability given X 1 , . . . , X N . By definition, for all u ∈ {x 0 , . . . , x K } we have The process M N can be extended to all u ∈ R using the same definition as above. Then, M N is a centered martingale under P X that satisfies for all u ≤ v, using that E X (ε 2 i ) ≤ σ 2 for all i by assumption. Hence, it follows from the Doob inequality that for all k ≥ 0, Taking the expectation on both sides of the preceding inequality yields for large enough N that where C 2 is taken from (3.2). For the penultimate inequality, we used that for all k provided that N is sufficiently large. Plugging the previous inequality into (6.2) we obtain that for sufficiently large N , Since C := Σ k≥0 2 −3k is finite, we conclude that

Similar arguments show that
and therefore, The lemma follows.
We are now ready to prove the theorem.
for N sufficiently large. On the other hand, it follows from the Fubini theorem that For the last inequality, we used (6.1) together with the fact that a probability cannot be larger than one. Hence, Combining with (6.5) yields which completes the proof of the theorem (by taking C to be 2 + 2C) where C is the constant from Lemma 6.2.

Lemma 7.2. There exists C > 0 and
The proof is available in the Appendix.
In the sequel, we denote by P f the underlying probability when the distributions of the observations are such that the true density f X of F X defined in (3.1) is equal to f . Lemma 7.3. There exists C > 0 and N 0 > 0 such that for all f ∈ F 1 , a ∈ R, x > 0, and N ≥ N 0 Proof of Lemma 7.3. The inequality in the lemma is obvious for x ∈ (0, N −1/3 ) since for such x's, it suffices to choose C ≥ 1 so that the right-hand side is larger than one. Hence, in the sequel we consider x ≥ N −1/3 . It follows from the definition of θ N and Lemma 7.1 that the event {θ N − θ 0 ≥ x} is included in the event that there exists u ∈ [0, 1] such that both u − r 0K and u + r 0K belong to the set {x 0 , . . . ,x K }, u − θ 0 ≥ x and As Λ N matches F N on the set {x 0 , . . . ,x K }, the function Λ N can be replaced by F N in the above display. We have (7.2) where K −1 ≤ N −1/3 ≤ x and therefore, we can assume without loss of generality that |r 0K − r 0 | ≤ x/2. Hence, the event This implies that the event Hence, where for all k ≥ 0, A k is the event that there exists u ∈ [0, 1] such that u − θ 0 ∈ [x2 k /2, x2 k+1 /2], and and B k is the event that there exists v ∈ [0, 1] such that v−θ 0 ∈ [x2 k /2, x2 k+1 /2], and We will deal with the first sum on the right-hand side of (7.6), the second sum being similar.
Combining this with Lemma 7.2 and the Markov inequality, we conclude that Since Σ k≥0 2 −3k is finite, we conclude that there exists C > 0 such that Similar arguments show that the same inequality holds with A k replaced by B k so we conclude from (7.6) that there exists C > 0 such that It can be proved similarly that and the lemma follows.
Proof of Theorem 4.1. It follows from the Fubini theorem that for all f ∈ F 1 and N ≥ N 0 , For the last inequality, we used (7.5) together with the fact that a probability cannot be larger than one. Hence, which completes the proof of the Theorem (by taking C to be 1 + 2C).

A.1. Preparatory lemmas
and P sup for all N and x > 0.
Proof. Let F Xj denote the common distribution function of the X i 's from sample j and denote by (X ji , Y ji ), i = 1, . . . , n j the observations from sample j. It follows from the triangle inequality that where we recall that Σ m j=1 n j = N . Hence, for all x > 0 we have P sup Since for all fixed j, the random variables X ji , i = 1, . . . , n j are i.i.d. with distribution function F Xj , it follows from Corollary 1 in [15] that P sup Combining the two preceding displays completes the proof of (A.1). Now, consider (A.2). Since f X is supported on [0, 1], both F −1 N and F −1 X take values in [0, 1] so the sup-distance between those functions is less than or equal to one. This means that the probability on the left-hand side of (A.2) is equal to zero for all x ≥ 1. Hence, it suffices to prove (A.2) for x ∈ (0, 1). As is customary, we use the notation y + = max(y, 0) and y − = − min(y, 0) for all real numbers y. This means that |y| = max(y − , y + ). Recall the switching relation for the empirical distribution and empirical quantile functions: for arbitrary a ∈ [0, 1] and t ∈ [0, 1], we have F −1 N (t) ≤ a if and only if t ≤ F N (a). For all x ∈ (0, 1) we then have P sup .
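The switching relation invoked here can also be checked numerically. The sketch below (hypothetical helper names) uses the convention that the empirical quantile function is the generalized inverse inf{u : F̂(u) ≥ t}, and verifies on a random sample that F̂ −1 (t) ≤ u holds exactly when t ≤ F̂(u):

```python
import numpy as np

def ecdf(x, u):
    """Empirical distribution function of the sample x, evaluated at u."""
    return np.mean(np.asarray(x) <= u)

def ecdf_inv(x, t):
    """Empirical quantile: smallest order statistic s with ecdf(x, s) >= t."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    k = min(max(int(np.ceil(t * n)), 1), n)  # need ceil(t*n) points at or below
    return xs[k - 1]

# numerical check of the switching relation on a random sample
rng = np.random.default_rng(0)
x = rng.uniform(size=25)
for t in np.linspace(0.01, 0.99, 20):
    for u in np.linspace(0.0, 1.0, 20):
        assert (ecdf_inv(x, t) <= u) == (t <= ecdf(x, u))
```

The equivalence holds exactly because the empirical distribution function is right-continuous and piecewise constant, which is what the proof exploits when passing between F N and F −1 N .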
For the last equality, we use the fact that For the last inequality, we used (A.1). On the other hand, for all x ∈ (0, 1) we have P sup using the change of variable u = F −1 X (t). Hence, with the switching relation we obtain P sup using that F X (u − x) < F X (u) − C 1 x for all x ∈ (0, 1) and u ∈ (x, 1). Using again (A.1) together with the change of variable v = u − x, we arrive at P sup Combining the previous display with (A.4) completes the proof of (A.2) since |y| ≤ y − + y + for all y ∈ R.
Proof. It follows from the Fubini theorem that E sup Combining this with (A.1) and the fact that a probability cannot be larger than one then yields E sup for sufficiently large N , where we used (3.4) for the last inequality. The result follows by computing the integral on the right-hand side.

A.2. Proof of Theorem 3.2
Denote by F N the step function on [0, 1] such that F N (x k ) = F N (x k ) for all k = 0, . . . , K, and F N is constant on all intervals [x k−1 , x k ) for k = 1, . . . , K. We denote by F −1 N the corresponding inverse function: The following lemma provides tail bound probabilities for V N .
for all i and some p ≥ 2, with probability one. Assume that F X has a density function f X on [0, 1] that satisfies (3.2) for some positive numbers C 1 , C 2 . Then, there exists C > 0 that depends only on p, C 2 and σ such that for all a > μ(0) and for all a < μ(1).
A proof of this lemma follows the proof of the main theorem.
Proof of Theorem 3.2. Similar to the proof of Theorem 3.1 for the inverse problem, we would like to restrict ourselves to the event E N from Lemma 6.1, where θ can be chosen arbitrarily large. However, we do not have an analogue of (6.5) for the direct problem since μ N is not bounded as is U N . Hence, we first prove that μ N remains bounded by a power of N apart possibly on a negligible set. For this task, consider an arbitrary A > 0 such that A + μ(0) > 0, and note that for all t ∈ [0, 1], and all non-increasing functions μ on [0, 1], we have . Hence, it follows from the Fubini theorem that for all non-increasing μ ∈ F 1 , Note that if μ N (0) > y for some y ∈ R, then for the inverse we must have U N (y) > 0. Since U N can only assume values in the set of jump points of μ N it is of the form x k = k/K for some k ≥ 1. Next, μ N can have jumps only at those x k where F N has a jump, i.e. F N (x k ) > F N (x k−1 ). Since the size of a jump of F N is at least N −1 , we must have F N (x k ) ≥ N −1 and therefore, implying that for all μ ∈ F 1 , With C taken from Lemma A.3 where it is assumed that p > 2, we arrive at With A = N (3p−2)/(6(p−2)) , this proves that there exists C > 0 that depends only on σ, p, C 2 and C 5 such that for all t ∈ [0, 1] and μ ∈ F 1 . Now, with A = N (3p−2)/(6(p−2)) , where C depends only on σ, p, C 2 , and C 5 . This enables us to restrict to the event E N of Lemma 6.1, provided that θ is chosen sufficiently large in the lemma. Indeed, with θ > (5p − 6)/(3(p − 2)), the previous inequality implies that with A = N (3p−2)/(6(p−2)) and N sufficiently large, for all t ∈ [0, 1] and all μ ∈ F 1 . It can be shown similarly that for N sufficiently large, for all t ∈ [0, 1] and all μ ∈ F 1 , implying that for all t ∈ [0, 1]. Hence, it now suffices to prove that there exists C > 0 that depends only on σ, p, To prove this, fix μ ∈ F 1 arbitrarily, and invoke the Fubini theorem to obtain that using the switching relation (2.4) for the last equality.
We split the above integral into the sum of two integrals and first consider With C 4 taken from the definition of F 1 we have for all t ∈ [0, 1] and y ∈ [0, μ(0) − μ(t)]. Combining Lemma 6.2 with the fact that a probability cannot be larger than one then yields Next, Lemma 6.2 yields where the first term on the right-hand side is equal to for sufficiently large N , for all t ≥ δ and μ ∈ F 1 . Using the connection (A.6) between U N and V N yields From the proof of Lemma 6.1, it can be seen that on E N we have for all t ∈ [δ, 1 − δ], provided that N is sufficiently large. Hence, it follows from Lemma A.3, where it is assumed that p > 2, that For the integral on the right-hand side we have Hence, we can find C that depends only on p, σ, C 1 , . . . , C 5 such that for sufficiently large N , for all μ ∈ F 1 and t ∈ [δ, 1 − δ].
Combining this with (A.9) and (A.8) proves that there exists C > 0 that depends only on σ, p, It can be proved similarly that = μ • g(a). Instead, we will use that F N (0) = 0 and F N (1) = 1. Since f (a, u) ≥ 0 for all a, u (A.22) yields for all a ∈ [μ(1), μ(0)] and u ∈ {x 0 , . . . , x K }. Since Λ N (x 0 ) − a F N (x 0 ) = 0, it follows from the definition of V N that the following inequalities hold for all x > 0 and a > μ(0): ) takes the form (6.3). The first inequality in (A.10) then yields Let p ≥ 2 and σ > 0 such that E[ε p i |X i ] ≤ σ p for all i, almost surely. The process M N is a centered martingale under P X which, according to Theorem 3 in [18], satisfies for all u ∈ [0, 1] and A p = (p/2) p/2 2 p+p 2 /4 . For the penultimate inequality, we used that E X |ε i | 2 ≤ (E X |ε i | p ) 2/p thanks to the Hölder inequality whereas for the last inequality, we used that N ≤ N p/2 and F p/2 N (u) ≤ F N (u). Combining the two preceding displays with the Doob inequality yields that for all x > 0, With C 2 taken from (3.2) we conclude that for all x > 0, which proves the first assertion. For the second assertion, since x K = F N (x K ) = 1, we write for a < μ(1) and The first inequality in (A.10) then yields that and we use the Doob inequality, similarly to the above. Details are omitted.

A.3. Proof of Theorem 3.3
It follows from (2.3) together with Lemma 6.2 that with probability tending to one, where v N is an arbitrary sequence that diverges to infinity as N → ∞.
In the sequel, we consider a sequence v N such that v N ≤ log N for all N . Hence, with probability tending to one, N 1/3 (U N (a) − g(a)) is the location of the maximum over u ∈ H N of We extend M N and e(a, . ) as constant functions in between two consecutive points in H N so that Now, since a = μ(t) + N −1/3 x for some fixed x ∈ R and t ∈ (0, 1), and g ′ = 1/μ ′ • g on (μ(1), μ(0)) is bounded by assumption, we have where the small o-term is uniform, by continuity of μ over the compact interval It can be seen from the proof of Lemma 6.1 that (A.26) holds on the event E N , whose probability tends to one as N → ∞, implying that uniformly over u ∈ H N . Since (A.24) holds for all a ∈ [μ(1), μ(0)], combining the two preceding displays yields e(a, g(a) Next, we invoke (A.31), which holds on the event E N uniformly over a and u, to conclude that e(a, g(a) uniformly over u ∈ H N , provided that v N ≤ min{log N ; N −1/3 K}. By assumption, N −1/3 K diverges to infinity as N → ∞, so we can find a sequence v N that satisfies the above condition and that diverges to infinity as N → ∞, as required in the definition of H N . In the sequel, we consider a sequence v N that satisfies the above conditions and, in addition, the following condition: Note that by assumption, the right-hand side of the inequality in the above display diverges to infinity as N → ∞, which ensures existence of such a sequence v N . We then have e(a, g(a) + N −1/3 u) using that for u ≥ 0 (and similarly for u ≤ 0), thanks to (A.13), (A.15) and the assumption that v N ≤ log N . Similarly, and therefore, e(a, g(a) + N −1/3 u) Hence we obtain On the other hand, with where M N is as defined in (6.3) for all u ∈ {x 0 , . . . , x K }, we have where we denote by (X ji , Y ji ), i = 1, . . . , n j the observations from sample j, for j = 1, . . . , m, and ε ji = Y ji − μ(X ji ).
Note that the process Z N is centered and has been extended to R by being constant in between two consecutive points in where the last equality is obtained by conditioning with respect to X ji and using that u ≥ v ≥ 0. With u, v fixed, this implies that where ω(δ) → 0 as δ → 0 by assumption, and Hence, for all fixed real numbers u ≥ v ≥ 0. It follows that since ω(δ) → 0 as δ → 0. The Jensen inequality for conditional expectation combined with Assumption Ã4 shows that σ 2 j (t) ≤ σ 2 for all j and t and therefore, We conclude that for all The case of negative u and v can be treated likewise and therefore, cov(Z N (u), Z N (v)) converges to σ 2 ∞ (t)(|u| ∧ |v|) if uv ≥ 0. It can be seen similarly that it converges to zero if uv < 0 (hence u and v have different signs). Hence, the covariance converges to σ 2 ∞ (t)cov(W (u), W (v)), so we conclude from the Lindeberg-Feller theorem that jointly, Let π be the permutation such that the X π(j) are ordered in j, that is X π(1) < · · · < X π(N ) a.s. Let P X denote the conditional probability given X 1 , . . . , X N . Since ε π(1) , . . . , ε π(N ) are centered and independent under P X , the process , t ≤ kδ} is a reverse centered martingale conditionally on X 1 , . . . , X N , for all k. Hence, it follows from the Doob inequality that for all k, Note that the inequalities above are first obtained for the conditional probabilities and then integrated over the distribution of X for the unconditional. Now, it follows from the Rosenthal inequality, see [18], that for all k and a constant C that depends only on p, we have Here, we use that N −1/3 kδ and N −1/3 (k + 2)δ both belong to H N , and p is taken from Assumption Ã4.
Hence, with f X taken from (3.5) we have It follows from Assumption Ã1 that f X is bounded by a constant A that does not depend on N and therefore, for N sufficiently large. Arguing similarly for E|Z N (kδ) − Z N ((k − 2)δ)| p we conclude from (A.20) that there exists C > 0 that depends only on p and A such that for all k. Summing up this inequality over all k on the right-hand side of (A.19), we obtain that there exists C > 0 that depends only on p and A such that for all δ > 0 and ε > 0, N 1/3 (U N (a) − g(a)) is the location of the maximum of a process that weakly converges to the continuous Gaussian process The above process achieves its maximum at a unique point T by Lemma 2.6 of [11], and it follows from Lemma 6.2 that N 1/3 (U N (a) − g(a)) is uniformly tight. Hence, Corollary 5.58 in van der Vaart shows that N 1/3 (U N (a)−g(a)) converges in distribution to T. Now, T is also the unique location of the maximum of the process Changing scale in the Brownian motion finally shows that T has the same distribution as Z, which completes the proof.

A.4. Proof of Lemma 6.1
For all a ∈ R and u ∈ {x 0 , . . . , x K }, Now, X i ≠ [Kg(a)]K −1 for all i, almost surely, since X i has a continuous distribution function, so (3.3) implies that implying that with a decreasing function μ, where Using again (3.3), we obtain that for all a ∈ [μ(1), μ(0)], On the other hand, since F X has a bounded derivative that satisfies (3.2) we have for sufficiently large N , using that K −1 = o(N −1/3 ) whereas |u − g(a)| ≥ N −1/3 for the last inequality. Next, since m ≤ N , it follows from (A.1) in the Appendix that P sup we conclude from the previous display that Hence, the inequality in (A.28) holds for all a ∈ R and u ∈ {x 0 , . . . , x K }. Using uniformly over all a and u such that |u − g(a)| ≥ N −1/3 . Hence, it suffices to prove that with f (a, u) taken from (A.23), there exists c̃ > 0 that depends only on C 1 such that on an event E N whose probability is larger than 1 − N −θ , and such that E N ⊂ E N , we have In the sequel, we consider It follows from (A.27) and (A.30) that 1 − P(E N ) ≲ N −θ so in particular, P(E N ) ≥ 1−N −θ for sufficiently large N . It remains to show that (A.29) holds on E N . Since X 1 , . . . , X N are independent with a continuous distribution function, they are all distinct from each other and for all i, there exists a (unique) random j such that X i = F −1 N (j/N ), where F −1 N is the empirical quantile function corresponding to X 1 , . . . , X N . Hence, reordering the terms in the sum in (A.23), we obtain that f (a, u) Using that F −1 N is constant on all intervals ((i − 1)N −1 , iN −1 ] we arrive at It follows that Now, on E N we also have for all x lying between F N (u) and F N ([Kg(a)]K −1 ). For such x's, we obtain on E N that for all a and u, for sufficiently large N . Therefore, with K ≥ 1 we obtain on E N that Therefore, for any c < C 3 c̃, for all sufficiently large N , e(a, u) ≤ −c(g(a) − u) 2 on E N . This completes the proof of the lemma.

A.5. Proof of Lemma 7.2
The proof rests on the following proposition.
Proposition P. Let F be a class of functions from X to R and let X 1 , X 2 , . . . , X n be independent (but not necessarily identically distributed) random variables defined on X . Let F be a measurable envelope for the class F and assume that E(F 2 (X i )) < ∞ for each 1 ≤ i ≤ n. Define and let G n F := sup f ∈F |G n f |. Then, for p ≥ 2, there exists A p > 0 that depends on p only such that where E * denotes outer expectation and J (1, F) is taken from Section 2.14.1 of [23].
The proof of the Proposition P is essentially the same as the proof of Theorem 2.14.1 in [23], by noting that the steps in the proof remain valid even if the X i 's are not i.i.d but are independent with potentially different distributions. Hence, the proof is omitted.
We now apply the above proposition with n replaced by N , F := {f w (·) : |w| ≤ x} where f w (t) = 1(t ≤ u 0 + w) − 1(t ≤ u 0 ) with envelope function F (t) = 1(t ≤ u 0 + x) − 1(t ≤ u 0 − x), and p = 2. The supremum in G N F is taken over a finite set and hence, it is measurable. This implies that the outer expectation can be replaced by an expectation in Proposition P. Moreover, the class of functions under consideration is a VC class of dimension 3 and therefore J (1, F) < ∞. Note that E(F 2 (X i )) ≤ C 2 x since the densities of the X i 's satisfy (4.2). The assertion of the lemma now follows directly from Proposition P upon dividing both sides of the inequality in the proposition by N .

A.6. Limited simulation results
Simulation results are presented for the following setting. The model is Y = X 2 + ε with X ∼ Unif(0, 1) and ε ∼ N (0, 1) independently of X. We are interested in estimating μ(x) ≡ x 2 at the point x 0 = 0.5. The limit distribution of the global isotonic estimator, i.e. the limiting law of N 1/3 (μ G (x 0 ) − μ(x 0 )), is given, for example, in Equation (1.2) of BDSE. Note that the setting considered in the simulations corresponds to the homogeneous case, i.e. the case where m = 1.
We took N = 10 6 and generated 5000 replicates from the distribution of the global isotonic estimator, and the isotonic estimator formed by the binning procedure with K = N 1/3 log N . Selected quantiles from the empirical distributions of N 1/3 (μ G (x 0 ) − μ(x 0 )) and N 1/3 (μ binned (x 0 ) − μ(x 0 )) [based on the 5000 replicates] were then compared to the quantiles of the limit distribution, which were generated by plugging in the true value of κ that appears in (1.2) of BDSE and using the numerical values of the quantiles of Chernoff's distribution provided in [9]. The QQ-plots are given in the left and right panels of the first figure, respectively: both panels indicate close agreement of the empirical distributions with the truth. In general, the prescription K = N 1/3 log N does quite well when N is on the order of millions or larger. Smaller N 's are not terribly interesting from the big data perspective and were not considered. For smaller N , say in the thousands, much more fine-tuning would be needed to find an adequate K, but this is not too relevant to the problem this paper seeks to address.
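A scaled-down version of this experiment can be sketched as follows (a single replicate at a much smaller N , with hypothetical helper names; the model Y = X 2 + ε, the point x 0 = 0.5, and the prescription K ≈ N 1/3 log N are taken from the text, and the monotone fit is increasing since μ(x) = x 2 is increasing):

```python
import numpy as np

def pava_increasing(y, w):
    """Weighted pool-adjacent-violators algorithm for a non-decreasing fit."""
    vals, wts, lens = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); lens.append(1)
        # merge adjacent blocks while they violate the increasing order
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / wtot
            wts[-2] = wtot
            lens[-2] += lens[-1]
            vals.pop(); wts.pop(); lens.pop()
    return np.repeat(vals, lens)

def binned_isotonic_at(x, y, K, x0):
    """Isotonized regressogram on the bins ((k-1)/K, k/K], evaluated at x0."""
    bins = np.clip(np.ceil(x * K).astype(int), 1, K)       # bin index in 1..K
    C = np.bincount(bins, minlength=K + 1)[1:]             # counts per bin
    T = np.bincount(bins, weights=y, minlength=K + 1)[1:]  # response sums
    means = np.divide(T, C, out=np.zeros(K), where=C > 0)
    fit = pava_increasing(means, np.maximum(C, 1e-12))
    k0 = min(int(np.ceil(x0 * K)), K)
    return fit[k0 - 1]

rng = np.random.default_rng(1)
N = 20_000
K = int(N ** (1 / 3) * np.log(N))   # K ~ N^{1/3} log N, as in the text
x = rng.uniform(size=N)
y = x ** 2 + rng.normal(size=N)     # Y = X^2 + eps, eps ~ N(0, 1)
est = binned_isotonic_at(x, y, K, 0.5)  # should be close to mu(0.5) = 0.25
```

This toy run only illustrates the mechanics at a smaller N ; with N on the order of millions, as in the reported study, the binned estimator tracks the global isotonic fit closely.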
The second and third figures show the QQ-plots when we use K = N 1/3 and K = 2N 1/3 , respectively. In both cases, the empirical distribution deviates systematically from the Chernoff limit, with the deviation in the former case much more pronounced owing to a larger bias.