Contraction and uniform convergence of isotonic regression

We consider the problem of isotonic regression, where the underlying signal $x$ is assumed to satisfy a monotonicity constraint, that is, $x$ lies in the cone $\{ x\in\mathbb{R}^n : x_1 \leq \dots \leq x_n\}$. We study the isotonic projection operator (projection to this cone), and find a necessary and sufficient condition characterizing all norms with respect to which this projection is contractive. This enables a simple and non-asymptotic analysis of the convergence properties of isotonic regression, yielding uniform confidence bands that adapt to the local Lipschitz properties of the signal.


Introduction
Isotonic regression is a powerful nonparametric tool used for estimating a monotone signal from noisy data. Specifically, our data consists of observations y 1 , . . . , y n ∈ R, which are assumed to be noisy observations of some monotone increasing signal-for instance, we might assume that E [y 1 ] ≤ · · · ≤ E [y n ]. Isotonic (least-squares) regression solves the optimization problem Minimize y − x 2 2 subject to x 1 ≤ · · · ≤ x n in order to estimate the underlying signal.
This regression problem can be viewed as a convex projection, since K iso = {x ∈ R n : x 1 ≤ · · · ≤ x n } is a convex cone. We will write iso(y) := P K iso (y) = arg min x∈R n y − x 2 2 : x 1 ≤ · · · ≤ x n to denote the projection to this cone, which solves the least-squares isotonic regression problem. This projection can be computed in finite time with the Pool Adjacent Violators Algorithm (PAVA), developed by Barlow et al. [2].
In fact, mapping y to iso(y) is known to also solve the isotonic binary regression problem. This arises when the data is binary, that is, y ∈ {0, 1} n . If we assume that y i ∼ Bernoulli(x i ), then the constrained maximum likelihood estimator is exactly equal to the projection iso(y) (Robertson et al. [20]).
In this paper, we examine the properties of the isotonic projection operator x → iso(x), with respect to different norms · on R n . We examine the conditions on · needed in order to ensure that x → iso(x) is contractive with respect to this norm, and in particular, we define the new "sliding window norm" which measures weighted averages over "windows" of the vector x, i.e. contiguous stretches of the form (x i , x i+1 , . . . , x j−1 , x j ) for some indices 1 ≤ i ≤ j ≤ n. This sliding window norm then provides a tool for analyzing the convergence behavior of isotonic regression in a setting where our data is given by y i = x i + noise. If the underlying signal x is believed to be (approximately) monotone increasing, then iso(y) will provide a substantially better estimate of x than the observed vector y itself. By using our results on contractions with respect to the isotonic

Background
There is extensive literature studying convergence rates of isotonic regression, in both finite-sample and asymptotic settings. For an asymptotic formulation of the problem, since the signal x ∈ R n must necessarily change as n → ∞, a standard method for framing this as a sequence of problems indexed by n is to consider a fixed function f : [0, 1] → ∞, and then for each n, define x i = f (i/n) (or more generally, x i = f (t i ) for points t i that are roughly uniformly spaced). Most models in the literature assume that y i = x i + σ · i , where the noise terms i are i.i.d. standard normal variables (or, more generally, are zero-mean variables that satisfy some moment assumptions or are subgaussian).
One class of existing results treats global convergence rates, where the goal is to bound the error x − iso(y) 2 , or more generally to bound x − iso(y) p for some p norm. The estimation error under the 2 norm was studied by Van de Geer [21], Wang and Chen [23], Meyer and Woodroofe [17], among others. Van de Geer [22] obtains the asymptotic risk bounds under under Hellinger distance, whereas Zhang [25] establishes the nonasymptotic risk bounds for general p norm. Recent work by Chatterjee et al. [8] considers non-asymptotic minimax rates for the estimation error, focusing specifically on x − x 2 for any estimator x to obtain a minimax rate. Under a Gaussian noise model, they prove that the minimax rate scales as x − x 2 (n/ log(n)) −1/3 over the class of monotone and Lipschitz signals x, which matches the error rate of the constrained maximum likelihood estimator (i.e. the isotonic least-squares projection, iso(y)). They also study minimax rates in a range of settings, including piecewise constant signals, which we will discuss later on. 1 A separate class of results considers local convergence rates, where the error at a particular index, i.e. x i − iso(y) i for some particular i, may scale differently in different regions of the vector. In an asymptotic setting, where n → ∞ and the underlying signal comes from a function f : [0, 1] → R, we may consider an estimator f : [0, 1] → R, where f (t) is estimated via iso(y) i for t ≈ i/n. Results in the literature for this setting study the asymptotic rate of convergence of f (t) − f (t) , which depends on the local properties of f near t. Brunk [5] establishes the convergence rate as well as the limiting distribution when f (t) is positive, whereas Wright [24] generalizes the result to the case of t lying in a flat region, i.e. f (t) = 0. Cator [7] shows that the isotonic estimator adapts to the unknown function locally and is asymptotically minimax optimal for local behavior. Relatedly, Dümbgen [10] gives confidence bands for a related problem, that of estimating an increasing drift in a Brownian motion, by taking averages over windows of the data curve, i.e. ranges of the form [t 0 , t 1 ] near the point t of interest.
In addition, many researchers have considered the related problem of monotone density estimation, where we aim to estimate a monotone decreasing density from n samples drawn from that distribution. This problem was first studied by Grenander [13], and has attracted much attention since then, see Rao [18], Groeneboom [14], Birgé and Massart [3], Carolan and Dykstra [6], Balabdaoui et al. [1], Jankowski [16], among others. Durot et al. [11] shows that, for Lipschitz and bounded densities on [0, 1], asymptotically the error rate for estimating f (t) scales as (n/ log(n)) −1/3 , uniformly over all t bounded away from the endpoints-the same rate as the isotonic regression problem. Adaptive convergence rates are studied by Cator [7]. Later we will show that our results yield a non-asymptotic error bound for this problem as well, which matches this known rate.
Several related problems for isotonic regression have also been studied. First, assuming the model y i = x i + σ · i for standard normal error terms i , estimating σ has been studied by Meyer and Woodroofe [17], among others. Estimators of σ for general distribution of i are also available, see Rice [19], Gasser et al. [12]. We discuss the relevance of these tools for constructing our confidence bands in Section 4. Second, we can hope 06.18.17 that our estimator iso(y) can recover x accurately only if x itself is monotone (or approximately monotone); thus, testing this hypothesis is important for knowing whether our confidence band can be expected to cover x itself or only its best monotone approximation, iso(x). Drton and Klivans [9] study the problem of testing the null hypothesis x ∈ K iso (or more generally, whether the signal x belongs to some arbitrary pre-specified cone K), based on the volumes of lower-dimensional faces of the cone (see Drton and Klivans [9, Theorem 2 and Section 3]).
Main contributions In the context of the existing literature, our main contributions are: (1) the new analysis of the contraction properties of isotonic projection, and the specific example of the sliding window norm, and (2) clean, finite-sample estimation bounds for isotonic regression, which are locally adaptive to the local Lipschitz behavior of the underlying signal, and match known asymptotic convergence rates. The contraction and sliding window norm allow us to prove the isotonic regression convergence results in just a few simple lines of calculations, while the arguments in the existing literature are generally substantially more technical (for example, approximating the estimation process via a Brownian motion or Brownian bridge).

Contractions under isotonic projection
In this section, we examine the contractive behavior of the isotonic projection, with respect to various norms on R n . Since this operator projects x onto a convex set (the cone K iso of all ordered vectors), it is trivially true that but we may ask whether the same property holds when we consider norms other than the 2 norm.
Formally, we defined our question as follows: Definition 1. For a seminorm · on R n , we say that isotonic projection is contractive with respect to · if We recall that a seminorm must satisfy a scaling law, c · x = |c| · x , and the triangle inequality, x + y ≤ x + y , but may have x = 0 even if x = 0. From this point on, for simplicity, we will simply say "norm" to refer to any seminorm.
For which types of norms can we expect this contraction property to hold? To answer this question, we first define a simple property to help our analysis: For a norm · on R n , we say that · is nonincreasing under neighbor averaging (NUNA) if . , x n ≤ x for all x ∈ R n and all i = 1, . . . , n − 1.
Our first main result proves that the NUNA property exactly characterizes the contractive behavior of isotonic projection-NUNA is both necessary and sufficient for isotonic projection to be contractive.
Theorem 1. For any norm · on R n , isotonic projection is contractive with respect to · if and only if · is nonincreasing under neighbor averaging (NUNA).
(The proof of Theorem 1 will be given in Section 6.) In particular, this theorem allows us to easily prove that isotonic projection is contractive with respect to the p norm for any p ∈ [1, ∞], and more generally as well, via the following lemma: 06.18.17 Lemma 1. Suppose that · is a norm that is invariant to permutations of the entries of the vector, that is, for any x ∈ R n and any permutation σ on {1, . . . , n}, x = x σ where x σ := (x σ(1) , . . . , x σ(n) ).
(In particular, the p norm, for any p ∈ [1, ∞], satisfies this property.) Then · satisfies the NUNA property, and therefore isotonic projection is a contraction with respect to · .
Proof of Lemma 1. Let σ swap indices i and i + 1, so that x σ = (x 1 , . . . , x i−1 , x i+1 , x i , x i+2 , . . . , x n ). Then where we apply the triangle inequality, and the assumption that x σ = x . This proves that · satisfies NUNA. By Theorem 1, this implies that isotonic projection is contractive with respect to · .

The sliding window norm
We now introduce a sliding window norm, which will later be a useful tool for obtaining uniform convergence guarantees for isotonic regression. For any pair of indices 1 ≤ i ≤ j ≤ n, we write i : j to denote the stretch of j − i + 1 many coordinates indexed by {i, . . . , j}, Fix any function The sliding window norm is defined as where x i:j = xi+···+xj j−i+1 denotes the average over the window i : j. The following key lemma proves that our contraction theorem, Theorem 1, can be applied to this sliding window norm. (This lemma, and all lemmas following, will be proved in Appendix A.2.) Lemma 2. For any function ψ satisfying the conditions (1), the sliding window norm · SW ψ satisfies the NUNA property, and therefore, isotonic projection is contractive with respect to this norm.
This lemma is a key ingredient to our convergence analyses for isotonic regression. It will allow us to use the sliding window norm to understand the behavior of iso(y) as an estimator of iso(x), where y is a vector of noisy observations of some target signal x. In particular, we will consider the special case of subgaussian noise. The following lemma can be proved with a very basic union bound argument: Lemma 3. Let x ∈ R n be a fixed vector, and let y i = x i + σ i , where the i 's are independent, zero-mean, and subgaussian. Then taking and for any δ > 0, As a specific example, in a Bernoulli model, if the signal is given by x ∈ [0, 1] n and our observations are given by y i ∼ Bernoulli(x i ) (each drawn independently), then this model satisfies the subgaussian noise model with σ = 1.

Estimation bands
In this section, we will develop a range of results bounding our estimation error when we observe a (nearly) monotone signal plus noise. These results will all use the sliding window contraction result in Lemma 2 as the main ingredient in our analysis.
We begin with a deterministic statement that is a straightforward consequence of the sliding window contraction result: For any x, y ∈ R n , for all indices k = 1, . . . , n, Note that these two statements are symmetric; they are identical up to reversing the roles of x and y.
, where the first inequality uses the monotonicity of iso(x) while the second uses the definition of the sliding window norm along with the fact that iso( These bounds bound the difference between iso(x) and iso(y), computed using either y (as in (2)) or x (as in (3)). Thus far, the two results are entirely symmetrical-they are the same if we swap the vectors x and y.
We will next study the statistical setting where we aim to estimate a signal x based on noisy observations y, in which case the vectors x and y play distinct roles, and so the two versions of the bands will carry entirely different meanings. Before proceeding, we note that the above bounds cannot give results on x itself, but only on its projection iso(x). If x is far from monotonic, we cannot hope that the monotonic vector iso(y) would give a good estimate of x. We will consider a relaxed monotonicity constraint: we say that x ∈ R n is iso -monotone if (If x is monotonic then we can simply set iso = 0.) We find that iso corresponds roughly to the ∞ distance between x and its isotonic projection iso(x): With this in place, we turn to our results for the statistical setting.

Statistical setting
We will consider a subgaussian noise model, where x ∈ R n is a fixed signal, and the observation vector y is generated as where the i 's are independent, zero-mean, and subgaussian.
Lemma 3 proves that, in this case, setting ψ(m) = √ m yields x − y SW ψ ≤ 2σ 2 log n 2 +n δ with probability at least 1 − δ. Of course, we could consider other models as well, e.g. involving correlated noise or heavy-tailed noise, but restrict our attention to this simple model for the sake of giving an intuitive illustration of our results.
In order for this bound on the sliding window to be useful in practice, we need to obtain a bound or an estimate for the noise level σ. Under the Bernoulli model y i ∼ Bernoulli(x i ), we can simply set σ = 1. More generally, it may be possible to estimate σ from the data itself, for instance if the noise terms i are i.i.d. standard normal, Meyer and Woodroofe [17] propose estimating the noise level σ with the maximum likelihood estimator (MLE),

or the bias-corrected MLE given by
where c 1 is a known constant while df iso(y) is the number of "degrees of freedom" in the monotone vector iso(y), i.e. the number of distinct values in this vector. We next consider the two different types of statistical guarantees that can be obtained, using the two symmetric formulations in Theorem 2 above.

Data-adaptive bands
We first consider the problem of providing a confidence band for the signal x in a practical setting, where we can only observe the noisy data y and do not have access to other information. In this setting, the bound (2) in Theorem 2, combined with Lemma 3's bound on x − y SW ψ for the subgaussian model, yields the following result: Theorem 3. For any signal x ∈ R n and any δ > 0, under the subgaussian noise model (4), then with probability at least 1 − δ, for all k = 1, . . . , n, We emphasize that these bounds give us a uniform confidence band for iso(x) (or for x itself, if it is monotone) that can be computed without assuming anything about the properties of the signal; for instance, we do not assume that the signal is Lipschitz with some known constant, or anything of this sort. We only need to know the noise level σ, which can be estimated as discussed in Section 4.1. In this sense, the bounds are dataadaptive-they are computed using the observed projection iso(y), and adapt to the properties of the signal (for instance, if x is locally constant near k, then the upper and lower confidence bounds will be closer together).

Convergence rates
While the results of Theorem 3 give data-adaptive bounds that do not depend on properties of x, from a theoretical point of view we would also like to understand how the estimation error depends on these properties. For the data-adaptive bands, we used the result (2) relating iso(x) and iso(y), but for this question, we will use the symmetric result (3) instead, which immediately yields the following theorem.
Theorem 4. For any signal x ∈ R n and any δ > 0, under the subgaussian noise model (4), then with probability at least 1 − δ, for all k = 1, . . . , n, If additionally x is iso -monotone, then we also have Proof of Theorem 4. For the first bound (7), we simply subtract iso(x) k from the inequalities (3). For the second bound (8) in the case that x is approximately monotone, we instead subtract x k from (3), and also use the fact that x − iso(x) ∞ ≤ iso by Lemma 4, which implies that x k: and similarly Comparison to existing work In the monotone setting (i.e. x = iso(x)), Chatterjee et al. [8] derive related results bounding the pointwise error x k − iso(y) k . Specifically, they use the "minmax" formulation of the isotonic projection, iso(y) k = min j≥k max i≤k y i:j , and give the following argument: where the first step defines m = j − k + 1 and uses the "minmax" formulation, while the third uses the assumption that x is monotone. They then bound the error term (Err) in expectation. We can instead bound it as (Err) ≤ x−y SW ψ √ m , which is exactly the same as the upper bound in our result (8). Their "minmax" strategy can analogously be used to obtain the corresponding lower bound as well.

Locally constant and locally Lipschitz signals
If the signal x is monotone, Chatterjee et al. [8]'s results, which are analogous to our bounds in (8), yield implications for many different classes of signals: for instance, they show that for a piecewise constant signal x taking only s many unique values, the 2 error scales as We therefore see that for "most" indices k when the signal is piecewise constant.
We can instead consider a Lipschitz signal: we say that (Rescaling by n is natural as we often think of x i = f (i/n) for some underlying function f ). In this setting, our results in Theorem 4 immediately yield the bound where the term L(m−1) , in which case we obtain the bound for all m ≤ k ≤ n − m + 1. (For indices k nearer to the endpoints, we are forced to choose a smaller m, and the scaling will be worse.) We can also compute convergence rates in a more general setting, where the signal x is locally Lipschitz-its behavior may vary across different regions of the signal. As discussed in Section 1.1, many papers in the literature consider asymptotic local convergence rates-local in the sense of giving pointwise error bounds, which for a single signal x = (x 1 , . . . , x n ), may be larger for indices i falling within a region of the signal that is strictly increasing, and smaller for indices i falling into a locally flat region. We would hope to see some interpolation between the n −1/3 rate expected for a strictly increasing stretch of the signal, as in (11), versus the improved parametric rate of n −1/2 in a locally constant region as in (9).
Our confidence bands can also be viewed as providing error bounds that are local in this sense, i.e. that adapt to the local behavior of the signal x as we move from index i = 1 to i = n. To make this more precise, we will show how our bounds scale locally with the sample size n to obtain the n −1/3 and n −1/2 rates described above. Consider any monotone signal x. Suppose the signal x is locally constant near k, with x k−cn+1 = · · · = x k = · · · = x k+cn−1 for some positive constant c > 0. Then our bound (8) applied with m = cn yields For other indices, however, where the signal is locally strictly increasing with a Lipschitz constant L, then taking m ∼ σ 2 n log(n) L 2/3 yields the n −1/3 scaling obtained above in (11). It is of course also possible to achieve an interpolation between the n −1/2 and n −1/3 rates via our results, as well. 06. 18.17 Many works in the literature consider the local adaptivity problem in an asymptotic setting; here we will describe the results of Cator [7]. Consider an asymptotic setting where the signal x = (x 1 , . . . , x n ) comes from measuring (at n many points) a monotone function f : [0, 1] → R, and we are interested in the local convergence rate at some fixed t ∈ (0, 1). Cator [7] show that if the first α derivatives of f at t satisfy f (1) (t) = · · · = f (α−1) (t) = 0 and f (α) (t) > 0, the convergence rate for estimating f (t) scales as n −α/(2α+1) . In particular, if α = 1 (f is strictly increasing at t) then they obtain the n −1/3 rate seen before, while if α = ∞ (f is locally constant near t) then they obtain the faster parametric n −1/2 rate. Of course, any α in between 1 and ∞ will produce some power of n between these two. Our work can be viewed as a finite-sample version of these types of results.

Simulation: local adaptivity
To demonstrate this local adaptivity in practice, we run a simple simulation. The signal is generated from an underlying function f (t) defined over t ∈ [0, 1], with as illustrated in Figure 1(a). For a fixed sample size n, we set x i = f i n+1 and y i = x i + N (0, 1). We then compute a data-adaptive confidence band as given in (6), with known noise level σ = 1, and with iso = 0 as the signal x is known to be monotone. For sample size n = 1000, the resulting estimate iso(y) and confidence band are illustrated in Figure 1 Note that our data-adaptive estimation bands given by Theorem 3 are calculated without using prior knowledge of the signal's local behavior (locally constant / locally Lipschitz)-the confidence bands computed in Theorem 3 are able to adapt to this unknown structure automatically.

Density estimation
As a second application of the tools developed for the sliding window norm, we consider the problem of estimating a monotone nonincreasing density g on the interval [0, 1], using n samples drawn from this density.
Let Z 1 , . . . , Z n iid ∼ g be n samples drawn from the target density f , sorted into an ordered list Z (1) ≤ · · · ≤ Z (n) . The Grenander estimator for the monotone density g is defined as follows. Let G n be the empirical cumulative distribution function for this sample,   illustrated in Figure 3. It is known (Robertson et al. [20]) that g Gren can be computed with a simple isotonic projection of a sequence. Namely, for i = 1, . . . , n, let y i = n(Z (i) − Z (i−1) ) where we set Z (0) := 0, and calculate the isotonic projection iso(y). Then the Grenander estimator is given by If we assume that f is Lipschitz and lower-bounded, then our error bounds for isotonic regression transfer easily into this setting, yielding the following theorem (proved in Appendix A.1): Theorem 5. Let g : [0, 1] → [c, ∞) be a nonincreasing L-Lipschitz density, let Z 1 , . . . , Z n be i.i.d. draws from g, and define the Grenander estimator g Gren as in (13). Then for any δ > 0, if This result is similar to that of Durot et al. [11], which also obtains a (n/ log(n)) −1/3 convergence rate uniformly over t (although in their work, t is allowed to be slightly closer to the endpoints, by a log factor). Their results are asymptotic, while our work gives a finite-sample guarantee. As mentioned in Section 1.1, Cator [7] also derives locally adaptive error bounds whose scaling depends on the local Lipschitz behavior or local derivatives of f . Our locally adaptive results for sequences may also be applied here to obtain a locally adaptive confidence band on the density g, but we do not give details here.

Proof for contractive isotonic projection
In this section, we prove our main result Theorem 1 showing that, for any semi-norm, the nonincreasing-underneighbor-averaging (NUNA) property is necessary and sufficient for isotonic projection to be contractive under this semi-norm.
Before proving the theorem, we introduce a few definitions. First, for any index i = 1, . . . , n − 1, we define the matrix which averages entries i and i + 1. That is, We also define an algorithm for isotonic projection that differs from PAVA, and in fact does not converge in finite time, but is useful for the purpose of theoretical analysis. For any x ∈ R n and any index i = 1, . . . , n − 1, define In other words, if neighboring entries i and i + 1 violate the monotonicity constraint, then we average them.
The following lemma shows that, by iterating this step infinitely many times (while cycling through the indices i = 1, . . . , n − 1), we converge to the isotonic projection of x.
With this slow projection algorithm in place, we turn to the proof of our theorem.
Proof of Theorem 1. First suppose that · satisfies the NUNA property. We will prove that, for any x, y ∈ R n and any index i = 1, . . . , n − 1, If this is true, then by Lemma 5, this is sufficient to see that isotonic projection is contractive with respect to · , since the map x → iso(x) is just a composition of (infinitely many) steps of the form x → iso i (x). More concretely, defining x (t) and y (t) as in Lemma 5,(16) proves that x (t) − y (t) ≤ x (t−1) − y (t−1) for each t ≥ 1. Applying this inductively proves that x (t) − y (t) ≤ x − y for all t ≥ 1, and then taking the limit as t → ∞, we obtain iso(x) − iso(y) ≤ x − y . Now we turn to proving (16). We will split into four cases.
• Case 1: x i ≤ x i+1 and y i ≤ y i+1 . In this case, iso i (x) = x and iso i (y) = y, and so trivially, • Case 2: x i > x i+1 and y i > y i+1 . In this case, we have while all entries j ∈ {i, i + 1} are unchanged. Therefore, we can write Since · satisfies the NUNA property, therefore, Note that t ∈ [0, 1] by the definition of this case. A trivial calculation shows that This means that we have since · satisfies NUNA.
• Case 4: x i > x i+1 and y i ≤ y i+1 . By symmetry, this is equivalent to Case 3.
This proves (16), and therefore, is sufficient to show that isotonic projection is a contraction with respect to · . Now we prove the converse. Suppose that · does not satisfy NUNA. Then we can find some x and some i such that Without loss of generality we can assume x i ≤ x i+1 (otherwise simply replace x with −x-since · is a norm, we will have

Discussion
We study contraction properties of isotonic regression with an application on a novel sliding window norm. We then use these tools to construct data-adaptive estimation bands and obtain non-asymptotic uniform estimation bound of isotonic estimator. Our results are adaptive to the local behavior of the unknown signal, and can be used in a related density estimation problem. The analysis tools we developed are potentially useful for other shape-restricted problems.
We expect to apply our results on the high dimensional inference or calibration problems, where isotonic regression could serve as an important tool. Furthermore, ordering constraints, and more generally shape constraints, can take many different forms in various applications-for instance, two commonly studied constraints are isotonicity over a two-dimensional grid (in contrast to the one-dimensional ordering studied here), and convexity or concavity. It may be possible to extend the contraction results to a more general shape-constrained regression setting. We leave these problems for future work.

A Additional proofs A.1 Proof of density estimation result (Theorem 5)
Let G(t) = t s=0 g(s) ds be the cumulative distribution function for the density g. Since g(t) ≥ c everywhere, this means that G(t) is strictly increasing, and is therefore invertible. Using this lower bound on g, and the assumption that g is L-Lipschitz, we can furthermore calculate Let Note that x gives the difference in quantiles of the distribution, while y estimates these gaps empirically.
The following lemma (proved in Appendix A.2) gives a concentration result on the Z (i) 's: Then for any δ > 0, with probability at least 1 − δ, and From now on, we assume that these bounds (18) and (19) both hold. Plugging our definitions of x and y into these two bounds, this proves that x i:j − y i:j ≤ 3(j − i + 1) log((n 2 + n)/δ) + 2 log((n 2 + n)/δ) c · (j − i + 1) + 4L c 3 · log((n 2 + n)/δ) n for all 1 ≤ i ≤ j ≤ n. (If i = 1 then we use the bound (18) while if i > 1 then we use the bound (19).) Now, defining (Note that ψ is nondecreasing and i → i/ψ(i) is concave, as required by (1).) Next we check that x is a Lipschitz sequence. We have for some s, t ∈ [0, 1], by Taylor's theorem. The first-order terms cancel, and we know by (17) that (G −1 ) is bounded by L c 3 . Therefore, x is L c 3 -Lipschitz. Finally, x is monotone nondecreasing since g is a monotone nonincreasing density.
We then apply the calculations (10) for the Lipschitz signal setting (with x−y SW ψ ψ(m) taking the place of 2σ 2 log n 2 +n δ m , which was specific to the subgaussian noise model setting). We see that for any index m ≥ 1, max m≤k≤n−m+1 Set m = n log((n 2 + n)/δ) 2/3 . Since ∆ < 1 c+L ≤ 1 we know log((n 2 + n)/δ) ≤ n, so we can simplify the above bound to max m≤k≤n−m+1

A.2 Proofs of lemmas
Proof of Lemma 2. By Theorem 1, we only need to prove that · SW ψ satisfies NUNA. Fix any x ∈ R n and any index k = 1, . . . , n − 1. Let y = iso k (x). Note that we have Take any indices 1 ≤ i ≤ j ≤ n. We need to prove that |y i: • Case 1: if j < k or if i > k + 1, then neither of the indices k, k + 1 are included in the window i : j, and therefore x i:j = y i:j (i.e. all entries in the stretch of indices i : j are equal). So, • Case 2: If i ≤ k and j ≥ k + 1, then both indices k, k + 1 are included in the window i : j. Since y k + y k+1 = x k + x k+1 and all other entries of x and y coincide, we can trivially see that • Case 3: if i < k and j = k, then where the last inequality holds since i → i/ψ(i) is concave by assumption on ψ.
• Case 4: if i = k + 1 and j > k + 1, by symmetry this case is analogous to Case 3.
• Case 5: if i = j = k, then due to the assumption that ψ is nondecreasing.
• Case 6: if i = j = k + 1, then by symmetry this case is analogous to Case 5. Therefore, |y i:j | · ψ(j − i + 1) ≤ x SW ψ for all indices 1 ≤ i ≤ j ≤ n, and so y SW ψ ≤ x SW ψ , as desired.
Proof of Lemma 3. For any indices 1 ≤ i ≤ j ≤ n, y i:j − x i:j = σ i:j , and we know that √ j − i + 1 · i:j is subgaussian, that is, for any t ≥ 0. Now we set t = 2 log n 2 +n δ , and take a union bound over all n + n 2 = n 2 +n 2 possible pairs of indices i ≤ j. We then have P max 1≤i≤j≤n j − i + 1 · i:j ≤ 2 log n 2 + n δ ≥ 1 − δ.
Setting ψ(t) = √ t proves that, on this event, x − y SW ψ ≤ σ 2 log n 2 +n δ , as desired. For the bound in expectation, we have a similar calculation: it is known that E [max k=1,...,N |Z k |] ≤ 2 log(2N ) for any (not necessarily independent) subgaussian random variables Z k . Setting Z k = √ j − i + 1 · i:j for each of the N = n 2 +n 2 possible pairs i, j, we obtain E max 1≤i≤j≤n j − i + 1 · i:j ≤ 2 log(n 2 + n). 06.18.17 Proof of Lemma 4. Assume that x is iso -monotone, and fix any index 1 ≤ i ≤ n. Let j = max{k ≤ n : iso(x) k = iso(x) i }. Then i ≤ j ≤ n, iso(x) i = iso(x) j , and either j = n or iso(x) j < iso(x) j+1 . Therefore, we must have x j ≤ iso(x) j by properties of the isotonic projection. (This is because, if x j > iso(x) j , then writing e j for the jth basis vector and taking some sufficiently small > 0, the vector iso(x) + · e j is an isotonic vector that is strictly closer to x than iso(x), which is a contradiction.) Therefore, x i ≤ x j + iso ≤ iso(x) j + iso = iso(x) i + iso . The reverse bound is proved similarly. Now we turn to the converse. For any 1 ≤ i ≤ j ≤ n, we have x i ≤ iso(x) i + ≤ iso(x) j + ≤ x j + 2 , where the first and third inequalities use the bound x − iso(x) ∞ ≤ , while the second uses the fact that iso(x) is monotone.
In general, it is known that a cyclic projection algorithm starting at some point x = x (0) is guaranteed to converge to some point in the intersection of the respective convex sets, i.e. lim t→∞ x (t) = x * ∈ ∩ n−1 i=1 K i = K iso , but without any assumptions on the nature of the convex sets K i , this point may not necessarily be the projection of x onto the intersection of the sets (Bregman [4], Han [15]). Therefore, we need to check that for our specific choice of the sets K i , the cyclic projection algorithm (15) in fact converges to iso(x) as claimed in the lemma.
Finally, we need to prove (20). Fix any index i and any x ∈ R n . If x i ≤ x i+1 , then iso i (x) = x and the statement holds trivially. If not, then x i > x i+1 and we have iso i (x) = A i x (recalling the definition of A i in (14) earlier). Now let y = iso(x) and z = iso(A i x). It is trivially true that, since x i > x i+1 , we must have y i = y i+1 . Also, x − y, z − y ≤ 0 by properties of projection to the convex set K iso , so we can calculate where the last step holds since z i ≤ z i+1 due to z ∈ K iso and x i ≥ x i+1 by assumption. We also have A i x − z 2 2 ≤ A i x − y 2 2 since z = iso(A i x), which combined with (21) proves that y = z. Thus (20) holds, as desired.