An Improved Global Risk Bound in Concave Regression

A new risk bound is presented for the problem of convex/concave function estimation using the least squares estimator. The best known risk bound, which appeared in \citet{GSvex}, scaled like $\log(en) n^{-4/5}$ under the mean squared error loss, up to a constant factor. The authors of \cite{GSvex} conjectured that the logarithmic term may be an artifact of their proof. We show that the logarithmic term is indeed unnecessary and prove a risk bound which scales like $n^{-4/5}$ up to constant factors. Our proof technique uses one more peeling step than a usual chaining-type argument. Our risk bound holds in expectation as well as with high probability, and it also extends to the case of model misspecification, where the true function may not be concave.


Introduction
In this paper we consider the problem of estimating a concave function in a standard additive noise regression model. We observe $y = (y_1, \dots, y_n)$, which are noisy evaluations of an underlying concave function $f$ defined on the unit interval $[0, 1]$ at sorted design points $x = (x_1 < \dots < x_n) \in [0, 1]^n$, as follows:
\[ y_i = f(x_i) + z_i, \quad i = 1, \dots, n, \tag{1.1} \]
with $z = (z_1, \dots, z_n)$ being independent $N(0, \sigma^2)$ errors for some unknown $\sigma$. The task of estimating the regression function $f$, knowing that the underlying $f$ is concave, is known as the concave regression problem. This problem has a relatively long history.
Interestingly, the problem of concave regression was discussed in the literature by Hildreth (1954) even before the problem of monotone regression (see Robertson et al. (1988), Ayer et al. (1955)), which is perhaps the most well studied problem in shape constrained estimation. Hildreth (1954) discusses some applications in economics where concavity arises naturally, such as in production functions and utility functions. A very natural estimator of a concave regression function is the constrained least squares estimator (LSE), which is defined as follows:
\[ \hat f \in \arg\min_{f \text{ concave}} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2. \tag{1.2} \]
As is clear, the LSE $\hat f$ is not unique but is uniquely defined at the design points. This estimator was proposed in Hildreth (1954) along with quadratic programming methods for solving (1.2).
The first statistical analysis of the LSE was carried out by Hanson and Pledger (1976), where the consistency of the LSE was established under the supremum loss over a compact interval. The problem of estimating the concave function under a local loss function, that is, the problem of estimating the regression function $f$ at a given point $x_0$, has also received attention in the literature. Notable papers are Mammen (1991) and Groeneboom et al. (2001a), where the asymptotic distribution of the LSE at the point $x_0$ was studied. The first work which studied rates of convergence of the LSE under a global loss appeared in Dümbgen et al. (2004). The authors there showed that under the supremum loss over a compact interval containing the design points, the rate of convergence varies from $(\log n / n)^{1/3}$ to $(\log n / n)^{2/5}$ depending on the smoothness of the underlying concave function. For instance, if the regression function is also twice differentiable, the rate of convergence under the supremum loss is $(\log n / n)^{2/5}$.

In the most recent work on concave regression under a global loss function, Guntuboyina and Sen (2013) considered the following sequence formulation of the concave regression problem. Assuming that the design points are equally spaced, note that the mean regression vector $(f(x_1), \dots, f(x_n))$ lies in a polyhedral cone $K_n \subset \mathbb{R}^n$, $n \ge 3$ (defined by the concavity constraint on $f$), given as follows:
\[ K_n = \{\theta \in \mathbb{R}^n : 2\theta_i \ge \theta_{i-1} + \theta_{i+1}, \; i = 2, \dots, n-1\}. \]
Then it is clear that if $x_1 < x_2 < \dots < x_n$ are equispaced design points in $\mathbb{R}$, one can also write $K_n = \{\theta \in \mathbb{R}^n : \theta = (f(x_1), \dots, f(x_n)) \text{ for some concave function } f : \mathbb{R} \to \mathbb{R}\}$.
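As a small illustration of the cone $K_n$, the following sketch checks membership through the second-difference inequalities $2\theta_i \ge \theta_{i-1} + \theta_{i+1}$. The function names, the tolerance, and the grid $x_i = i/n$ are assumptions made for this example only.

```python
def is_concave_sequence(theta, tol=1e-12):
    """Return True if 2*theta[i] >= theta[i-1] + theta[i+1] at every interior index."""
    return all(
        2.0 * theta[i] - theta[i - 1] - theta[i + 1] >= -tol
        for i in range(1, len(theta) - 1)
    )


def evaluate_on_grid(f, n):
    """Evaluate f at the equispaced points i/n, i = 1..n (an assumed grid on [0, 1])."""
    return [f(i / n) for i in range(1, n + 1)]


# Evaluations of a concave function on an equispaced grid form a concave sequence.
concave_values = evaluate_on_grid(lambda x: -(x - 0.3) ** 2, 10)
# Evaluations of a strictly convex function violate the constraint.
convex_values = evaluate_on_grid(lambda x: (x - 0.3) ** 2, 10)
```

Affine sequences satisfy every constraint with equality, which is why the subspace of affine sequences sits inside the cone.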
The problem of estimating the entire concave function in the sequence model then becomes the problem of estimating $\theta^* \in K_n$ from observations
\[ y_i = \theta^*_i + z_i, \quad i = 1, \dots, n, \tag{1.3} \]
where the $z_i$ are i.i.d. $N(0, \sigma^2)$ random variables.
In this paper, we study this sequence formulation of the concave regression problem, assuming equally spaced design points. Specifically, we study the risk properties of the least squares estimator $\hat\theta$ defined as
\[ \hat\theta = \arg\min_{\theta \in K_n} \|y - \theta\|^2, \tag{1.4} \]
where $\|\cdot\|$ refers to the usual Euclidean norm on $\mathbb{R}^n$. We are interested in studying the risk behaviour of $\hat\theta$ under the natural mean squared error loss function, defined as
\[ R(\hat\theta, \theta^*) = \frac{1}{n} E \|\hat\theta - \theta^*\|^2, \tag{1.5} \]
where the expectation is taken under the distribution of $y$ as given in (1.3).
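The estimator in (1.4) is a Euclidean projection onto a polyhedral cone, so it can be computed numerically. The sketch below is one possible way to do so and is not the quadratic programming method of Hildreth (1954): it runs projected gradient on the dual problem, using the fact that the projection equals $y - D^{T}\lambda$ for the nonnegative $\lambda$ minimizing $\|y - D^{T}\lambda\|^2$, where $D$ is the second-difference matrix. The function names, step size and iteration count are assumptions chosen for small $n$.

```python
def second_differences(theta):
    """(D theta)_i = theta[i-1] - 2*theta[i] + theta[i+1]; all <= 0 iff theta is concave."""
    return [theta[i - 1] - 2.0 * theta[i] + theta[i + 1] for i in range(1, len(theta) - 1)]


def apply_Dt(lam, n):
    """Compute D^T lam, where D is the (n-2) x n second-difference matrix."""
    out = [0.0] * n
    for j, l in enumerate(lam):  # row j of D touches coordinates j, j+1, j+2
        out[j] += l
        out[j + 1] -= 2.0 * l
        out[j + 2] += l
    return out


def concave_lse(y, iters=20000, step=1.0 / 16.0):
    """Project y onto the cone of concave sequences by dual projected gradient.

    step = 1/16 is safe because the largest eigenvalue of D D^T is at most 16.
    """
    n = len(y)
    lam = [0.0] * (n - 2)
    theta = list(y)
    for _ in range(iters):
        grad = second_differences(theta)  # equals minus the dual gradient
        lam = [max(0.0, l + step * g) for l, g in zip(lam, grad)]
        corr = apply_Dt(lam, n)
        theta = [yi - ci for yi, ci in zip(y, corr)]
    return theta


y = [0.0, 1.0, 0.5, 2.0, 1.5, 1.8, 0.2, 0.4]
fit = concave_lse(y)
```

The number of iterations needed grows quickly with $n$, so for larger problems a dedicated quadratic programming or nonnegative least squares solver would be the more practical choice.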
To describe the results of Guntuboyina and Sen (2013) under the risk function $R(\hat\theta, \theta^*)$, we first have to set up some notation. Let $L$ be the subspace of $\mathbb{R}^n$ spanned by the constant vector $(1, \dots, 1)$ and the vector $(1, 2, \dots, n)$. In words, $L$ is the subspace of $n$ dimensional affine sequences. Let $P_L$ denote the orthogonal projection matrix onto the subspace $L$. Define $G(\theta^*) = \max\{1, \frac{1}{n}\|(I - P_L)\theta^*\|^2\}$. Then Theorem 2.2 in Guntuboyina and Sen (2013) shows that for any $\theta^* \in K_n$ there exists a universal constant $C$ such that whenever the sample size satisfies $n \ge C \frac{\sigma^2}{G(\theta^*)^2} \left(\log \frac{en}{2}\right)^{5/4}$, we have

To interpret the theorem, it is instructive to think of a regime where $G(\theta^*)$ stays bounded as $n$ grows. For example, this would be the case if $\theta^*$ are evaluations on a grid of a concave function. In this case, the bound in (1.6) yields $R(\hat\theta, \theta^*) \le \left(\log \frac{en}{2}\right) n^{-4/5}$ up to a multiplicative constant factor. Guntuboyina and Sen (2013) conjectured that there should be no logarithmic term in the upper bound and that the logarithmic term is perhaps an artifact of their proof. The goal of this paper is to show that one can derive a risk bound in the same setting as in (1.6), but without the logarithmic factor.
The paper is organized as follows. In Section 2 we present our main result on the global risk bound, comment on its interpretations, and sketch the proof ideas. In Section 3, we present the proof of our main result. In Section 4, we state and prove an extension of our main result to the case of model misspecification.

Main Result
We now describe the main result of this paper and discuss its consequences. For any vector $\theta \in \mathbb{R}^n$ let us define $V(\theta) = \max_{1 \le i \le n} \theta_i - \min_{1 \le i \le n} \theta_i$.
Theorem 2.1. Fix any positive integer $n$. Fix any $\theta^* \in K_n$. Fix any $x > 0$. The following upper bound on the risk holds for a universal constant $C$ with probability greater than $1 - \exp(-x) - \exp(-x^2/16)$:

Moreover, the above high probability risk bound also immediately implies a risk bound in expectation for a universal constant $C$:

(2.2)

Theorem 2.1 proves a high probability bound for $\frac{1}{n}\|\hat\theta - \theta^*\|^2$, which then immediately leads to a bound in expectation. We show in Section 4 that we can also extend our risk bound to the case of model misspecification. By model misspecification, we mean the case when the true underlying sequence $\theta^*$ is not concave. In this case, we prove that the risk of the LSE $R(\hat\theta, \theta^*)$ is at most the squared Euclidean distance of $\theta^*$ to $K_n$ divided by $n$, plus a term which goes to zero at the rate $n^{-4/5}$.
There are two essential differences between the risk bound (2.2) and the bound (1.6). The first difference is that we have no logarithmic terms in our risk bound. In this way, we improve over the best known risk bound in concave/convex regression. The second difference is that the term $G(\theta^*)$ is replaced by $V((I - P_L)\theta^*)$. In general, $V((I - P_L)\theta^*)$ could be larger than $G(\theta^*)$, but it stays bounded by a constant if $\theta^*$ is a vector of evaluations of a concave function $f$ on a grid.
Our analysis actually carries through for slightly more general design points than just equally spaced ones; see Remark 3.2 for more explanation. Coming to the issue of optimality of the LSE, the rate $n^{-4/5}$ is known to be minimax rate optimal even if one considers the parameter space to be an appropriate local ball around a convex function with positive curvature everywhere; see Theorem 5.1 in Guntuboyina and Sen (2013) for the precise result. Apart from minimax rate optimality, a further motivation to study the LSE in concave regression is that it was shown to have automatic adaptation properties, first in Guntuboyina and Sen (2013) and subsequently in Chatterjee et al. (2015) and Bellec (2015). Specifically, it was shown that the risk $R(\hat\theta, \theta^*)$ scales like $\frac{k}{n} \log n$ for piecewise linear concave sequences $\theta^*$ with $k$ pieces. In this paper, though, our focus is not on the automatic adaptive properties of the LSE, but rather on proving an improved worst case performance.
It is worthwhile to point out that we do not make any smoothness assumptions on the underlying concave function. In the result of Dümbgen et al. (2004), the loss function is the supremum loss over an appropriate compact interval. Notwithstanding some details, the mean squared error loss function we consider is naturally bounded by the square of the supremum loss function considered in Dümbgen et al. (2004). Hence their risk bound, applied directly to the mean squared error loss, scales like $(\log n / n)^{4/5}$ only when the underlying concave function is twice differentiable. Our analysis shows that when the design points are equally spaced, we do not need any smoothness assumptions on the underlying concave function to obtain the $n^{-4/5}$ rate, in addition to showing that the logarithmic factor is unnecessary.
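The quantity $V((I - P_L)\theta^*)$ can be computed directly: projecting onto $L$ amounts to an ordinary least-squares fit of $\theta^*$ by an affine sequence $a + b\,i$, and $V$ is the range of what remains. The sketch below illustrates this; the function names are assumptions made for the example.

```python
def project_onto_affine(theta):
    """Least-squares fit of theta by a + b*i, i = 1..n, i.e. the projection P_L theta."""
    n = len(theta)
    idx = list(range(1, n + 1))
    mean_i = sum(idx) / n
    mean_t = sum(theta) / n
    sxx = sum((i - mean_i) ** 2 for i in idx)
    sxy = sum((i - mean_i) * (t - mean_t) for i, t in zip(idx, theta))
    slope = sxy / sxx
    return [mean_t + slope * (i - mean_i) for i in idx]


def value_range(theta):
    """V(theta) = max_i theta_i - min_i theta_i."""
    return max(theta) - min(theta)


def detrended_range(theta):
    """V((I - P_L) theta): the range of theta after removing its best affine fit."""
    fit = project_onto_affine(theta)
    return value_range([t - f for t, f in zip(theta, fit)])


theta = [-(i / 10 - 0.5) ** 2 for i in range(1, 11)]
# Adding an affine sequence changes V(theta) but not V((I - P_L) theta).
shifted = [t + 3.0 + 0.7 * i for i, t in enumerate(theta, start=1)]
```

This invariance under affine shifts is the reason the bound is stated in terms of the detrended sequence rather than $\theta^*$ itself.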

Proof Sketch for Theorem 2.1
The goal of this subsection is to provide a high level overview of the method of proof of Theorem 2.1. We take the standard empirical process based approach in proving our risk bound. Specifically, we follow the recipe proposed by Chatterjee (2014), which says that a key ingredient in proving risk bounds for the LSE is to control an expected Gaussian supremum, described as follows. Let $\langle a, b \rangle$ denote the usual inner product between any two vectors $a, b$. For any $\theta^* \in \mathbb{R}^n$ define the function $f_{\theta^*} : \mathbb{R}_+ \to \mathbb{R}$ as follows:
\[ f_{\theta^*}(t) = E \sup_{\theta \in K_n : \|\theta - \theta^*\| \le t} \langle z, \theta - \theta^* \rangle, \tag{2.3} \]
where $z$ is a random Gaussian vector whose entries are independent with mean zero and variance $\sigma^2$. Chatterjee (2014) also shows that if one obtains $s > 0$ such that $f_{\theta^*}(s) \le \frac{s^2}{2}$, then essentially one gets the risk bound $R(\hat\theta, \theta^*) \le s^2/n$ up to constant factors. Precise statements are given in later sections. Therefore, it suffices to tightly upper bound the function $f_{\theta^*}$, and this paper gives a new way of upper bounding $f_{\theta^*}(t)$ for all $t \ge 0$.
Since $f_{\theta^*}(t)$ is an expected Gaussian supremum, a standard tool in empirical process theory for upper bounding $f_{\theta^*}(t)$ is Dudley's entropy integral bound, which requires good estimates of the covering number of the set $K_n \cap B(\theta^*, t)$, where $B(\theta^*, t)$ refers to the Euclidean ball of radius $t$ centred at $\theta^*$. Tight estimates (without logarithmic factors) of the covering numbers for the space of bounded convex sequences are available in the literature; see Dryanov (2009). Specifically, the result of Dryanov (2009) gives us tight upper bounds on the covering number of the space $\{\theta \in K_n : \max_{1 \le i \le n} |\theta_i| \le B\}$ for any constant $B$. But we require covering numbers for $K_n$ intersected with a Euclidean ball of radius $t$. This was done in Guntuboyina and Sen (2013) by applying the basic result of Dryanov (2009) in appropriate subintervals. This approach gives rise to the logarithmic factors in the risk bound of Guntuboyina and Sen (2013).
Our approach is to do one more step of refining the function $f_{\theta^*}$. Essentially, for any $\theta \in K_n \cap B(\theta^*, t)$ we define its truncated version $\theta' \in K_n'$, where the set $K_n' \subset \mathbb{R}^n$ is not much larger than $K_n$ and hence has comparable metric entropy. Note that one can write
\[ f_{\theta^*}(t) \le E \sup_{\theta \in K_n \cap B(\theta^*, t)} \langle z, \theta - \theta' \rangle + E \sup_{\theta \in K_n \cap B(\theta^*, t)} \langle z, \theta' - \theta^* \rangle. \tag{2.4} \]
We control the two terms on the right side of the above inequality separately. We show it is possible to define the truncation $\theta'$ of $\theta$ such that it satisfies two critical properties. The first property is that the first term on the right side of (2.4) is upper bounded by $\frac{t^2}{4}$. Since we finally have to compare $f_{\theta^*}(t)$ with $\frac{t^2}{2}$, an extra term of $\frac{t^2}{4}$ can only affect the risk bound up to constants. We are then left with the task of upper bounding the second term on the right side of (2.4). The second critical property is that $\theta'$ is bounded entrywise by $C(V(\theta^*) + \sigma)$, where $C$ is a universal constant. This means that $\theta'$ is bounded by a constant factor. Hence the second term on the right side of (2.4) can be controlled by a direct application of Dudley's entropy integral bound. This only requires tight estimates of the covering number of bounded sequences in $K_n'$. We show that this covering number is very similar to the one given in Dryanov (2009) (and hence has no logarithmic factors), as $K_n'$ is not much bigger than $K_n$. In this way our extra refinement step enables us to save a logarithmic factor of $n$.
The crux of this paper lies in defining $\theta'$ as described in the previous paragraph. This is done in Lemma 3.5, which is perhaps the most important step in our entire argument. In Lemma 3.5 we actually assume that the underlying concave sequence $\theta^*$ is also monotonic. Since a concave sequence always first increases and then decreases, it suffices to analyze the case when $\theta^*$ is monotonic and concave. We use the monotonicity of $\theta^*$ crucially in coming up with a definition of the truncation $\theta'$ of $\theta$ satisfying the critical properties explained in the previous paragraph.

Proof of Theorem 2.1
The goal of this section is to prove Theorem 2.1, which improves the best known risk bound in concave regression by removing a logarithmic term. Before starting the proof, we first go through some background results. By now it is known that a key ingredient in proving risk bounds for the LSE is to control the expected Gaussian supremum function $f_{\theta^*}$ defined in (2.3). For any $\theta^* \in K_n$, it was shown in Theorem 1.1 in Chatterjee (2014) that the loss term $\|\hat\theta - \theta^*\|$ concentrates around a deterministic value, namely the maximizer of $t \mapsto f_{\theta^*}(t) - \frac{t^2}{2}$. Another result providing upper bounds on the loss term $\|\hat\theta - \theta^*\|$ in terms of the function $f_{\theta^*}$ has been given in Theorem 12 in Bellec (2015), which we use as our starting point. Before describing this result, we claim that since the projection onto the cone $K_n$ is a sum of the projection onto the subspace $L$ and the projection onto the cone $K_n \cap L^\perp$, it almost suffices to study the risk of the least squares estimator constrained to the cone $K_n \cap L^\perp$, denoted by $\hat\theta_{K_n \cap L^\perp}$. Specifically we mean
\[ \hat\theta_{K_n \cap L^\perp} = \arg\min_{\theta \in K_n \cap L^\perp} \|y - \theta\|^2. \]
Here $L^\perp$ is the subspace of $\mathbb{R}^n$ orthogonal to $L$. We make this claim clearer when we prove Theorem 2.1. With this viewpoint, we now state the following lemma, which is a direct consequence of Theorem 12 in Bellec (2015), applied to the cone $K_n \cap L^\perp$.

Then for any $x > 0$ the following inequality holds with probability greater than $1 - \exp(-x)$:

Here $\hat\theta_{K_n \cap L^\perp}$ is the least squares estimator constrained to the cone $K_n \cap L^\perp$. Also, $z$ is an $n$ dimensional Gaussian random vector with mean zero and covariance matrix $\sigma^2 I$.
Remark 3.1. Theorem 12 in Bellec (2015) actually implies a slightly stronger bound, which is useful in the case of model misspecification. This issue is discussed in Section 4.
Recall the definition of the function $f_{\theta^*}$ in (2.3). Also recall that for any vector $\theta \in \mathbb{R}^n$ the range of $\theta$ is denoted by $V(\theta)$. The key step in proving Theorem 2.1 is to obtain the following upper bound on $f_{\theta^*}$, which we state as a proposition.
Proposition 3.1. Fix a positive integer $n$ and fix any $\theta^* \in K_n$. Also fix any $t > 0$. Then there exists a universal constant $C$ such that the following inequality holds:

We prove the above proposition in the next section. We first prove Theorem 2.1 assuming the above proposition is true.
Proof of Theorem 2.1. Let $K = K_n$ for simplicity. Fix $\theta^* \in K$. Recall that the subspace spanned by the $n$ dimensional vectors $(1, \dots, 1)$ and $(1, \dots, n)$ is denoted by $L$. It is clear that $L$ is the largest subspace contained in $K$. Also recall that $P_L$ is the orthogonal projection matrix onto the subspace $L$. Let the projection of any $\theta \in \mathbb{R}^n$ onto any closed convex cone $C \subset \mathbb{R}^n$ be denoted by $\Pi_C(\theta)$. The projection exists and is unique because $C$ is a closed convex set. Now by definition of the LSE and by orthogonal decomposition of projections we have

(3.1)

The first equality is a sum of squares decomposition using orthogonality. The second equality is because

This can be checked by the usual KKT conditions for the projection of a point onto a closed convex cone. Now since $L$ is a subspace of dimension 2, the term $\|P_L z\|^2$ is distributed as $\sigma^2$ times a $\chi^2$ random variable with 2 degrees of freedom. Hence by standard tail inequalities for a $\chi^2$ random variable we have the following inequality, which holds for any $x > 0$ with probability greater than $1 - \exp(-x^2/16)$:

(3.2)

Now since $\mu^* \in K \cap L^\perp$ and $\|\Pi_{K \cap L^\perp}(\mu^* + z) - \mu^*\|^2$ is exactly the squared error loss term for the least squares estimator constrained to the cone $K \cap L^\perp$, we can use Lemma 3.1 to upper bound $\|\Pi_{K \cap L^\perp}(\mu^* + z) - \mu^*\|^2$. First we use Proposition 3.1 to get the following inequality for all $t \ge 0$:

where $C$ is a universal constant. Now it is not too hard to check that by setting $s = C \big(\sigma (V(\mu^*) + \sigma)^{1/4} n^{1/8}\big)^{4/5}$ for a large enough universal constant $C$ we have $f_{\mu^*}(s) < \frac{s^2}{2}$. Using this value of $s$ in Lemma 3.1 along with the last display then gives us the following bound, which holds for any $x > 0$ with probability greater than $1 - \exp(-x)$:

(3.3)

Combining the upper bounds (3.2) and (3.3) in (3.1) by a simple union bound argument finishes the proof of the high probability bound in (2.1). To prove the risk bound in expectation, let us denote $W = \|\Pi_{K \cap L^\perp}(\mu^* + z) - \mu^*\|^2$. Then we have the following inequality for any $v \ge 0$ due to (3.3):

where $C$ is a universal constant. Also we have $E\|P_L z\|^2 = 2\sigma^2$, since $\|P_L z\|^2 / \sigma^2$ is a $\chi^2$ random variable with 2 degrees of freedom. The last two displays along with (3.1) finish the proof of the risk bound (2.2).
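The choice of $s$ in the proof can be illustrated numerically. The sketch below assumes, purely for illustration, an upper bound of the shape $A t^{3/4} + t^2/4$ with $A = \sigma (V + \sigma)^{1/4} n^{1/8}$; this shape is consistent with the value of $s$ chosen above, but it is not quoted from Proposition 3.1, and the universal constant is set to 1. Under that assumption the equation $A s^{3/4} = s^2/4$ has the solution $s = (4A)^{4/5}$, and $s^2/n$ scales like $n^{-4/5}$.

```python
def critical_radius(A):
    """Solve A * s**0.75 == s**2 / 4 for s > 0, which gives s = (4A)**(4/5)."""
    return (4.0 * A) ** 0.8


def assumed_bound(t, A):
    """Hypothetical upper bound on f(t) of the form A*t^{3/4} + t^2/4 (an assumption)."""
    return A * t ** 0.75 + t ** 2 / 4.0


def rate(n, sigma, V):
    """s^2 / n with A = sigma * (V + sigma)^{1/4} * n^{1/8} and the constant C set to 1."""
    A = sigma * (V + sigma) ** 0.25 * n ** 0.125
    s = critical_radius(A)
    return s ** 2 / n


# Multiplying the rate by n^{4/5} removes the dependence on n.
scaled_small = rate(100, 1.0, 1.0) * 100 ** 0.8
scaled_large = rate(10000, 1.0, 1.0) * 10000 ** 0.8
```

At the critical radius the assumed bound equals exactly $s^2/2$, so any larger constant in front of $s$ gives the strict inequality required by the argument.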
Remark 3.2. We have defined $K$ to be the space of concave sequences obtained by evaluating a concave function $f : [0, 1] \to \mathbb{R}$ on equally spaced design points $(x_1, \dots, x_n) \in \mathbb{R}^n$. Our risk bound calculations carry through if $(x_1, \dots, x_n)$ are not equally spaced, the only difference being that the universal constant $C$ would be replaced by a constant depending only on the ratio $\max_i (x_{i+1} - x_i) / \min_i (x_{i+1} - x_i)$. This means that our risk bounds continue to hold in the slightly more general situation where the gaps between consecutive design points lie between $\frac{c_1}{n}$ and $\frac{c_2}{n}$ for some constants $c_1, c_2$. This is in the same spirit as the risk analysis given in Guntuboyina and Sen (2013).

Proof of Proposition 3.1
In order to prove Proposition 3.1, we will prove several lemmas along the way. We are required to upper bound the expected Gaussian supremum function $f_{\theta^*}$. To do this, we use a standard chaining bound as follows.
Theorem 3.1 (Chaining). Let $F \subset \mathbb{R}^n$ and fix any $\theta^* \in F$. Let $d = \sup_{\theta, \theta' \in F} \|\theta - \theta'\|$ be the diameter of $F$. Then we have

where $z$ is again an $n$ dimensional Gaussian random vector with mean zero and covariance matrix $\sigma^2 I$.
We also use a standard Gaussian concentration inequality. The proof can be found in the argument after equation (2.35) in Ledoux (2001).
Theorem 3.2 (Gaussian Concentration Inequality). Let $z$ be an $n$ dimensional Gaussian random vector with covariance matrix $\sigma^2 I$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function that is $L$-Lipschitz, that is, it satisfies $|f(x) - f(y)| \le L \|x - y\|$ for all $x$ and $y$, where $L$ is a positive constant. Then the following is true for any $t \ge 0$:

We also need a log covering number bound for the space of bounded convex functions defined on the unit interval, as proved in Dryanov (2009). We first set up some notation. For a metric space $F$ with metric $D$ and $\epsilon > 0$, let $N(\epsilon, F, D)$ denote the $\epsilon$-covering number of $F$ under the metric $D$. That is, $N(\epsilon, F, D)$ is the minimum number of balls of radius $\epsilon$ required to cover $F$. If $F \subset \mathbb{R}^n$ and $D$ is the usual Euclidean metric, we simply denote the covering number by $N(\epsilon, F)$.
Lemma 3.2 (Dryanov). Let $C[0, 1, B]$ be the space of real valued concave functions defined on the unit interval $[0, 1]$ with absolute value bounded by $B$ for some

Modifying the above result, we can now prove a log covering number bound for the set of concave sequences bounded by a number $B$.
The following is true for any $n \ge 3$ and a universal constant $C$,

and extend $f_\theta$ to all other points in the unit interval by linear interpolation. Clearly $f_\theta \in C[0, 1, B]$. For each $g \in F$, check whether there exists $\nu \in K_{n,B}$ such that $L_2(g, f_\nu) \le \tau$. If there is, choose such a $\nu$ arbitrarily and name it $\nu_g$. Let $F'$ be the set of such $\nu_g$ obtained as we vary $g \in F$. Clearly we then have, for a universal constant $C$,

(3.4)

Now we claim that a $2\tau$ covering set for $K_{n,B}$ has cardinality at most $|F'|$, which will suffice to prove the lemma.
Take any $\theta \in K_{n,B}$. By definition of $F$ there exists $g \in F$ such that $L_2(f_\theta, g) \le \tau$. Therefore for this $g$, there exists $\nu_g \in F'$ such that $L_2(g, f_{\nu_g}) \le \tau$. Hence by the triangle inequality, we have $L_2(f_\theta, f_{\nu_g}) \le 2\tau$. Now a direct consequence of Lemma A.4 in Guntuboyina and Sen (2013) shows the following for a universal constant $C$:

Setting the value of $\tau$ accordingly in (3.4) now finishes the proof of the lemma.
The space of $n$ dimensional vectors with at most three concave blocks plays an important role in our analysis. We now define the space of sequences with at most three concave blocks as follows.
For any vector $\theta$

The interpretation is that if $m_1 = 0$ then the first block is empty. Likewise, if $m_2 = n + 1$ then the third block is empty. It is not too difficult to extend Lemma 3.3 to give a bound on the log covering number of all bounded sequences in $K_3$, as shown in our next lemma.
Lemma 3.4. For every $\epsilon > 0$ we have the following:

There are three concave pieces, and it suffices to cover each of them separately at radius $\epsilon / \sqrt{3}$ to get an $\epsilon$ cover for sequences which are concave on $B_1$, $B_2$ and $B_3$ separately. Using Lemma 3.3 on each of these pieces, one obtains a log covering number bound which is at most $3 C n^{1/4} (\frac{B}{\tau})^{1/2}$. Now there are exactly $\binom{n+2}{2}$ ways of choosing $m_1$ and $m_2$. We could cover each of the spaces defined by a fixed $m_1, m_2$ at radius $\epsilon$ separately and take the union of the covers. That would be a cover for $K_3$ at radius $\epsilon$. The log cardinality of this cover is clearly upper bounded by $C n^{1/4} (\frac{B}{\epsilon})^{1/2} + \log \binom{n+2}{2}$. This finishes the proof of the lemma.
We now embark on proving a key result which gives an upper bound on the function $f_{\theta^*}$ in the case that $\theta^*$ is a concave monotonic sequence. As mentioned before, since any concave function first increases and then decreases, a critical step is to understand the behaviour of $f_{\theta^*}$ when $\theta^*$ is monotonic in addition to being concave. Recalling the definition of $f_{\theta^*}$ in (2.3), it is an expected supremum of Gaussian random variables, where the supremum is over all $\theta \in K_n$ which also lie within a Euclidean ball around $\theta^*$. As a first step we only take the supremum over all $\theta \in K_n$ lying within a Euclidean ball around $\theta^*$ and having a maximum at a fixed index $k$. To explain further, let us define the set $C_k$ of concave sequences with maximum at $k$ as follows:

Our next result is a key lemma controlling the expected Gaussian supremum $E \sup_{\theta \in C_k : \|\theta - \theta^*\| \le t} \langle z, \theta - \theta^* \rangle$, where the supremum is taken over the restricted set $C_k$ intersected with a Euclidean ball. Our bound holds uniformly over $k$, as a function of $t$, whenever the underlying $\theta^*$ is concave and monotonic (non decreasing or non increasing).
Lemma 3.5. Fix a positive integer $n$ and fix any $1 \le k \le n$. Also fix a non decreasing concave sequence $\theta^* \in K_n$. For all $t \ge 0$ the following inequality is true for a universal constant $C$:

Remark 3.3. By symmetry, the conclusion of Lemma 3.5 also holds when $\theta^*$ is concave and non increasing.
Let us define $A$ as follows:

where $L$ is a fixed positive number to be chosen later.
For any $\theta \in A$ we will define a truncated version of $\theta$ belonging to $A'$, which will be denoted by $\theta'$. Then we will have the inequality

(3.5)

Fix an arbitrary $\theta \in A$. Let us denote $S_1 = \{i : \theta_i < \theta^*_1 - L\}$ and $S_2 = \{i : \theta_i > \theta^*_n + L\}$. We now define $\theta'$ as follows for each $1 \le i \le n$:

where $I$ is the usual indicator function.
From the above definition, it is clear that $\min_{1 \le i \le n} \theta'_i \ge \theta^*_1 - L$ and $\max_{1 \le i \le n} \theta'_i \le \theta^*_n + L$. Also, by construction of $\theta'$, we have the following contractive property for any $1 \le i \le n$:

Now by concavity of $\theta$, the set $S_1$ is necessarily a union of at most two intervals of the form

[Figure: The proof of Lemma 3.5 is based on a truncation $\theta'$ (in red) of an arbitrary concave sequence $\theta \in C_k$ (in black) with respect to a fixed monotone increasing and concave sequence $\theta^*$ (in blue). The tails of $\theta$ are raised.]

As defined, $\theta'$ is a constant vector on the two intervals in $S_1$ and remains a concave vector on the complement of $S_1$, which is an interval. Hence $\theta' \in K_3$. Combining these properties of $\theta'$ implies that $\theta' \in A'$.
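The properties just listed can be checked on a small example. The sketch below reads the truncation as clipping $\theta$ entrywise to the interval $[\theta^*_1 - L, \theta^*_n + L]$; this reading is an assumption inferred from the definitions of $S_1$ and $S_2$ and from the stated bounds on $\theta'$, and the function name and the numbers are illustrative only.

```python
def truncate(theta, theta_star, L):
    """Clip theta entrywise to [theta_star[0] - L, theta_star[-1] + L].

    Assumes theta_star is non decreasing, so its first entry is its minimum
    and its last entry is its maximum.
    """
    lo = theta_star[0] - L
    hi = theta_star[-1] + L
    return [min(max(t, lo), hi) for t in theta]


theta_star = [0.0, 0.5, 0.9, 1.2, 1.4]   # non decreasing and concave
theta = [-5.0, 0.0, 3.0, 1.0, -4.0]      # concave, with its maximum at index k = 3
L = 1.0
truncated = truncate(theta, theta_star, L)

lo, hi = theta_star[0] - L, theta_star[-1] + L
# Contractive property: truncation never moves an entry further away from theta_star.
contractive = all(
    abs(tp - ts) <= abs(t - ts) for tp, t, ts in zip(truncated, theta, theta_star)
)
```

The contractive property holds because every entry of a non decreasing $\theta^*$ already lies inside the clipping interval.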
We now proceed to control the first term on the right side of the inequality in (3.5). We have the following inequality by definition of $\theta'$:

[Figure: The indicated set is an interval of length no greater than $2 v_j$. Similarly, the set $\{i \in S_1 : 2^j L < \theta^*_1 - \theta_i\}$ is the union of at most two intervals, each of size no larger than $v_j$.]
where $I$ denotes the indicator function and the last inequality follows from the inequalities

Fix a non negative integer $j$. We now note that for any $\theta \in A$, since $\|\theta - \theta^*\| \le t$, we have

(3.9)

We now make some further observations. Since $\theta$ is concave, any set of the form $\{i : \theta_i < a\}$ for some number $a$ is necessarily a union of at most two intervals. One interval, if non empty, has to contain the index 1, and the other interval, if non empty, has to contain the index $n$. Hence there exist integers $0 \le w_1 < w_2 \le n + 1$ such that the following holds:

where the first inequality is because $\theta^*$ is non decreasing and the second inequality is due to (3.9). The last two displays imply the following inequality:

Similarly, by concavity of $\theta$ and since $\theta \in C_k$, any set of the form $\{i : \theta_i > a\}$ for some number $a$ is necessarily an interval containing $k$ if it is non empty. Hence there exist integers $1 \le w_3 \le k \le w_4 \le n$ such that the following holds:

where the first inequality is because $\theta^*$ is non decreasing and the second inequality is due to (3.9). The last two displays imply the following inequality:

Note that in our argument $\theta$ is an arbitrary element of $A$ and the upper bound in the previous inequality does not depend on the choice of $\theta$. Therefore, using the fact that $E|z_i| = \sigma \sqrt{2/\pi} \le \sigma$ along with (3.9), the last display gives us

We now set $L = 128\sigma$ to finally get

(3.12)

Now we come to controlling the second term on the right side of (3.5). Setting $F = A'$ in the chaining result in Theorem 3.1, we get the upper bound

(3.13)

The upper limit of the integral is the diameter of $A'$, which is at most $2t$. By definition of $A'$ we can now apply Lemma 3.4 with

Using (3.13) and integrating the above expression gives us

where we have also used the elementary inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for any two positive numbers $a, b$. Combining the last display with (3.12) and (3.5) finishes the proof of the lemma.
Our next step is to extend Lemma 3.5 to the case when the supremum of the Gaussian inner products is taken over all of $K_n$ (not just $C_k$) intersected with a Euclidean ball. This result is presented in our next lemma.
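The integration step in the proof of Lemma 3.5 can be made concrete. The sketch below assumes a log covering number of the form $n^{1/4} (B/\epsilon)^{1/2}$, with the universal constant set to 1 and the $\log \binom{n+2}{2}$ term ignored; these simplifications, and the function names, are assumptions for illustration. Integrating the square root of this entropy up to $u$ gives $\frac{4}{3} n^{1/8} B^{1/4} u^{3/4}$, and the code compares this closed form with a midpoint-rule approximation.

```python
import math


def entropy_integral_closed_form(upper, n, B):
    """Integral over (0, upper] of sqrt(n**0.25 * (B / eps)**0.5) d eps."""
    return (4.0 / 3.0) * n ** 0.125 * B ** 0.25 * upper ** 0.75


def entropy_integral_numeric(upper, n, B, steps=20000):
    """Midpoint rule; the integrand behaves like eps**(-1/4), which is integrable at 0."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        eps = (k + 0.5) * h
        total += math.sqrt(n ** 0.25 * (B / eps) ** 0.5)
    return total * h


t = 1.0
closed = entropy_integral_closed_form(2.0 * t, n=1000, B=3.0)
numeric = entropy_integral_numeric(2.0 * t, n=1000, B=3.0)
```

With the upper limit $2t$, the integral grows like $n^{1/8} t^{3/4}$, a power of $t$ smaller than 2, which is what allows it to be balanced against the quadratic term $t^2$ when solving for the critical radius.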
Lemma 3.6. Fix a positive integer $n$. Also fix a concave sequence $\theta^*$ which is non decreasing or non increasing. For all $t \ge 0$ the following inequality is true for a universal constant $C$:

Proof. We prove the lemma when $\theta^*$ is a concave non decreasing sequence. The proof when $\theta^*$ is concave and non increasing is analogous. For each $1 \le k \le n$ and $t > 0$ define the random variables

We first note that

Applying Lemma A.3 (see appendix), we see that the random variables $X_k(t)$ are Lipschitz functions of $z$ with Lipschitz constant $t$. Hence, using the Gaussian concentration inequality in Theorem 3.2 for Lipschitz functions, we get for all $x > 0$ and all $1 \le k \le n$,

A standard argument involving maxima of random variables with Gaussian-like tails is given in Lemma A.2 for the sake of completeness. Using this lemma and the last display, we finally get for a universal constant $C$,

Now Lemma 3.5 gives us an upper bound on the term $\max_{1 \le k \le n} E X_k(t)$, because the upper bound in Lemma 3.5 does not depend on $k$. Using this upper bound along with the last display finishes the proof of the lemma.
We are now finally ready to prove Proposition 3.1. The main idea is to use the fact that any concave sequence first increases and then decreases, and then to apply Lemma 3.6.
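The fact that a concave sequence first increases and then decreases can be checked directly. The sketch below splits a sequence at the first index where its maximum is attained; the function names and the example sequence are assumptions made for illustration.

```python
def split_at_peak(theta):
    """Split theta into (entries up to and including the first maximum, remaining entries)."""
    peak = max(range(len(theta)), key=lambda i: theta[i])  # first index of the maximum
    return theta[: peak + 1], theta[peak + 1:]


def is_nondecreasing(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))


def is_nonincreasing(seq):
    return all(a >= b for a, b in zip(seq, seq[1:]))


# A concave sequence: its successive increments 1.0, 0.8, 0.2, -0.5, -1.3 are decreasing.
theta_star = [0.0, 1.0, 1.8, 2.0, 1.5, 0.2]
left, right = split_at_peak(theta_star)
```

Each of the two pieces is a monotone concave sequence in its own right, which is the setting covered by Lemma 3.6.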

Proof of Proposition 3.1.

where $\theta^*_1$ is an $i^*$ dimensional vector and $\theta^*_2$ is an $n - i^*$ dimensional vector. Similarly, let $z = (z_1, z_2)$. We can write

There are two terms on the right side of the above inequality. Let us first bound the first term; the second term can be bounded in exactly the same way. Since $\theta^*_1$ is non decreasing, we can use Lemma 3.6 to obtain for all $t \ge 0$:

where we also use the fact that $(i^*)^{1/8} \le n^{1/8}$. Bounding the second term similarly with Lemma 3.6, we then obtain for an appropriate universal constant $C$,

This finishes the proof.

Model Misspecification
Our risk analysis actually extends to the case of model misspecification. A natural quantity for measuring the performance of an estimator $\hat\theta$ in the misspecified case is the regret, defined as $\|\hat\theta - \theta^*\|^2 - \min_{\theta \in K_n} \|\theta - \theta^*\|^2$. The goal of this section is to prove the next theorem, which upper bounds the regret of the LSE. This theorem gives an oracle risk bound generalizing Theorem 2.1 to the case when the true underlying sequence $\theta^*$ is not necessarily concave. Recall that $\Pi_C$ denotes the projection operator onto any closed convex cone $C$, and for any vector $\theta \in \mathbb{R}^n$ the range of the vector is denoted by $V(\theta) = \max_{1 \le i \le n} \theta_i - \min_{1 \le i \le n} \theta_i$.

The above theorem implies that when the true underlying sequence $\theta^*$ is not concave, the regret of the LSE converges at the rate $n^{-4/5}$ up to constant factors.
Our starting point in proving the above theorem is a stronger version of Lemma 3.1, which is a direct consequence of Theorem 12 in Bellec (2015) when applied to the cone $K_n \cap L^\perp \subset K_n$.
We now prove Theorem 4.1 which can be done in exactly the same way as the proof of Theorem 2.1.
Proof of Theorem 4.1. Let $K = K_n$ for simplicity. Let $\theta^* \in \mathbb{R}^n$. Recall that the subspace spanned by the $n$ dimensional vectors $(1, \dots, 1)$ and $(1, \dots, n)$ is denoted by $L$. It is clear that $L$ is the largest subspace contained in $K$. Also recall that $P_L$ is the orthogonal projection matrix onto the subspace $L$. By definition of the LSE and by orthogonal decomposition of projections we have $\hat\theta = \Pi_{K \cap L^\perp}(\theta^* + z) + P_L(\theta^* + z)$. The first equality is a sum of squares decomposition using orthogonality. The second equality is because $\Pi_{K \cap L^\perp}(\theta^* + z) = \Pi_{K \cap L^\perp}(\mu^* + z)$ by definition of $\mu^*$. This can again be checked by the usual KKT conditions for the projection of a point onto a closed convex cone. Now the term $\|\Pi_{K \cap L^\perp}(\mu^* + z) - \mu^*\|^2$ can be controlled by an application of Lemma 4.1 and Proposition 3.1, which can be used since $\Pi_K(\mu^*) \in K$. The rest of the proof goes through verbatim as in the proof of Theorem 2.1.