Estimation from Non-Linear Observations via Convex Programming with Application to Bilinear Regression

We propose a computationally efficient estimator, formulated as a convex program, for a broad class of non-linear regression problems that involve difference of convex (DC) non-linearities. The proposed method can be viewed as a significant extension of the "anchored regression" method formulated and analyzed in [10] for regression with convex non-linearities. Our main assumption, in addition to other mild statistical and computational assumptions, is the availability of a certain approximation oracle for the average of the gradients of the observation functions at the ground truth. Under this assumption, and using a PAC-Bayesian analysis, we show that the proposed estimator produces an accurate estimate with high probability. As a concrete example, we study the proposed framework in the bilinear regression problem with Gaussian factors and quantify a sufficient sample complexity for exact recovery. Furthermore, we describe a computationally tractable scheme that provably produces the required approximation oracle in the considered bilinear regression problem.


Introduction
Let f_1^+, f_2^+, . . ., f_n^+ be i.i.d. copies of a random convex function f^+ : R^d → R. Similarly, let f_1^-, f_2^-, . . ., f_n^- be i.i.d. copies of a random convex function f^- : R^d → R. For simplicity, we also assume that the functions f^+ and f^- are differentiable. We observe a vector x_⋆ ∈ R^d indirectly through the measurements

y_i = f_i^+(x_⋆) − f_i^-(x_⋆) + ξ_i ,   i = 1, . . ., n,   (1)

where the ξ_i denote additive noise. Given the data y_1, . . ., y_n, the goal is to accurately estimate x_⋆ by a computationally tractable procedure. One can immediately notice the difference of convex (DC) structure in the observation model (1). A broad set of functions can be expressed in the DC form; for instance, any smooth function can be expressed in the DC form using the positive and negative semidefinite parts of its Hessian. Therefore, many parametric regression problems can be abstracted as (1).
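As a concrete illustration of the DC structure (our own example rather than a display from the text; the matrix A and its semidefinite parts A_± are generic symbols introduced only here), consider a quadratic observation function; the bilinear observations studied in Section 4 arise as a special case.

```latex
% Illustration only: A, A_+, A_- are generic symbols, not objects defined in the text.
% For a quadratic observation function f(x) = (1/2) x^T A x with symmetric A,
% splitting A = A_+ - A_- into its positive and negative semidefinite parts gives
\[
  f(x) \;=\; \underbrace{\tfrac12\, x^\top A_+\, x}_{f^+(x)\ \text{convex}}
        \;-\; \underbrace{\tfrac12\, x^\top A_-\, x}_{f^-(x)\ \text{convex}} .
\]
% A bilinear form is a special case of this splitting:
\[
  \langle a^{(1)}, x^{(1)}\rangle\,\langle a^{(2)}, x^{(2)}\rangle
  \;=\; \tfrac14\,\langle a^+, x\rangle^2 \;-\; \tfrac14\,\langle a^-, x\rangle^2 ,
\]
% where x, a^+, and a^- concatenate (x^{(1)}, x^{(2)}), (a^{(1)}, a^{(2)}), and
% (a^{(1)}, -a^{(2)}), respectively, exactly as in the bilinear model of Section 4.
```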

Assumptions
In the context of the model (1), the standard estimators based on empirical risk minimization, such as (non-linear) least squares, would lead to non-convex optimization problems that are generally computationally hard. Thus, without making any assumption, our search for a computationally efficient estimator for (1) may be futile. Below we discuss the main assumptions we rely on in our analysis.
Here and throughout we use the notation E_D or E_{D^n} to denote the expectation with respect to a single or multiple observations. The outer product of vectors is denoted by the binary operation ⊗. Furthermore, ‖·‖, ‖·‖_F, and ‖·‖_op respectively denote the usual Euclidean norm, the Frobenius norm, and the operator norm.
To avoid long expressions, for i = 1, . . ., n, we define

q_i(h) = (1/2) |⟨∇f_i^+(x_⋆) − ∇f_i^-(x_⋆), h⟩| ,   (2)

which are clearly non-negative and positively homogeneous. Therefore, by the triangle inequality, they also satisfy

q_i(h) ≤ q_i(h') + q_i(h − h')   (3)

for every pair h, h' ∈ R^d. A crucial point in our analysis is that E_D(q(z)) has linear growth in terms of ‖z‖. As becomes clear in the sequel, the central piece of our analysis is to establish a lower bound for the empirical process (1/n) Σ_{i=1}^n q_i(h) uniformly over h. If E_D(q(z)) does not admit a lower bound that grows linearly with ‖z‖, then a non-trivial lower bound for the mentioned empirical process that holds uniformly in an arbitrarily small neighborhood of the origin might not exist. The consequence is that we may get an error bound that does not vanish even with no added noise. These circumstances are observed and well understood, for instance, in the context of ratio limit theorems [17; 18], the issue of a non-trivial version space in learning problems [35; 36], and implicitly in specific applications such as one-bit compressed sensing and its generalizations [38; 39].
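For intuition about this linear-growth requirement, the following computation may help (an assumed Gaussian model used only for illustration; the vector v is not a quantity defined in the text).

```latex
% Illustration only: v is a generic random vector, not a quantity defined in the text.
% If v ~ N(0, I_d), then <v, z> ~ N(0, ||z||^2), and therefore
\[
  \mathbb{E}\,\bigl|\langle v, z\rangle\bigr| \;=\; \sqrt{\tfrac{2}{\pi}}\;\|z\| ,
\]
% i.e., the expectation grows linearly in ||z||. A lower bound of this kind is what
% allows the empirical process (1/n) \sum_{i=1}^n q_i(h) to stay bounded away from
% zero uniformly over directions h, even in a small neighborhood of the origin.
```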
We use the random function q(h) = (1/2) |⟨∇f^+(x_⋆) − ∇f^-(x_⋆), h⟩|, which has the same law as the functions q_i, to define a few important quantities below.
Non-degeneracy: Let S^{d−1} denote the unit Euclidean sphere in R^d. Given S ⊆ S^{d−1}, we define λ_D and Λ_D as in (4) and (5), respectively. The dependence of λ_D and Λ_D on S will always be clear from the context, so we do not make this dependence explicit, merely to simplify the notation.
While in general we simply choose S = S^{d−1} in (4) and (5), in some applications we may choose S to be a proper subset of S^{d−1}. This restriction helps us avoid a degeneracy that leads to λ_D = 0 and vacuous error bounds. An interesting example occurs in the bilinear regression problem discussed in Section 4.
To see that having λ_D > 0 is a sufficient condition to ensure the identifiability of x_⋆, suppose that for some x distinct from x_⋆ we have f^+(x) − f^-(x) = f^+(x_⋆) − f^-(x_⋆) almost surely, which by the convexity of f^+ and f^- implies that q(x − x_⋆) = 0 almost surely. This cannot happen, as the non-degeneracy condition λ_D > 0 ensures that q(x − x_⋆) = 0 holds almost surely only if x = x_⋆.

Regularity:
For technical reasons we also need some regularity of the data distribution. Specifically, for some constant η_D > 1, we assume that the condition (6) holds for all z ∈ R^d. The purpose of this mild assumption is to exclude pathologically heavy-tailed data distributions. Furthermore, with g denoting a standard normal random vector, we define Γ_D through (7), which is a measure of the smoothness of the functions q_i near the origin.

Approximation oracle:
We assume an approximation oracle is available that provides a vector a_0 ∈ R^d which, for some ε ∈ (0, 1], obeys (8). Having access to the approximation oracle above is the strongest assumption we make. This assumption could be excessive for prediction tasks where the goal is merely accurate approximation of f^+(x_⋆) − f^-(x_⋆) for unseen data. In this paper, however, we are analyzing an estimation task in which accurate estimation of x_⋆ is the goal. A standard approach to such estimation problems is to optimize an empirical risk that quantifies the consistency of any candidate estimate with the observations. Because these risk functions are generally non-convex, accuracy guarantees for iterative estimation procedures are often established assuming that they are initialized at a point, say x_0, in a relatively small neighborhood of the ground truth x_⋆ (i.e., x_0 ≈ x_⋆). The imposed bound (8) plays an analogous role for the proposed convex estimator. Finally, if the vectors ∇f^+(x_⋆) + ∇f^-(x_⋆) are sufficiently light-tailed, in the sense of being bounded in a certain Orlicz norm, then we can simply require a_0 to approximate the corresponding population average and then resort to a matrix concentration inequality such as the matrix Bernstein inequality [25, Theorem 2.7] or the matrix Rosenthal inequality [15; 23; 31] to recover the condition (8). We do not attempt to provide a general framework to address these details in this paper. However, in the context of the bilinear regression problem, we provide an explicit example of an implementable approximation oracle in Section 4.
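Schematically, such a concentration step can take the following form (a standard Bernstein-type bound stated with an unspecified absolute constant C; it is not the precise inequality of [25] or [15; 23; 31]).

```latex
% Schematic only: C is an unspecified absolute constant; this is a generic
% Bernstein-type bound, not the exact inequality from [25] or [15; 23; 31].
% If v_1, ..., v_n are i.i.d. copies of a random vector v in R^d with
% ||v - E v|| <= B almost surely, then with probability at least 1 - delta,
\[
  \Bigl\| \tfrac{1}{n}\sum_{i=1}^{n} v_i - \mathbb{E}\,v \Bigr\|
  \;\le\; C\!\left( \sqrt{\frac{\mathbb{E}\|v-\mathbb{E}v\|^2\,\log\!\frac{d+1}{\delta}}{n}}
          \;+\; \frac{B\,\log\!\frac{d+1}{\delta}}{n} \right).
\]
% Applied to the gradient vectors \nabla f_i^+(x_\star) + \nabla f_i^-(x_\star)
% (suitably normalized), a bound of this type transfers an approximation of the
% population average to an approximation of the empirical average, which is how
% the condition (8) can be recovered from the simpler population-level requirement.
```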
The three assumptions stated above are primarily related to the statistical model. We also make the following assumptions on the computational model in order to provide a tractable method.

Computational assumptions:
We acknowledge that our approach requires access to a DC decomposition of the (non-convex) observation function. Computing such a decomposition can be intractable in general (see, e.g., [2] and references therein). However, having access to an efficiently computable DC form is a reasonable assumption for creating a concrete computational framework. Moreover, in many applications the DC decomposition is provided explicitly. Of course, as a natural requirement for implementing optimization algorithms, we also need the components of the DC decomposition (and their gradients) to be computable.

The estimator and the main results
Given a_0, the output of the approximation oracle obeying (8), we formulate the estimator x̂ of x_⋆ as the convex program (9), which can be solved efficiently.
With the definitions and assumptions stated in Section 1.1, our main result is the following theorem, which provides the sample complexity for the accuracy of (9) in a generic setting. This theorem is a simple consequence of Proposition 1 and Lemma 1 stated further below.
Theorem 1. Given a set S ⊆ S^{d−1} and a parameter ε ∈ (0, 1), suppose that (4), (5), (6), (7), and (8) hold. Furthermore, for a solution x̂ of (9), suppose that the condition (10) holds. If the number of measurements n obeys the stated lower bound, for a sufficiently large absolute constant C > 0, then with probability ≥ 1 − δ the stated error bound holds.
Proof. Given the lower bound on n, Proposition 1 below together with (4) guarantees that, with probability ≥ 1 − δ, the required uniform lower bound holds for all h ∈ S. The desired error bound follows immediately from Lemma 1 with x_0 = x_⋆ and ε_0 = 0.
In a generic scenario we may take S = S^{d−1}, in which case the condition (10) trivially holds. The purpose of this condition, however, is to address specific cases where a set of equivalent ground truth vectors exists and we only need to prove accuracy with respect to the closest point in this set. This relaxed accuracy requirement induces additional structure on the error vector that can be captured by (10). The bilinear regression problem discussed below in Section 4 is an example where it is important to have a non-trivial set S.
Under the assumptions stated in Section 1.1, the accuracy of the estimator (9) can be reduced to the existence of an appropriate uniform lower bound for the empirical process (1/n) Σ_{i=1}^n q_i(h) as a function of h. The following Lemma 1, proved in Section 3.2, provides the precise form of this reduction.
Lemma 1. Let x_0 be one of the possibly many vectors equivalent to x_⋆, meaning that f^+(x_0) − f^-(x_0) = f^+(x_⋆) − f^-(x_⋆) almost surely. Given a set S ⊆ S^{d−1}, recall the definition (4) and assume that an analog of the condition (8) with respect to x_0 holds, namely that (11) holds for some constant parameter ε ∈ (0, 1). Furthermore, suppose that (10) holds and that, for a certain absolute constant ε_0 ∈ [0, 1), the lower bound (12) holds for every h ∈ S. Then the estimate x̂ obeys the stated error bound. Again, in the generic case we choose S = S^{d−1}, x_0 = x_⋆, and ε_0 = 0 in Lemma 1. For the structured problems mentioned above, however, with a non-trivial choice of S in (10), we may need to choose x_0 ≠ x_⋆ and an appropriate ε_0 > 0.
Lemma 1 provides an error bound that is proportional to the average noise magnitude (1/n) Σ_{i=1}^n |ξ_i|. This dependence is satisfactory for a deterministic noise model where we ought to consider worst-case scenarios. However, we may obtain improved noise dependence for random noise models. In fact, simple modifications in the proof of Lemma 1 allow us to replace (1/n) Σ_{i=1}^n |ξ_i| in the error bound by certain suprema, over x, of noise-weighted empirical processes. These expressions may provide much tighter bounds when the noise is random with a well-behaved distribution. For instance, if ξ_1, . . ., ξ_n are i.i.d. zero-mean Gaussian random variables, the first expression reduces to the Gaussian complexity of an associated class of functions, which may be of order n^{−1/2}. To keep the exposition simple, we focus on the deterministic noise model in this paper.
Clearly, to prove accuracy of (9) through Lemma 1, establishing an inequality of the form (12) is crucial. Proposition 1 below can provide a guarantee for such an inequality in the case x_0 = x_⋆ and under the assumptions made in Section 1.

Related work
In a prior work [9], we considered the "convex regression" model, a special case of (1) with purely convex non-linearities (i.e., f_i^- ≡ 0 and f_i^+ ≡ f_i for convex functions f_i). With a slightly weaker approximation oracle that produces an anchor a_0 for which ⟨a_0, x_⋆⟩/(‖a_0‖‖x_⋆‖) is non-vanishing, statistical accuracy of estimation via the corresponding convex program is studied in [9]. The effect of convex regularization (e.g., ℓ_1-regularization) in structured estimation (e.g., sparse estimation) is also considered and analyzed in [9]. Evidently, the solution of the convex program above is insensitive to (positive) scaling of the anchor a_0. The estimator (9) is, however, sensitive to the scaling of a_0, which is a main reason for the need for a slightly stronger approximation oracle in this paper. An interesting example where the described convex regression applies is the phase retrieval problem that was previously studied in [8; 10; 19; 20].
As will be seen in Section 4, bilinear regression and the closely related blind deconvolution problem can be modeled by (1) as well. Succinctly, the goal in a bilinear regression problem is to recover signal components x^(1) and x^(2), up to the inevitable scaling ambiguity, from bilinear observations of the form ⟨a_i^(1), x^(1)⟩⟨a_i^(2), x^(2)⟩ for i = 1, . . ., n. In the context of blind deconvolution, solving such a system of bilinear equations in the lifted domain through nuclear-norm minimization has been analyzed in [3] and [7]. Despite their accuracy guarantees, the nuclear-norm minimization methods are practically not scalable to large problem sizes, which motivated the analysis of non-convex techniques (see, e.g., [29; 30]). Inspired by the results on the phase retrieval problem mentioned above, [1] proposed and analyzed a convex program for bilinear regression that operates in the natural space of the signals, thereby avoiding the prohibitive computational cost of the lifted convex formulations. Unlike the mentioned methods for phase retrieval, which only require a (directional) approximation of the ground truth, the proposed estimator in [1] requires exact knowledge of the signs of all of the multiplied linear forms ⟨a_i^(1), x^(1)⟩ (or ⟨a_i^(2), x^(2)⟩). This requirement is rather strong and may severely limit the applicability of the considered method. In Section 4, we look into the problem of bilinear regression as a special case of the general regression problem (1); under the common Gaussian model for the measurement vectors, we derive the sample complexity of (9) and explain an efficient method to construct an admissible vector a_0 using only the given observations.

Main proofs
There are various techniques under the umbrella of empirical process theory that can be employed to establish Proposition 1 and thereby Theorem 1; for instance, techniques relying on the concepts of VC dimension [40; 41] or Rademacher complexity [11; 24; 27], including the small-ball method [26; 35; 36], that are primarily developed in the field of statistical learning theory. In this paper, we use another common technique known as the PAC-Bayesian (or pseudo-Bayesian) method. This method, proposed in [33], has been successfully used for establishing various generalization bounds for classification [see, e.g., 16; 28; 33] and accuracy in regression problems [4-6; 14; 37]. The bounds obtained using this technique appear in different forms; we refer the interested reader to the survey paper [32] and the monograph [13] for a broader view of the related results and techniques. Our analysis below in Section 3.1 parallels that of [37], which in turn was inspired by [6]. The technical tools we use can be found in the PAC-Bayesian literature; we provide the proofs to make the manuscript self-contained. We emphasize that the novelty of this work is the general regression model (1) and the computationally efficient estimator (9), rather than the methods of analysis.
The core idea in PAC-Bayes theory is the variational inequality

E_{z∼μ}[φ(z)] ≤ D_KL(μ, ν) + log E_{z∼ν}[e^{φ(z)}] ,   (13)

valid for any bounded measurable function φ, where D_KL(μ, ν) = E_{z∼μ}[log(dμ(z)/dν(z))] denotes the Kullback-Leibler divergence (or relative entropy) between probability measures μ and ν with μ ≪ ν. In PAC-Bayesian analyses, the fact that this bound is deterministic and holds for any probability measure μ ≪ ν is leveraged to control the suprema of stochastic processes. In particular, for a stochastic process R(·) with domain X, we may approximate sup_{x∈X} R(x) by sup_{x∈X} E_{z∼μ_x} R(z) with respect to a certain set of probability measures μ_x indexed by the elements of X (e.g., with E_{z∼μ_x}(z) = x). Then, under some regularity conditions on the stochastic process, the approximate bound can be converted to an exact bound.
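Two standard Gaussian facts (recorded here for convenience; they are not displays from the manuscript) indicate how such smoothing measures are used in the proof below.

```latex
% Standard Gaussian facts (recorded for convenience; not displays from the text).
% For \gamma_h = N(h, \sigma^2 I_d) and \gamma_0 = N(0, \sigma^2 I_d):
\[
  \mathbb{E}_{z\sim\gamma_h}[z] = h ,
  \qquad
  D_{\mathrm{KL}}(\gamma_h, \gamma_0) = \frac{\|h\|^2}{2\sigma^2} .
\]
% A larger \sigma shrinks the KL penalty paid in the variational inequality (13)
% but increases the smoothing error incurred by replacing q_i(h) with
% \mathbb{E}_{z\sim\gamma_h} q_i(z); the proof of Proposition 1 balances these two
% effects through its choice of \sigma (and of the normalization \alpha).
```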

A PAC-Bayesian proof of Proposition 1
We use the PAC-Bayesian analysis to establish Proposition 1, the main ingredient in proving the accuracy of (9).
Proof of Proposition 1. For i = 1, . . ., n, define the truncated functions in (14), where [u]_{≤1} := min(u, 1) and α > 0 is a normalizing factor to be specified later. Furthermore, let γ_h denote the normal distribution with mean h and covariance σ²I, for a parameter σ > 0. By (13), for every h ∈ S ⊆ S^{d−1}, we have (15). Furthermore, by Markov's inequality, with probability ≥ 1 − δ/2 a corresponding bound holds in expectation, where the exchange of expectations on the second line is valid because the argument is a bounded function. Therefore, on the same event, because of (15), for every h ∈ S we obtain the inequality (16) after an equivalent rearrangement. The definition (14) and elementary bounds on u ↦ [u]_{≤1}, valid for all u ≥ 0, imply two further estimates, and using (3), (7), and the Cauchy-Schwarz inequality we also control the smoothing error. Thus, it follows from (16) that (17) holds. Applying Lemmas 2 and 3, stated and proved in the appendix, to (17) shows that, for all h ∈ S, with probability ≥ 1 − δ, the empirical process (1/n) Σ_{i=1}^n q_i(h) admits the desired lower bound. By rearranging the terms, and recalling (4), (5), and (7), we can choose α and σ to obtain the claimed inequality.
It is then straightforward to deduce the claimed bound with a sufficiently large hidden constant.

Proof of Lemma 1
Below we provide a proof of Lemma 1.
Proof of Lemma 1. By optimality of x̂ in (9), the objective value at x̂ is at least the objective value at x_0. Rearranging this comparison, invoking the assumption (11), and using the Cauchy-Schwarz inequality, we obtain an inequality that involves ‖x̂ − x_0‖, the noise terms, and the empirical process of the q_i in the direction of x̂ − x_0. Using the assumption that (12) holds and rearranging the terms once more, we conclude the stated error bound.

Application to bilinear regression
Suppose that the vectors x_⋆^(1) ∈ R^{d_1} and x_⋆^(2) ∈ R^{d_2} are observed through the bilinear measurements

y_i = ⟨a_i^(1), x_⋆^(1)⟩ ⟨a_i^(2), x_⋆^(2)⟩ + ξ_i ,   i = 1, . . ., n,   (18)

with known vector pairs (a_i^(1), a_i^(2)). Let x_⋆ denote the concatenation of x_⋆^(1) and x_⋆^(2). Similarly, for i = 1, . . ., n, let a_i^+ (respectively a_i^-) denote the concatenation of a_i^(1) and a_i^(2) (respectively a_i^(1) and −a_i^(2)). It is easy to verify that the bilinear measurements above can also be expressed in the form y_i = f_i^+(x_⋆) − f_i^-(x_⋆) + ξ_i, which is a special case of the DC observation model (1) with

f_i^±(x) = (1/4) ⟨a_i^±, x⟩² .   (19)

Denote the ℓ × ℓ identity matrix by I_ℓ. We assume that a_1^(1), a_2^(1), . . ., a_n^(1) are i.i.d. copies of a^(1) ∼ Normal(0, I_{d_1}), and are independent of a_1^(2), a_2^(2), . . ., a_n^(2), which are i.i.d. copies of a^(2) ∼ Normal(0, I_{d_2}). Naturally, we also denote the concatenation of a^(1) and a^(2) (respectively a^(1) and −a^(2)) by a^+ (respectively a^-). The functions f^± are defined analogously to f_i^±, with a^± replacing a_i^±. For brevity, we set d = d_1 + d_2. Furthermore, some of the unspecified constants in the derivations below are overloaded and may take different values from line to line. For any vector x ∈ R^d partitioned as x^(1) ∈ R^{d_1} and x^(2) ∈ R^{d_2}, we use the notation x^- to denote the concatenation of x^(1) and −x^(2).
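The following Python sketch (illustrative only; the variable names are ours, not from the text) draws data from this Gaussian model and numerically confirms the DC representation (19) of the bilinear measurements (18).

```python
import numpy as np

# Illustrative sketch; variable names are ours, not from the manuscript.
rng = np.random.default_rng(0)
d1, d2, n = 50, 50, 800

# Balanced ground truth (||x1|| = ||x2||), drawn from the unit spheres.
x1 = rng.standard_normal(d1); x1 /= np.linalg.norm(x1)
x2 = rng.standard_normal(d2); x2 /= np.linalg.norm(x2)
x_star = np.concatenate([x1, x2])

# Standard Gaussian measurement vectors and noiseless bilinear observations (18).
A1 = rng.standard_normal((n, d1))
A2 = rng.standard_normal((n, d2))
y = (A1 @ x1) * (A2 @ x2)

# DC form (19): y_i = f_i^+(x) - f_i^-(x) with f_i^{+/-}(x) = <a_i^{+/-}, x>^2 / 4,
# where a_i^+ = [a_i^(1); a_i^(2)] and a_i^- = [a_i^(1); -a_i^(2)].
A_plus = np.hstack([A1, A2])
A_minus = np.hstack([A1, -A2])
y_dc = 0.25 * ((A_plus @ x_star) ** 2 - (A_minus @ x_star) ** 2)

assert np.allclose(y, y_dc)  # the bilinear and DC representations agree
```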
Evidently, any reciprocal scaling of x_⋆^(1) and x_⋆^(2) is also consistent with the bilinear measurements (18) and will be considered a valid solution. Throughout this section, we choose x_⋆ to be a "balanced" solution, meaning that ‖x_⋆^(1)‖ = ‖x_⋆^(2)‖. Also, without loss of generality, we may assume ⟨a_0, x_⋆⟩ ≥ 0. The accuracy, however, is measured with respect to a closest consistent solution x̄, defined through the minimization in (20). To state the accuracy guarantees for (9) in the described bilinear regression problem, we first bound the important quantities given by (4), (5), (6), and (7) for the restriction set S specified in (21). This choice of S allows us to find a non-trivial bound for λ_D, and it is important in the proof of Theorem 2.

Quantifying λ_D, Λ_D, Γ_D, and η_D
Let h ∈ S be a vector partitioned into h^(1) ∈ R^{d_1} and h^(2) ∈ R^{d_2}. We can write E_D q(h) explicitly in terms of these partitions. Using the Cauchy-Schwarz inequality and Lemma 7 in the appendix, respectively, we obtain the upper and lower bounds collected in (22), and (22) in turn implies a non-trivial lower bound for λ_D over the restriction set S. Note that without the restriction of h to the prescribed set S in (21), we could have had λ_D = 0, which leads to vacuous bounds. We can also evaluate Γ_D directly from its definition (7). Furthermore, using the lower bound in (22), we can bound η_D. Thus, we are guaranteed non-trivial bounds on λ_D, Λ_D, Γ_D, and η_D in this model.

Accuracy guarantee
To prove accuracy of (9) in the considered bilinear regression problem, we need to apply Lemma 1 with x̄ given by (20) as the reference ground truth. Therefore, we also use Proposition 1 with a non-trivial restriction set S in our analysis to establish an inequality of the form (12). Because x̄ depends on the observations, however, it cannot be used as the reference ground truth in Proposition 1. Lemma 4 in the appendix shows that the bound obtained using Proposition 1, with the balanced ground truth (i.e., x_⋆) as the reference point and the restriction set (21), can be extended to the cases where other equivalent solutions are considered as the reference ground truth. For any t ∈ R\{0}, let D_t : R^d → R^d be the reciprocal scaling operator that maps x, the concatenation of x^(1) ∈ R^{d_1} and x^(2) ∈ R^{d_2}, to the concatenation of t x^(1) and t^{-1} x^(2). Furthermore, for θ ∈ [0, 1], we define the cone K_{t,θ} as in (26). This specific choice of the cone K_{t,θ} is important for the following reason: if, for some t_opt ∈ R\{0}, x̄ = D_{t_opt}(x_⋆) is the solution described by (20), then elementary calculus shows that x̂ − x̄ ∈ K_{t_opt,0}, which allows us to invoke Lemma 4.
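For intuition, the optimality condition behind this membership can be written out explicitly; the display below is our own verification under the assumption that D_t scales the first block by t and the second by t^{-1}.

```latex
% Our own verification, assuming D_t(x) = (t x^{(1)}, t^{-1} x^{(2)}).
% The map t -> ||D_t(x_\star) - \hat{x}||^2 is differentiable on R \ {0}, with
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\bigl\|D_t(x_\star)-\hat{x}\bigr\|^2
  \;=\; 2\,\bigl\langle\, t^{-1} D_t(x_\star^-),\; D_t(x_\star)-\hat{x} \,\bigr\rangle ,
\]
% so at an optimal scaling t_{opt} the derivative vanishes and
% <D_{t_opt}(x_\star^-), \hat{x} - \bar{x}> = 0 with \bar{x} = D_{t_opt}(x_\star),
% i.e., \hat{x} - \bar{x} lies in the cone K_{t_opt, 0}.
```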
The following theorem establishes the sample complexity of (9) for exact recovery in the noiseless bilinear regression problem.
Proof. Because of (27), we may assume that n is large enough for Lemma 6 to apply, where C > 0 is the constant in that lemma. With ε' = 6ε − 5, it follows from (40) in Lemma 6 that the corresponding approximation bound holds with probability ≥ 1 − δ/2. Furthermore, the approximations (23), (24), and (25) show that, because of (27), Proposition 1, with x_⋆ taken as the reference ground truth, guarantees the required uniform lower bound with probability ≥ 1 − δ/2. On the same event, provided that 2/3 ≤ |t_opt| ≤ 3/2, Lemma 4 extends this lower bound to all h ∈ S^{d−1} ∩ K_{t_opt,0}. Note that the expectations on the right-hand side are only with respect to f^±; the vector x̄ and the scalar t_opt should be treated as deterministic variables. Using (4), we obtain the analogous bound with λ_D in place of these expectations. It only remains to bound |t_opt| appropriately, not only to approximate min(|t_opt|^{-1}, |t_opt|), but also to satisfy the condition 2/3 ≤ |t_opt| ≤ 3/2 used previously. First, we show that ‖x̄ − x_⋆‖ is small through Lemma 6. Note that the previous application of Lemma 6, in which we had (28), also guarantees (39). Therefore, using the upper bound in (23), we can control the relevant quantities. Because x_⋆ is balanced, we also obtain a complementary estimate which, together with the previous inequality, bounds |t_opt| as required; here we use the fact that 1/4 ≤ ε' = 6ε − 5 ≤ 1. Using the derived bound in (30) then yields the remaining estimate. Hence, in view of (11), we may invoke Lemma 1 with x_0 = x̄ and appropriately chosen parameters in place of ε and ε_0, and prove the exact recovery (i.e., x̂ = x̄), which occurs with probability ≥ 1 − δ.

Approximation oracle
We provide a computationally tractable procedure that can serve as the approximation oracle discussed in Section 1.1 and requires no information other than the given measurements (18). We use the measurements to find an approximation a_0 of x_⋆/2 and show, by Lemma 5, that x_⋆/2 itself is an approximation of the gradient-average quantity appearing in (8). Let λ_max and v_max be, respectively, the leading eigenvalue and eigenvector of the data matrix S_n constructed from the measurements. The fact that S_n has an all-zero diagonal and is symmetric ensures that λ_max ≥ 0. We show that the vector a_0 defined by (31) meets the required condition (8) with high probability. To this end, we first show that S_n is well concentrated around its expectation. Observe that ⟨a_i^-, x_⋆⟩ = ⟨a_i^+, x_⋆^-⟩ and similarly ⟨a_i^-, x_⋆^-⟩ = ⟨a_i^+, x_⋆⟩. Thus, we obtain an explicit expression for E_D S_n. By the triangle inequality, we can write the deviation of S_n from E_D S_n in terms of four summands, each of which is small for a sufficiently large n. For example, as shown in [12, Lemma 7.4], if n ≥ C_τ d log d for a sufficiently large constant C_τ that depends only on τ ∈ (0, 1), then the first summand is appropriately small. Clearly, we can write similar inequalities for the other three summands and, by a simple union bound, conclude that (32) holds with probability ≥ 1 − c_τ d^{−2} for some constant c_τ depending only on τ. Recall that E_D S_n is proportional to x_⋆ ⊗ x_⋆ − x_⋆^- ⊗ x_⋆^-, and by the construction of x_⋆ and x_⋆^- we have ⟨x_⋆, x_⋆^-⟩ = 0. Thus x_⋆ and x_⋆^- are eigenvectors of E_D S_n. We may assume that ⟨v_max, x_⋆⟩ ≥ 0; otherwise we can simply use −x_⋆ as the target. Then, on the event (32), a variant of the Davis-Kahan theorem [43, Corollary 3] ensures that v_max is aligned with x_⋆/‖x_⋆‖ up to a small error. Since a_0 is defined by (31), we equivalently obtain a bound on the directional deviation of a_0 from x_⋆/2. Using (32), it is also easy to show that λ_max is close to its population counterpart. Therefore, we deduce that a_0 approximates x_⋆/2. It follows from Lemma 5 applied with x = x_⋆, together with (33) and (23), that the condition (8) holds provided that n is sufficiently large. Choosing an appropriate value for τ in terms of ε, the constant C_τ can also be made smaller than (1 − ε)/2, thereby guaranteeing (8).
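A Python sketch of this spectral construction is given below; the block form of S_n and the final scaling √(λ_max/2) (chosen so that ‖a_0‖ ≈ ‖x_⋆‖/2) are our reading of the construction, since the displays defining S_n and (31) are not reproduced here.

```python
import numpy as np

def approximation_oracle(A1, A2, y):
    """Spectral sketch of the anchor a_0 built only from the measurements.

    A1: (n, d1) array of a_i^(1); A2: (n, d2) array of a_i^(2); y: (n,) measurements.
    The block form of S_n and the scaling sqrt(lambda_max / 2) are our reading of
    the construction (the displays defining S_n and (31) are not reproduced here).
    """
    n, d1 = A1.shape
    d2 = A2.shape[1]
    # Off-diagonal block (1/n) * sum_i y_i a_i^(1) (a_i^(2))^T; the diagonal blocks
    # of S_n are zero, so trace(S_n) = 0 and its leading eigenvalue is non-negative.
    B = (A1 * y[:, None]).T @ A2 / n
    S = np.zeros((d1 + d2, d1 + d2))
    S[:d1, d1:] = B
    S[d1:, :d1] = B.T
    # Leading eigenpair of the symmetric matrix S_n (eigh returns ascending order).
    w, V = np.linalg.eigh(S)
    lam_max, v_max = w[-1], V[:, -1]
    # Under this normalization, E[S_n] = (x x^T - x^- (x^-)^T) / 2 for the balanced
    # ground truth x, so lam_max ~ ||x||^2 / 2 and v_max ~ x / ||x|| (up to sign);
    # rescaling gives an approximation of x / 2, the average-gradient target in (8).
    return np.sqrt(max(lam_max, 0.0) / 2.0) * v_max
```

For instance, with A1, A2, and y from the data-generation sketch earlier in this section, approximation_oracle(A1, A2, y) should align with x_star / 2 (up to sign) as n grows.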

Numerical Experiments
To evaluate the proposed method numerically, we ran 100 trials with the standard Gaussian measurements for each pair of d_1 = d_2 = d/2 ∈ {50, 100, 150} and n/d ∈ {5, 6, 7, 8, 9}. The signal pairs x^(1) and x^(2) are drawn independently from the d/2-dimensional unit sphere in each trial. We solved an equivalent form of (9), namely the quadratically constrained linear maximization (34) over the variables x and w, where 1_n denotes the n-dimensional all-one vector, using the Gurobi solver [22] through the CVX package [21]. This solver relies on an interior point method for solving the second-order cone program (SOCP) corresponding to (34). For better scalability, first-order methods, including stochastic and incremental methods, can be used to solve (9) directly. We did not intend in this paper to find the best convex optimization method for solving (9). Figure 1 shows the median of the relative error, computed after resolving the scaling ambiguity between the estimate and the ground truth.
The experiment suggests that the proposed method succeeds when the oversampling ratio is around eight (i.e., n ≈ 8(d 1 + d 2 ) = 8d).
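Since the display (34) is not reproduced above, the following CVXPY sketch shows only the general shape of such a quadratically constrained linear maximization: a linear anchor objective together with slack variables w, averaged through the all-one vector, that absorb the convex quadratics from the DC form. It is an assumed formulation for illustration, not necessarily the exact program (34).

```python
import numpy as np
import cvxpy as cp

def solve_bilinear_anchored(A1, A2, y, a0):
    """Hypothetical quadratically constrained linear maximization in the spirit
    of (34); the exact display is not reproduced in the text, so the objective and
    constraints below are an assumed form for illustration, not (34) itself."""
    n, d1 = A1.shape
    d2 = A2.shape[1]
    A_plus = np.hstack([A1, A2])     # rows a_i^+
    A_minus = np.hstack([A1, -A2])   # rows a_i^-

    x = cp.Variable(d1 + d2)
    w = cp.Variable(n)               # slack variables averaged via the all-one vector
    constraints = [
        0.25 * cp.square(A_plus @ x) - y <= w,   # f_i^+(x) - y_i <= w_i
        0.25 * cp.square(A_minus @ x) <= w,      # f_i^-(x)       <= w_i
    ]
    objective = cp.Maximize(a0 @ x - cp.sum(w) / n)
    cp.Problem(objective, constraints).solve()
    return x.value
```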
Proof. The triangle inequality and the subadditivity of u ↦ [u]_{≤1} over the non-negative reals yield the first bound. Clearly, [αq_i(h)]_{≤1} ≤ αq_i(h). Thus, we only need to bound the second term in the above inequality. This term is well concentrated around its expectation with respect to the data distribution. Namely, using the bounded difference inequality [34], with probability ≥ 1 − δ/2 we obtain the stated concentration bound for all h. Therefore, on this event we have the desired estimate, where the first inequality follows from (3) and the concavity of u ↦ [u]_{≤1}, and the second inequality follows from the fact that [u]_{≤1} ≤ u, the Cauchy-Schwarz inequality, and the definition (7).
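For completeness, one standard form of the bounded difference inequality reads as follows (quoted from the general literature rather than from [34], with the constants as usually stated).

```latex
% One standard form of the bounded difference (McDiarmid) inequality; constants
% as usually quoted in the literature, not copied from [34]. If z_1, ..., z_n are
% independent and F changes by at most c_i when its i-th argument is altered, then
\[
  \mathbb{P}\bigl( F(z_1,\dots,z_n) - \mathbb{E}\,F(z_1,\dots,z_n) \ge t \bigr)
  \;\le\; \exp\!\Bigl( -\tfrac{2 t^2}{\sum_{i=1}^{n} c_i^2} \Bigr)
  \qquad \text{for all } t > 0.
\]
% Each truncated term [\alpha q_i(\cdot)]_{\le 1} lies in [0, 1], so empirical
% averages of such terms have bounded differences with c_i \le 1/n.
```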

Lemma 3.
For all h, we have the bound stated in the display. Proof. It immediately follows from the triangle inequality (3) and the equivalence of z ∼ γ_h and z − h ∼ γ_0 that E_{z∼γ_h} q_i(z) can be bounded in terms of q_i(h) and E_{z∼γ_0} q_i(z), which by the assumption (6) and the definition (7) yields the desired bound.

A.2. Lemmas used in Section 4
Note that in the following lemmas the functions f_i^+ and f_i^- are defined as in (19).

Lemma 4. With K_{t,θ} defined by (26), suppose that the hypothesis stated in the display holds. Then, for all t ∈ R\{0} with 2/3 ≤ |t| ≤ 3/2 and all vectors h ∈ K_{t,0}, the corresponding conclusion holds with D_t(x_⋆) in place of x_⋆.

Proof. We have an algebraic identity relating the two bounds, valid for every h and t ∈ R\{0}. Furthermore, because h ∈ K_{t,0}, by definition ⟨D_t(x_⋆^-), h⟩ = 0, and the identity simplifies accordingly. Applying the bound 2/3 ≤ |t| ≤ 3/2 and the assumption of the lemma, and using (35), the claim follows.

We use standard matrix concentration inequalities to establish Lemmas 5 and 6 below. We can upper bound the deviation in (36) with probability ≥ 1 − δ/2 for a sufficiently large absolute constant C > 0; similarly, the bound (37) holds with probability ≥ 1 − δ/2. The first lemma below is an immediate consequence of these matrix concentration inequalities and is stated merely for reference.

Lemma 5. On the event that (36) and (37) hold, we have the stated approximation for every x ∈ R^d.

Proof. By definition, the quantity of interest can be written explicitly for any x. A simple application of the triangle inequality then reduces the claim to the matrix concentration inequalities (36) and (37), from which the result follows immediately.

Proof of Lemma 6. First we prove (39). By optimality of x̂ in (9), we can compare the objective values at x̂ and x_⋆. Subtracting the two, applying the Cauchy-Schwarz inequality to the first line, and the standard matrix concentration inequalities (36) and (37) in the third line, we obtain, with probability ≥ 1 − δ, a bound on the estimation error. Finally, it follows from the triangle inequality and the definition of x̄ in (20) that the distance from x̄ to x̂ is at most the distance from x_⋆ to x̂, which together with the previous bound guarantees (39). Next, we prove (40). By the triangle inequality and the fact that ∇f_i^±(x) is linear in x, we can reduce (40) to a bound at x_⋆. Recall, from the first part of the proof, that (36) and (37) hold with probability ≥ 1 − δ. Then, on the same event, Lemma 5 applies. Therefore, if n is sufficiently large to ensure that the right-hand side of (39) is non-negative, we deduce that (40) holds as well.

Lemma 6.
There exists an absolute constant C > 0 such that the bound (39) holds with probability ≥ 1 − δ. Furthermore, for a sufficiently large n, on the same event the bound (40), which involves the averaged gradients ∇f_i^+(x_⋆) + ∇f_i^-(x_⋆), holds as well.

Lemma 7.
For any matrix A and a standard normal random vector z (of appropriate dimension), we have the stated bound. Proof. The Euclidean and Frobenius norms as well as the standard normal distribution are rotationally invariant. Thus, the claim can be reduced to the case where A is diagonal with non-zero diagonal entries s_1, s_2, . . ., s_r.