Errors-in-variables models with dependent measurements

Suppose that we observe $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times m}$ in the following errors-in-variables model: \begin{eqnarray*} y&=&X_0 \beta^* +\epsilon \\ X&=&X_0 + W, \end{eqnarray*} where $X_0$ is an $n \times m$ design matrix with independent subgaussian row vectors, $\epsilon \in \mathbb{R}^n$ is a noise vector and $W$ is a mean zero $n \times m$ random noise matrix with independent subgaussian column vectors, independent of $X_0$ and $\epsilon$. This model is significantly different from those analyzed in the literature in the sense that we allow the measurement error for each covariate to be a dependent vector across its $n$ observations. Such error structures appear in the science literature when modeling the trial-to-trial fluctuations in response strength shared across a set of neurons. Under sparsity and restricted eigenvalue-type conditions, we show that one is able to recover a sparse vector $\beta^* \in \mathbb{R}^m$ from the model given a single observation matrix $X$ and the response vector $y$. We establish consistency in estimating $\beta^*$ and obtain the rates of convergence in the $\ell_q$ norm, where $q = 1, 2$. We show error bounds which approach those of the regular Lasso and the Dantzig selector as the errors in $W$ tend to 0. We analyze the convergence rates of gradient descent methods for solving the nonconvex programs and show that the composite gradient descent algorithm is guaranteed to converge at a geometric rate to a neighborhood of the global minimizers: the size of the neighborhood is bounded by the statistical error in the $\ell_2$ norm. Our analysis reveals interesting connections between computational and statistical efficiency and the concentration of measure phenomenon in random matrix theory. We provide simulation evidence illuminating the theoretical predictions.


Introduction
The matrix variate normal model has a long history in psychology and the social sciences. In recent years, it has become increasingly popular in biology and genomics, neuroscience, econometric theory, image and signal processing, wireless communication, and machine learning; see for example [15,22,17,52,5,54,18,2,26] and references therein. We call the random matrix $X$, which contains $n$ rows and $m$ columns, a single data matrix, or one instance from the matrix variate normal distribution. We say that an $n \times m$ random matrix $X$ follows a matrix normal distribution with a separable covariance matrix $\Sigma_X = A \otimes B$ and mean $M \in \mathbb{R}^{n \times m}$, which we write $X_{n \times m} \sim \mathcal{N}_{n,m}(M, A_{m \times m} \otimes B_{n \times n})$. This is equivalent to saying that $\mathrm{vec}(X)$ follows a multivariate normal distribution with mean $\mathrm{vec}(M)$ and covariance $\Sigma_X = A \otimes B$.
Here, $\mathrm{vec}(X)$ is formed by stacking the columns of $X$ into a vector in $\mathbb{R}^{mn}$. Intuitively, $A$ describes the covariance between columns of $X$, while $B$ describes the covariance between rows of $X$. See [15,22] for more characterization and examples.
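As a quick illustration of the separable-covariance characterization (this sketch is not from the paper; the dimensions and parameter values are hypothetical), drawing $X = B^{1/2} Z A^{1/2}$ with i.i.d. standard normal $Z$ gives $\mathrm{Cov}(\mathrm{vec}(X)) = A \otimes B$:

```python
import numpy as np

def sqrtm(S):
    """Symmetric square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def sample_matrix_normal(M, A, B, rng):
    """One draw X ~ N_{n,m}(M, A (x) B): A (m x m) is the column covariance,
    B (n x n) is the row covariance."""
    n, m = B.shape[0], A.shape[0]
    Z = rng.standard_normal((n, m))
    return M + sqrtm(B) @ Z @ sqrtm(A)

rng = np.random.default_rng(0)
n, m = 4, 3                                                        # hypothetical small sizes
A = 0.6 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))   # AR(1)-type column covariance
B = 0.3 * np.eye(n) + 0.1                                          # equicorrelated row covariance
M = np.zeros((n, m))

# Monte Carlo check that Cov(vec(X)) is close to A (x) B (vec stacks columns).
N = 20000
V = np.stack([sample_matrix_normal(M, A, B, rng).flatten(order="F") for _ in range(N)])
emp_cov = np.cov(V, rowvar=False)
print(np.max(np.abs(emp_cov - np.kron(A, B))))   # small, up to sampling error
```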
In this paper, we introduce the related sum of Kronecker product models to encode the covariance structure of a matrix variate distribution. The proposed models and methods incorporate ideas from recent advances in graphical models, high-dimensional regression models with observation errors, and matrix decomposition. Let $A_{m \times m}$, $B_{n \times n}$ be symmetric positive definite covariance matrices. Denote the Kronecker sum of $A = (a_{ij})$ and $B = (b_{ij})$ by
$$A \oplus B := A \otimes I_n + I_m \otimes B,$$
where $I_n$ is an $n \times n$ identity matrix and $I_m$ an $m \times m$ one. This covariance model arises naturally from the context of the errors-in-variables regression model defined as follows.
Suppose that we observe $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times m}$ in the following model:
$$y = X_0\beta^* + \epsilon, \qquad (1.1a)$$
$$X = X_0 + W, \qquad (1.1b)$$
where $X_0$ is an $n \times m$ design matrix with independent row vectors, $\epsilon \in \mathbb{R}^n$ is a noise vector and $W$ is a mean zero $n \times m$ random noise matrix, independent of $X_0$ and $\epsilon$, with independent column vectors $\omega_1, \ldots, \omega_m$.
In particular, we are interested in the additive model $X = X_0 + W$ with the Kronecker sum covariance structure
$$\mathrm{Cov}(\mathrm{vec}(X)) = A \otimes I_n + I_m \otimes B, \qquad (1.2)$$
where we use one covariance component $A \otimes I_n$ to describe the covariance of the matrix $X_0 \in \mathbb{R}^{n \times m}$, which is regarded as the signal matrix, and the other component $I_m \otimes B$ to describe that of the noise matrix $W \in \mathbb{R}^{n \times m}$, where $\mathbb{E}\,\omega_j \otimes \omega_j = B$ for all $j$, with $\omega_j$ denoting the $j$-th column vector of $W$. Our focus is on deriving the statistical properties of two estimators for estimating $\beta^*$ in (1.1a) and (1.1b) despite the presence of the additive error $W$ in the observation matrix $X$. We will show that our theory and analysis work with a model much more general than that in (1.2), which we will define in Section 1.1.
Before we go on to define our estimators, we now use an example to motivate (1.2) and its subgaussian generalization in (1.4). Suppose that there are n patients in a particular study, for which we use X_0 to model the "systolic blood pressure" and W to model the seasonal effects. In this case, X models the fact that, among the n patients we measure, each patient has his or her own row vector of observed blood pressures across time, and each column vector in W models the seasonal variation on top of the true signal at a particular day/time. Thus we consider X as a measurement of X_0 with W being the observation error. That is, we model the seasonal effects on blood pressures across a set of patients in a particular study with a vector of dependent entries. Thus W is a matrix which consists of repeated independent sampling of spatially dependent vectors, if we regard the individuals as having spatial coordinates, for example, through their geographic locations. We will come back to this example in Section 1.4.

The model and the method
We first need to define an independent isotropic vector with subgaussian marginals as in Definition 1.1. For a vector $y = (y_1, \ldots, y_p)$ in $\mathbb{R}^p$, denote by $\|y\|_2 = (\sum_j y_j^2)^{1/2}$ the length of $y$.

Definition 1.1. Let $Y$ be a random vector in $\mathbb{R}^p$.
1. $Y$ is called isotropic if for every $y \in \mathbb{R}^p$, $\mathbb{E}|\langle Y, y\rangle|^2 = \|y\|_2^2$.
2. $Y$ is $\psi_2$ with a constant $\alpha$ if for every $y \in \mathbb{R}^p$,
$$\|\langle Y, y\rangle\|_{\psi_2} := \inf\{t : \mathbb{E}\exp(\langle Y, y\rangle^2/t^2) \le 2\} \le \alpha\|y\|_2. \qquad (1.3)$$
The $\psi_2$ condition on a scalar random variable $V$ is equivalent to the subgaussian tail decay of $V$, which means $\mathbb{P}(|V| > t) \le 2\exp(-t^2/c^2)$ for all $t > 0$.
Throughout this paper, we use the terms $\psi_2$ vector, vector with subgaussian marginals, and subgaussian vector interchangeably.
The model. Let $Z$ be an $n \times m$ random matrix with independent entries $Z_{ij}$ satisfying $\mathbb{E} Z_{ij} = 0$, $1 = \mathbb{E} Z_{ij}^2 \le \|Z_{ij}\|_{\psi_2} \le K$. Let $Z_1, Z_2$ be independent copies of $Z$. Let
$$X = X_0 + W = Z_1 A^{1/2} + B^{1/2} Z_2, \qquad (1.4)$$
so that $X_0 = Z_1 A^{1/2}$ is the design matrix with independent subgaussian row vectors, and $W = B^{1/2} Z_2$ is a random noise matrix with independent subgaussian column vectors.
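The following minimal sketch (Gaussian case, hypothetical dimensions and a hypothetical sparse $\beta^*$) simulates data from the model (1.4) together with the errors-in-variables regression (1.1a):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 200, 400, 5          # hypothetical sizes: samples, covariates, sparsity

def sqrtm(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

# Column covariance A with tr(A) = m (as in (A1)) and row covariance B for the noise W.
A = 0.5 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))   # AR(1), unit diagonal
tau_B = 0.2
B = tau_B * np.eye(n)                                              # simplest choice, tr(B)/n = tau_B

Z1 = rng.standard_normal((n, m))
Z2 = rng.standard_normal((n, m))
X0 = Z1 @ sqrtm(A)               # independent rows, row covariance A
W  = sqrtm(B) @ Z2               # independent columns, column covariance B
X  = X0 + W                      # observed design, model (1.4)

beta_star = np.zeros(m)
beta_star[rng.choice(m, d, replace=False)] = rng.choice([-1.0, 1.0], d)
sigma_eps = 0.5
y = X0 @ beta_star + sigma_eps * rng.standard_normal(n)   # model (1.1a)
```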
Assumption (A1) allows the covariance model in (1.2) and its subgaussian variant in (1.4) to be identifiable.
(A1) We assume tr(A) = m is a known parameter, where tr(A) denotes the trace of matrix A.
In the Kronecker sum model, we could instead assume that $\mathrm{tr}(B)$ is known, in order not to assume knowledge of $\mathrm{tr}(A)$. Assuming that one or the other is known is unavoidable, as the covariance model is not identifiable otherwise. Moreover, under (A1) and the model (1.4), $\mathbb{E}\|X\|_F^2 = n\,\mathrm{tr}(A) + m\,\mathrm{tr}(B)$, which leads to the plug-in estimator
$$\widehat{\mathrm{tr}}(B) := \frac{1}{m}\big(\|X\|_F^2 - n\,\mathrm{tr}(A)\big)_+, \qquad \widehat{\tau}_B := \widehat{\mathrm{tr}}(B)/n, \qquad (1.5)$$
where $(a)_+ = a \vee 0$ and $\|X\|_F^2 := \sum_i\sum_j X_{ij}^2$. We first introduce the corrected Lasso estimator, adapted from those considered in [30].
Suppose that $\widehat{\mathrm{tr}}(B)$ is an estimator for $\mathrm{tr}(B)$; for example, as constructed in (1.5). Let
$$\widehat\Gamma = \frac{1}{n}X^T X - \frac{1}{n}\widehat{\mathrm{tr}}(B)\, I_m \quad \text{and} \quad \widehat\gamma = \frac{1}{n}X^T y. \qquad (1.6)$$
For a chosen penalization parameter $\lambda \ge 0$ and parameters $b_0$ and $d$, we consider the following regularized estimator with the $\ell_1$-norm penalty, which is a variation of the Lasso [48] or the Basis Pursuit [12] estimator:
$$\widehat\beta \in \arg\min_{\beta \in \mathbb{R}^m:\, \|\beta\|_1 \le b_0\sqrt{d}} \Big\{\frac{1}{2}\beta^T\widehat\Gamma\beta - \langle\widehat\gamma, \beta\rangle + \lambda\|\beta\|_1\Big\}. \qquad (1.7)$$
Although in our analysis we set $b_0 \ge \|\beta^*\|_2$ and $d = |\mathrm{supp}(\beta^*)| := |\{j : \beta^*_j \neq 0\}|$ for simplicity, in practice both $b_0$ and $d$ are understood to be parameters chosen to provide an upper bound on the $\ell_2$ norm and the sparsity of the true $\beta^*$.
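The next sketch computes the surrogate quantities in (1.6); it assumes the trace estimator takes the natural form implied by (A1) and (1.4) (my reading of (1.5)), and the function names are mine:

```python
import numpy as np

def corrected_gram(X, tr_A=None):
    """Surrogates (Gamma_hat, trB_hat) under (A1): E||X||_F^2 = n*tr(A) + m*tr(B).

    A sketch assuming the plug-in estimator trB_hat = (||X||_F^2 - n*tr(A))_+ / m,
    the natural unbiased choice when tr(A) = m is known.
    """
    n, m = X.shape
    if tr_A is None:
        tr_A = m                          # assumption (A1): tr(A) = m
    trB_hat = max(np.sum(X ** 2) - n * tr_A, 0.0) / m
    Gamma_hat = X.T @ X / n - (trB_hat / n) * np.eye(m)
    return Gamma_hat, trB_hat

def corrected_cross(X, y):
    return X.T @ y / X.shape[0]           # gamma_hat = X^T y / n, as in (1.6)

# usage with X, y from the simulation sketch above:
# Gamma_hat, trB_hat = corrected_gram(X); gamma_hat = corrected_cross(X, y)
```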

Gradient descent algorithms
In order to obtain fast, approximate solutions to the optimization problem in (1.10), we adopt the computational framework of [1,30], namely the composite gradient descent method due to Nesterov [34], to analyze our computational and statistical errors in an integrated manner. First we denote by $\mathcal{L}$ and $\mathcal{L}_n$ the population and empirical loss functions respectively; in particular,
$$\mathcal{L}_n(\beta) = \frac{1}{2}\beta^T\widehat\Gamma\beta - \langle\widehat\gamma, \beta\rangle. \qquad (1.9)$$
We consider regularizers that are separable across all coordinates, $\rho_\lambda(\beta) = \sum_{j=1}^m \rho_\lambda(\beta_j)$, and throughout this paper we denote the penalized loss by $\phi(\beta) = \mathcal{L}_n(\beta) + \rho_\lambda(\beta)$. From the formulation (1.7), the corrected linear regression estimator is given by minimizing the penalized loss function $\phi(\beta)$ subject to the constraint that $g(\beta) \le R$:
$$\widehat\beta \in \arg\min_{\beta \in \mathbb{R}^m,\, g(\beta) \le R} \big\{\mathcal{L}_n(\beta) + \rho_\lambda(\beta)\big\}, \qquad (1.10)$$
where $g(\beta)$ is a convex function, which is allowed to be identical to $\|\beta\|_1$, and $R$ is a second tuning parameter chosen to confine the solution $\widehat\beta$ within the $\ell_1$ ball of radius $R$, while at the same time ensuring that $\beta^*$ is a feasible solution. The gradient descent method generates a sequence $\{\beta^t\}_{t=0}^\infty$ of iterates by first initializing to some parameter $\beta^0 \in \mathbb{R}^m$, and then, for $t = 0, 1, 2, \ldots$, applying the recursive updates
$$\beta^{t+1} = \arg\min_{\beta \in \mathbb{R}^m,\, g(\beta) \le R}\Big\{\mathcal{L}_n(\beta^t) + \langle\nabla\mathcal{L}_n(\beta^t), \beta - \beta^t\rangle + \frac{\zeta}{2}\|\beta - \beta^t\|_2^2 + \rho_\lambda(\beta)\Big\}, \qquad (1.11)$$
where $\zeta$ is the step size parameter.
More generally, we consider loss functions $\mathcal{L}_n : \mathbb{R}^m \to \mathbb{R}$ and regularizers $\rho_\lambda$ which are possibly nonconvex, and consider the regularized M-estimator of the form
$$\widehat\beta \in \arg\min_{\beta \in \mathbb{R}^m,\, g(\beta) \le R}\{\mathcal{L}_n(\beta; X) + \rho_\lambda(\beta)\}, \qquad (1.12)$$
where $\rho_\lambda : \mathbb{R}^m \to \mathbb{R}$ is a regularizer depending on a tuning parameter $\lambda > 0$. Because of this potential nonconvexity, we also include a side constraint of the form $g(\beta) \le R$, where
$$g(\beta) := \frac{1}{\lambda}\Big\{\rho_\lambda(\beta) + \frac{\mu}{2}\|\beta\|_2^2\Big\}, \qquad (1.13)$$
so that this choice of $g$ is convex for a properly chosen parameter $\mu \ge 0$ for a class of weakly convex penalty functions $\rho$ [51]; see Assumption 1 in [31], where properties of $g$ and $\rho_\lambda$ are stated in terms of the univariate function $\rho_\lambda : \mathbb{R} \to \mathbb{R}$ and the parameter $\mu \ge 0$. While our results hold for general nonconvex penalties $\rho_\lambda$ that are weakly convex in the sense that $g$ in (1.13) is convex for some parameter $\mu > 0$, we focus our discussion on the choice $\rho_\lambda(\beta) = \lambda\|\beta\|_1$ and $\mu = 0$ in the present paper.
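A minimal sketch of the update (1.11) for the focus case $\rho_\lambda(\beta) = \lambda\|\beta\|_1$, $g(\beta) = \|\beta\|_1$ is given below (function names are mine and not from the paper). For this choice, the subproblem has a closed-form solution: soft-threshold the gradient step and then project onto the $\ell_1$ ball, which raises the threshold by the projection's multiplier.

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def project_l1(v, R):
    """Euclidean projection of v onto the l1 ball of radius R."""
    if np.sum(np.abs(v)) <= R:
        return v
    a = np.sort(np.abs(v))[::-1]
    css = np.cumsum(a)
    rho = np.nonzero(a - (css - R) / (np.arange(len(a)) + 1) > 0)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return soft_threshold(v, theta)

def composite_gradient(Gamma_hat, gamma_hat, lam, R, zeta, n_iter=500, beta0=None):
    """Composite gradient iterates (1.11) for the corrected Lasso loss
    L_n(beta) = 0.5 * beta' Gamma_hat beta - gamma_hat' beta, penalty lam*||beta||_1,
    and side constraint ||beta||_1 <= R.  A sketch; 1/zeta is the step size."""
    m = gamma_hat.shape[0]
    beta = np.zeros(m) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        grad = Gamma_hat @ beta - gamma_hat
        u = beta - grad / zeta
        beta = project_l1(soft_threshold(u, lam / zeta), R)   # exact solution of the update
    return beta
```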

Our contributions
We provide a unified analysis of the rates of convergence for both the corrected Lasso estimator (1.7) and the Conic programming estimator (1.8), which is a Dantzig selector-type estimator, although under slightly different conditions. We show the rates of convergence in the $\ell_q$ norm, $q = 1, 2$, for estimating a sparse vector $\beta^* \in \mathbb{R}^m$ in the model (1.1a) and (1.1b), using the corrected Lasso estimator (1.7) in Theorems 3 and 6, and the Conic programming estimator (1.8) in Theorems 4 and 7 for $1 \le q \le 2$. We also show bounds on the predictive errors for the Conic programming estimator. The bounds we derive in Theorems 3 and 4 focus on cases where the errors in $W$ are not too small in magnitude, in the sense that $\tau_B := \mathrm{tr}(B)/n$ is bounded from below. For the extreme case when $\tau_B$ approaches 0, one hopes to recover bounds close to those for the regular Lasso or the Dantzig selector, since the effect of the noise matrix $W$ on the procedure becomes negligible. We show in Theorems 6 and 7 that this is indeed the case. These results are new to the best of our knowledge.
In Theorems 3 to 7, we consider the regression model in (1.1a) and (1.1b) with subgaussian random design, where $X_0 = Z_1 A^{1/2}$ is a subgaussian random matrix with independent row vectors, and $W = B^{1/2} Z_2$ is an $n \times m$ random noise matrix with independent column vectors. This model is significantly different from those analyzed in the literature. For example, unlike the present work, the authors in [30] apply Theorem 16, which states a general result on statistical convergence properties of the estimator (1.7), to cases where $W$ is composed of independent subgaussian row vectors, when the row vectors of $X_0$ are either independent or follow a Gaussian vector autoregressive model. See also [35,36,3] for the corresponding results on the compensated MU selectors, variations on the Conic programming estimator (1.8).
The second key difference between our framework and the existing work is that we assume that only one observation matrix $X$ with a single measurement error matrix $W$ is available. Assuming (A1) allows us to estimate $\mathbb{E} W^T W$, as required in the estimation procedure (1.6), directly, given the knowledge that $W$ is composed of independent column vectors. In contrast, existing work needs to assume that the covariance matrix $\Sigma_W := \frac{1}{n}\mathbb{E} W^T W$ of the independent row vectors of $W$, or its functionals, are either known a priori, or can be estimated from a dataset independent of $X$, or from replicates of $X$ measuring the same $X_0$; see for example [35,36,3,30,10]. Although the model we consider is different from those in the literature, the identifiability issue, which arises from the fact that we observe the data under an additive error model, is common. Such repeated measurements are not always available and can be costly to obtain in practice [10]. We will explore such tradeoffs in future work.
A noticeable exception is the work of [11], which deals with the scenario where the noise covariance is not assumed to be known. We now elaborate on their result, which is a variant of the orthogonal matching pursuit (OMP) algorithm [49,50]. Their support recovery result, that is, recovering the support set of $\beta^*$, applies only to the case when both the signal matrix and the measurement error matrix have isotropic subgaussian row vectors. In other words, they assume independence among both rows and columns in $X$ ($X_0$ and $W$). Moreover, their algorithm requires knowledge of the sparsity parameter $d$, the number of non-zero entries in $\beta^*$, as well as a $\beta_{\min}$ condition: $\min_{j \in \mathrm{supp}(\beta^*)} |\beta^*_j| = \Omega\big(\sqrt{\tfrac{\log m}{n}}\,(\|\beta^*\|_2 + 1)\big)$. Under these conditions, they recover essentially the same $\ell_2$-error bounds as in the current work and in [30], where the covariance $\Sigma_W$ is assumed to be known.
Finally, we present in Theorems 2 and 9 the optimization error of the gradient descent algorithm for solving (1.12), and more specifically (1.7). Let $\widehat\beta$ be a global optimizer of (1.12). Let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ be the largest and smallest eigenvalues, and $\kappa(A)$ the condition number, of matrix $A$. Let $0 < \kappa < 1$ be a contraction factor to be defined in (2.11). Similar to the work of [1,30], we show that geometric convergence is not guaranteed to an arbitrary precision, but only to an accuracy related to the statistical precision of the problem, measured by the $\ell_2$ error $\|\widehat\beta - \beta^*\|_2^2 =: \varepsilon^2_{\mathrm{stat}}$ between the global optimizer $\widehat\beta$ and the true parameter $\beta^*$.
More precisely, our analysis guarantees geometric convergence of the sequence $\{\beta^t\}_{t=0}^\infty$ to $\beta^*$ up to a neighborhood whose radius is defined through the statistical error bound $\varepsilon^2_{\mathrm{stat}}$, where $\kappa$ is a contraction coefficient to be defined in (2.11); the guarantee holds for all $t \ge T^*(\delta)$ as in (2.17), with $\alpha \asymp \lambda_{\min}(A)$ and $\alpha_u \asymp \lambda_{\max}(A)$, for $\lambda$ and $\zeta \ge \alpha_u$ appropriately chosen, $R = O(\sqrt{n/\log m})$ and $n = \Omega(d\log m)$, where the $O(\cdot)$ and $\Omega(\cdot)$ symbols hide spectral parameters of $A$ and $B$. To quantify such results, we first need to introduce some conditions in Section 2. See Theorem 2 and Corollary 10 for the precise conditions and statements.

Discussion
The theory on matrix variate normal data shows that having replicates allows one to estimate more complicated graphical structures and to achieve faster rates of convergence under less restrictive assumptions [56]. Our consistency results in the present work deal with only a single random matrix following the model (1.4), assuming that $\mathrm{tr}(A)$ is known. With replicates, this assumption can be lifted immediately. Assume there exists a replicate $X'$ as in (1.14); then we can use $X - X' = W - W'$ to estimate $B$ using existing methods. The rationale for considering such an option is that one may have a repeated measurement of $X_0$ for which the errors $W$ and $W'$ follow the same error distribution. Such external data or knowledge of the noise distribution is needed in order to do inference under such an additive measurement error model [10].
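As an illustration of the replicate idea (a sketch under the assumption that $W$ and $W'$ are independent with independent columns of covariance $B$; the function name is mine), the column covariance of $X - X'$ is $2B$, so a simple moment estimator is:

```python
import numpy as np

def estimate_B_from_replicates(X, X_rep):
    """Sketch: with a replicate X' = X_0 + W' as in (1.14), the difference
    D = X - X' = W - W' has independent columns with covariance 2B, so
    D D^T / (2 m) is an unbiased estimate of B."""
    D = X - X_rep
    m = D.shape[1]
    return D @ D.T / (2.0 * m)
```

More refined estimators (e.g., thresholded or penalized versions) from the existing literature cited above could be applied to $X - X'$ in the same way.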
The second key modeling question is: would each row vector in $W$, for a particular patient across all time points, be a correlated normal or subgaussian vector as well? It is our conjecture that, by combining the newly developed techniques, namely the concentration of measure inequalities we have derived in the current framework, with techniques from existing work [56], we can handle the case where $W$ follows a matrix normal distribution with a separable covariance matrix $\Sigma_W = C \otimes B$, where $C$ is an $m \times m$ positive semi-definite covariance matrix. Moreover, for this type of "seasonal effects" measurement error, a time-varying covariance model would be more appropriate for $W$, which we elaborate on in the second example.
In neuroscience applications, population encoding refers to the information contained in the combined activity of multiple neurons [27]. The relationship between population encoding and correlations is complicated and is an area of active investigation; see for example [40,13]. It is increasingly common that repeated measurements (trials), recorded simultaneously across a set of neurons and over an ensemble of stimuli, are available.
In this context, one can use a random matrix $X_0 \sim \mathcal{N}_{n,m}(\mu, A \otimes B)$ which follows a matrix-variate normal distribution, or its subgaussian counterpart, to model the ensemble of mean response variables, e.g., the membrane potential, corresponding to the cross-trial average over a set of experiments. Here we use $A$ to model the task correlations and $B$ to model the baseline correlation structure among all pairs of neurons at the signal level. It has been observed that the onset of stimulus and task events not only changes the cross-trial mean response in $\mu$, but also alters the structure and correlation of the noise for a set of neurons, which corresponds to the trial-to-trial fluctuations of the neuron responses. We use $W$ to model such task-specific trial-to-trial fluctuations of a set of neurons recorded over the time-course of a variety of tasks. Models as in (1.1a) and (1.1b) are useful in predicting the response of a set of neurons based on the current and past mean responses of all neurons. Moreover, we could incorporate non-i.i.d. noise models, where $z(1), \ldots, z(m)$ are independent isotropic subgaussian random vectors and $B(t) \succeq 0$ for all $t$, to model the time-varying correlated noise observed in the trial-to-trial fluctuations. It is possible to combine the techniques developed in the present paper with those in [57,56] to develop estimators for $A$, $B$ and the time-varying $B(t)$, which is itself an interesting topic, however beyond the scope of the current work.
In summary, ignorance of $\Sigma_W$ and a general dependency structure in the data matrix $X$ are not simultaneously allowed in existing work. In contrast, while we assume that $X_0$ is composed of independent subgaussian row vectors, we allow the rows of $W$ to be dependent, which brings dependency to the row vectors of the observation matrix $X$.
In the current paper, we focus on the proof of concept of using the Kronecker sum covariance and additive model to capture two-way dependency in the data matrix $X$, and derive statistical and computational convergence bounds for (1.7) and (1.8). In some sense, we are considering a parsimonious model for fitting observation data with two-way dependencies: we use the signal matrix $X_0$ to encode column-wise dependency among covariates in $X$, and the error matrix $W$ to explain its row-wise dependency. When replicates of $X$ or $W$ are available, we are able to study more sophisticated models and inference problems, some of which are described earlier in this section.
We leave the investigation of this more general modeling framework and the relevant statistical questions to future work. We refer to [10] for an excellent survey of the classical as well as modern developments in measurement error models. In future work, we will also extend the estimation methods to settings where the covariates are measured with multiplicative errors, which are shown to be reducible to the additive error problem studied in the present work [36,30]. Moreover, we are interested in applying the analysis and concentration of measure results developed in the current paper and in our ongoing work to more general contexts and settings where measurement error models are introduced and investigated; see for example [16,8,44,24,20,45,9,7,14,46,25,28,47,53,23,29,32,2,43,41,42] and references therein.
Notation. Let $e_1, \ldots, e_p$ be the canonical basis of $\mathbb{R}^p$. For a set $J \subset \{1, \ldots, p\}$, denote $E_J = \mathrm{span}\{e_j : j \in J\}$. For a matrix $A$, we use $\|A\|_2$ to denote its operator norm. For a set $V \subset \mathbb{R}^p$, we let $\mathrm{conv}\, V$ denote the convex hull of $V$. For a finite set $Y$, the cardinality is denoted by $|Y|$. Let $B_1^p$, $B_2^p$ and $S^{p-1}$ be the unit $\ell_1$ ball, the unit Euclidean ball and the unit sphere respectively. For a matrix $A = (a_{ij})_{1 \le i,j \le m}$, let $\|A\|_{\max} = \max_{i,j}|a_{ij}|$ denote the entry-wise max norm. Let $\|A\|_1 = \max_j \sum_{i=1}^m |a_{ij}|$ denote the matrix $\ell_1$ norm. The Frobenius norm is given by $\|A\|_F^2 = \sum_{ij} a_{ij}^2$. Let $|A|$ denote the determinant and $\mathrm{tr}(A)$ the trace of $A$. The operator or $\ell_2$ norm $\|A\|_2^2$ is given by $\lambda_{\max}(AA^T)$. For a matrix $A$, denote by $r(A)$ the effective rank $\mathrm{tr}(A)/\|A\|_2$. Let $\|A\|_F^2/\|A\|_2^2$ denote the stable rank of matrix $A$. We write $\mathrm{diag}(A)$ for a diagonal matrix with the same diagonal as $A$. For a symmetric matrix $A$, let $\Upsilon(A) = (\upsilon_{ij})$ where $\upsilon_{ij} = I(a_{ij} \neq 0)$ and $I(\cdot)$ is the indicator function. Let $I$ be the identity matrix. For two numbers $a, b$, $a \wedge b := \min(a, b)$ and $a \vee b := \max(a, b)$. For a function $g : \mathbb{R}^m \to \mathbb{R}$, we write $\nabla g$ to denote a gradient or subgradient, if it exists. We write $a \asymp b$ if $ca \le b \le Ca$ for some positive absolute constants $c, C$ which are independent of $n$, $m$ or sparsity parameters. Let $(a)_+ := a \vee 0$. We write $a = O(b)$ if $a \le Cb$ for some positive absolute constant $C$ which is independent of $n$, $m$ or sparsity parameters. The absolute constants $C, C_1, c, c_1, \ldots$ may change from line to line.

Assumptions and preliminary results
We will now define some parameters related to the restricted and sparse eigenvalue conditions that are needed to state our main results. We also state a preliminary result, Lemma 1, on the relationship between the two conditions in Definitions 2.1 and 2.2.

Definition 2.1. (Restricted eigenvalue condition $RE(s_0, k_0, A)$). Let $1 \le s_0 \le p$, and let $k_0$ be a positive number. We say that a $q \times p$ matrix $A$ satisfies the $RE(s_0, k_0, A)$ condition with parameter $K(s_0, k_0, A)$ if for any $\upsilon \neq 0$,
$$\frac{1}{K(s_0, k_0, A)} := \min_{\substack{J \subseteq \{1,\ldots,p\},\ |J| \le s_0}}\ \min_{\substack{\upsilon \neq 0,\ \|\upsilon_{J^c}\|_1 \le k_0\|\upsilon_J\|_1}} \frac{\|A\upsilon\|_2}{\|\upsilon_J\|_2} > 0,$$
where $\upsilon_J$ represents the subvector of $\upsilon \in \mathbb{R}^p$ confined to a subset $J$ of $\{1, \ldots, p\}$.
It is clear that when $s_0$ and $k_0$ become smaller, this condition is easier to satisfy. We also consider the following variation of the baseline RE condition.

Definition 2.2. (Lower-RE condition) [30] The matrix $\Gamma$ satisfies a Lower-RE condition with curvature $\alpha > 0$ and tolerance $\tau > 0$ if
$$\theta^T\Gamma\theta \ge \alpha\|\theta\|_2^2 - \tau\|\theta\|_1^2 \quad \text{for all } \theta \in \mathbb{R}^m,$$
where $\|\theta\|_1 := \sum_j|\theta_j|$. As $\alpha$ becomes smaller, or as $\tau$ becomes larger, the Lower-RE condition is easier to satisfy.

Lemma 1. Suppose that the Lower-RE condition holds for $\Gamma$; then $RE(s_0, k_0, A)$ holds with $s_0 = (k_0+1)^2$, provided $\tau > 0$ is small enough. Conversely, assume that $RE((k_0+1)^2, k_0, A)$ holds; then the Lower-RE condition holds for $\Gamma$, for any $\tau \ge \frac{4(k_0+1)^3}{K^2(s_0, k_0, A)} - \frac{4\lambda_{\min}(\Gamma)}{(k_0+1)^2}$.

The first part of Lemma 1 means that, if $k_0$ is fixed, then smaller values of $\tau$ guarantee that $RE(s_0, k_0, A)$ holds with larger $s_0$, that is, a stronger RE condition. The second part of the lemma implies that a weak RE condition implies that the Lower-RE (LRE) condition holds with a large $\tau$. On the other hand, if one assumes that $RE((k_0+1)^2, k_0, A)$ holds with a large value of $k_0$ (in other words, a strong RE condition), this implies LRE with a small $\tau$. In short, the two conditions are similar up to a tuning of the parameters: a weaker RE condition implies that LRE holds with a larger $\tau$, and a stronger RE condition implies LRE with a smaller $\tau$. We prove Lemma 1 in Section 9.

Definition 2.3. (Upper-RE condition) [30] The matrix $\Gamma$ satisfies an Upper-RE condition with smoothness $\bar\alpha > 0$ and tolerance $\tau > 0$ if
$$\theta^T\Gamma\theta \le \bar\alpha\|\theta\|_2^2 + \tau\|\theta\|_1^2 \quad \text{for all } \theta \in \mathbb{R}^m.$$

Definition 2.4. Define the largest and smallest $d$-sparse eigenvalues of a $p \times q$ matrix $A$ to be
$$\rho_{\max}(d, A) := \max_{t \neq 0,\ d\text{-sparse}} \frac{\|At\|_2^2}{\|t\|_2^2}, \qquad \rho_{\min}(d, A) := \min_{t \neq 0,\ d\text{-sparse}} \frac{\|At\|_2^2}{\|t\|_2^2}.$$

Before stating a general result for the optimization program (1.12) and its implications for the Lasso-type estimator (1.7) in terms of statistical and optimization errors, we need to introduce some more notation and the following assumptions. Let $a_{\max} = \max_i a_{ii}$ and $b_{\max} = \max_i b_{ii}$ be the maximum diagonal entries of $A$ and $B$ respectively. In general, under (A1), one can think of $\lambda_{\min}(A) \le 1$, and the bound in (2.5) holds for $s \ge 1$, where $\lambda_{\max}(A)$ denotes the maximum eigenvalue of $A$.
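The following diagnostic sketch (not from the paper; names and parameters are mine) checks the Lower-RE inequality of Definition 2.2 on random sparse test vectors for a given surrogate matrix $\widehat\Gamma$. It is only a Monte Carlo sanity check, not a certificate over all of $\mathbb{R}^m$:

```python
import numpy as np

def check_lower_RE(Gamma_hat, alpha, tau, n_trials=2000, sparsity=10, seed=0):
    """Monte Carlo check of the Lower-RE inequality
        theta' Gamma_hat theta >= alpha * ||theta||_2^2 - tau * ||theta||_1^2
    over random sparse test vectors."""
    rng = np.random.default_rng(seed)
    m = Gamma_hat.shape[0]
    worst = np.inf
    for _ in range(n_trials):
        theta = np.zeros(m)
        idx = rng.choice(m, sparsity, replace=False)
        theta[idx] = rng.standard_normal(sparsity)
        lhs = theta @ Gamma_hat @ theta
        rhs = alpha * theta @ theta - tau * np.sum(np.abs(theta)) ** 2
        worst = min(worst, lhs - rhs)
    return worst   # nonnegative if the inequality held on all sampled directions
```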
(A2) The minimal eigenvalue $\lambda_{\min}(A)$ of the covariance matrix $A$ is bounded away from 0. (A3) Moreover, we assume that the condition number $\kappa(A)$ is upper bounded by $O(\sqrt{n/\log m})$ and that $\tau_B = O(\lambda_{\max}(A))$.
Throughout the rest of the paper, $s_0 \ge 32$ is understood to be the largest integer chosen such that the inequality in (2.6) still holds, where we denote by $\tau_B = \mathrm{tr}(B)/n$ and $C$ is a constant to be defined. Throughout this paper, we denote by $\mathcal{A}_0$ the event that the modified gram matrix $\widehat\Gamma$ as defined in (1.6) satisfies the Lower as well as the Upper RE conditions with tolerance $\asymp \frac{\log m}{n}$, for $\alpha$, $\bar\alpha$ and $\tau$ as defined in Definitions 2.2 and 2.3, and $C$, $s_0$ as in (2.6).
To bound the optimization error, we show that the corrected linear regression loss function (1.9) satisfies the following Restricted Strong Convexity (RSC) and Restricted Smoothness (RSM) conditions when the sample size and the effective rank of matrix $B$ satisfy certain lower bounds (cf. Theorem 3); namely, we show that for all vectors $\beta_0, \beta_1 \in \mathbb{R}^m$ and for some parameters $(\alpha_\ell, \tau_\ell(\mathcal{L}_n))$ and $(\alpha_u, \tau_u(\mathcal{L}_n))$,
$$\mathcal{T}_{\mathcal{L}}(\beta_1, \beta_0) := \mathcal{L}_n(\beta_1) - \mathcal{L}_n(\beta_0) - \langle\nabla\mathcal{L}_n(\beta_0), \beta_1 - \beta_0\rangle \ge \alpha_\ell\|\beta_1 - \beta_0\|_2^2 - \tau_\ell(\mathcal{L}_n)\|\beta_1 - \beta_0\|_1^2 \quad \text{(RSC)},$$
$$\mathcal{T}_{\mathcal{L}}(\beta_1, \beta_0) \le \alpha_u\|\beta_1 - \beta_0\|_2^2 + \tau_u(\mathcal{L}_n)\|\beta_1 - \beta_0\|_1^2 \quad \text{(RSM)}.$$
Applied to (1.12), the composite gradient descent procedure of [34] produces a sequence of iterates $\{\beta^t\}_{t=0}^\infty$ via the updates (1.11), where $1/\zeta$ is the step size. Let $\nu = 64 d\,\tau_\ell(\mathcal{L}_n)$ and $\bar\alpha := \alpha_\ell - \nu$. We show that the composite gradient updates exhibit a type of globally geometric convergence in terms of the compound contraction coefficient $\kappa$ defined in (2.11). For simplicity, we present in Theorem 2 the case $\rho_\lambda(\beta) = \lambda\|\beta\|_1$ only.

where $\nu = 64 d\,\tau_\ell(\mathcal{L}_n)$ and $\tau_\ell(\mathcal{L}_n) \asymp \frac{\log m}{n}$. Compared with [31], we simplify the condition on $\lambda$ by not imposing an upper bound. Moreover, we present a refined analysis of the sample size requirement and illuminate its dependence upon the condition number $\kappa(A)$ and the tolerance parameter $\tau$ when applied to the corrected linear regression problem (1.10). It is understood throughout the paper that (2.7) holds for the same $C$, and it is helpful to consider $M_A$ as being upper bounded by $O(\kappa(A))$ in view of (2.5) and (A3). Toward this end, we prove in Section 5 that under event $\mathcal{A}_0 \cap \mathcal{B}_0$, the RSC and RSM conditions as stated in Theorem 2 hold with $\alpha_\ell \asymp \lambda_{\min}(A)$, $\alpha_u \asymp \lambda_{\max}(A)$ and $\tau_\ell(\mathcal{L}_n) = \tau_u(\mathcal{L}_n) \asymp \tau$; the conclusion then holds for all $t \ge T^*(\delta)$ as defined in (2.17) and for $\delta^2$ as specified there, where $0 < \kappa < 1$, so long as $\zeta \asymp \lambda_{\max}(A)$ and $n = \Omega(\kappa(A) M_A^2 d\log m)$. We now check the conditions on $\lambda$ in Theorem 2. First, we note that both types of conditions on $\lambda$ are also required in the present paper for the statistical error bounds shown in Theorems 3 and 6. We state in Theorem 16 a deterministic result from [30] on the statistical error for the corrected linear model, which requires that $\lambda$ be chosen as in (2.18) and that $d\tau \le \alpha/32$, in order to obtain the statistical error bound for the corrected linear model at the order stated there. Under suitable conditions on the sample size $n$ and the effective rank of matrix $B$, to be stated in Theorem 3, we show that for the loss function (1.9) the RSC and RSM conditions hold under event $\mathcal{A}_0$ (cf. Lemma 15), following the Lower and Upper-RE conditions as derived in Lemma 15. Compared with the lower bound imposed on $\lambda$ in (2.20) that we use to derive statistical error bounds, the penalty now involves a term $\frac{\xi}{1-\kappa}$ that crucially depends on the condition number $\kappa(A)$ in (2.13). Assuming that $\zeta \ge \alpha_u$, the second condition in (2.13) on $\lambda$ implies that $\lambda = \Omega(R\,\tau_\ell(\mathcal{L}_n)\,\kappa(A))$, which now depends explicitly on the condition number $\kappa(A)$ in addition to the radius $R \asymp b_0\sqrt{d}$ and the tolerance parameter $\tau$. This is expected given that both the RSC and RSM conditions are needed in order to derive the computational convergence bounds, while for the statistical error we only require the RSC (Lower RE) condition to hold.
Remarks. Consider the regression model in (1.1a) and (1.1b) with independent random matrices $X_0$, $W$ as in (1.4), and an error vector $\epsilon \in \mathbb{R}^n$ independent of $X_0$, $W$, with independent entries $\epsilon_j$ satisfying $\mathbb{E}\epsilon_j = 0$ and $\|\epsilon_j\|_{\psi_2} \le M$. Theorem 12 and its corollaries provide an upper bound on the $\ell_\infty$ norm of the gradient $\nabla\mathcal{L}_n(\beta^*) = \widehat\Gamma\beta^* - \widehat\gamma$ of the loss function in the corrected linear model, where $\widehat\Gamma$ and $\widehat\gamma$ are as defined in (1.6). The bound (2.15) characterizes the excess loss $\phi(\beta^t) - \phi(\widehat\beta)$ for solving (1.7) using the composite gradient algorithm; moreover, for any iterate $\beta^t$ such that (2.15) holds, a corresponding bound on the optimization error $\|\beta^t - \widehat\beta\|_2$ follows immediately, where $\nu = 64 d\,\tau_\ell(\mathcal{L}_n)$ and $4\tau_\ell(\mathcal{L}_n)\bar\varepsilon^2 = 64\,\tau_\ell(\mathcal{L}_n)\bar\delta^4/\lambda^2$ by the definition of $\bar\varepsilon^2$ in view of (2.21). Finally, we note that Theorem 2 holds for a class of weakly convex penalties as considered in [31], with suitable adaptation of the RSC parameters and conditions to involve $\mu$, following exactly the same sequence of arguments. Notable examples of such weakly convex penalty functions are the SCAD [19] and MCP [55] penalties.
The rest of the paper is organized as follows. In Section 3, we present two main results in Theorems 3 and 4. In Section 4, we state more precise results which improve upon Theorems 3 and 4; these results are more precise in the sense that our bounds and penalty parameters now take $\mathrm{tr}(B)$, the parameter that measures the magnitude of the errors in $W$, into consideration. In Section 5, we show that the RSC and RSM conditions hold for the corrected linear loss function and present our computational convergence bounds with regard to (1.7) in Theorem 9 and Corollary 10. In Section 6, we outline the proofs of the main theorems; in particular, we outline the proofs of Theorems 3, 6, 4 and 7 in Sections 6.2, 6.3, 6.4 and 6.5 respectively. In Section 7, we show a deterministic result as well as its application to the random matrix $\widehat\Gamma - A$, for $\widehat\Gamma$ as in (1.6), with regard to the Upper and Lower RE conditions. In Section 8, we present results from numerical simulations designed to validate the theoretical predictions in previous sections. The technical details of the proofs are collected at the end of the paper: we prove Theorem 3 in Section 10, Theorem 4 in Section 11, and Theorems 6 and 7 in Sections 12 and 13 respectively. We defer the proof of Theorem 2 to Section B. The paper concludes with a discussion of the results in Section 16. We list a set of symbols used throughout the paper in Table 1. Additional proofs and theoretical results are collected in the Appendix.

Main results on the statistical error
In this section, we state our main results in Theorems 3 and 4, where we consider the regression model in (1.1a) and (1.1b) with random matrices $X_0, W \in \mathbb{R}^{n \times m}$ as defined in (1.4). For the corrected Lasso estimator, we are interested in the case where the smallest eigenvalue of the column-wise covariance matrix $A$ does not approach 0 too quickly and the effective rank of the row-wise covariance matrix $B$ is bounded from below (cf. (3.2)). (A2) thus ensures that the Lower-RE condition as in Definition 2.2 is not vacuous, while (A3) ensures that (2.6) holds for some $s_0 \ge 1$.
Throughout this paper, for the corrected Lasso estimator, we will use the expression $\psi$ defined below, where $M_A$ is as defined in (2.7). Let $b_0$, $\phi$ be numbers which satisfy (3.3). Assume that the sparsity of $\beta^*$ satisfies (3.4) for some $0 < \phi \le 1$. Let $\widehat\beta$ be an optimal solution to the corrected Lasso estimator as in (1.7) with $\lambda$ chosen as in (3.6). Then for any $d$-sparse vector $\beta^* \in \mathbb{R}^m$ satisfying (3.7), the error bounds of Theorem 3 hold with probability at least $1 - 4\exp(-c_3 n)$. We give an outline of the proof of Theorem 3 in Section 6.2 and prove it in Section 10. We defer discussion of the conditions appearing in Theorem 3 to Section 3.2.
For the Conic programming estimator, we impose a restricted eigenvalue condition on $A$ as formulated in [4,38] and assume that the sparsity of $\beta^*$ is bounded by $o(\sqrt{n/\log m})$. These conditions will be relaxed in Section 4, where we allow $\tau_B$ to approach 0. Theorem 4. Suppose (A1) holds. Set $0 < \delta < 1$. Suppose that $n < m \le \exp(n)$ and $1 \le d_0 < n$. Let $\lambda > 0$ be the same parameter as in (1.8).
Suppose that the sparsity of $\beta^*$ is bounded as in (3.10). Consider the regression model in (1.1a) and (1.1b) with $X_0$, $W$ as in (1.4) and an error vector $\epsilon \in \mathbb{R}^n$, independent of $X_0$, $W$, with independent entries $\epsilon_j$ satisfying $\mathbb{E}\epsilon_j = 0$ and $\|\epsilon_j\|_{\psi_2} \le M$. Let $\widehat\beta$ be an optimal solution to the Conic programming estimator as in (1.8) with input $(\widehat\gamma, \widehat\Gamma)$ as defined in (1.6). Recall $\tau_B := \mathrm{tr}(B)/n$. Choose, for $D_0$, $D_2$ as in (3.1),
$$\mu \asymp D_2 K^2\sqrt{\frac{\log m}{n}} \quad \text{and} \quad \omega \asymp D_0 K M\sqrt{\frac{\log m}{n}}.$$
Then with probability at least $1 - c'$, the bounds in (3.11) hold for $2 \ge q \ge 1$. Under the same assumptions, the predictive risk admits the bounds stated in Theorem 4 with the same probability, where $c', C_0, C, C' > 0$ are some absolute constants.
We give an outline of the proof of Theorem 4 in Section 6.4, while deferring the detailed proof to Section 11.

Regarding the M A constant
Denote by $M_A$ the constant as in (2.7).
• The condition (3.4) in Theorem 3 allows $d \asymp n/\log m$ in the optimal setting when the condition number $\kappa(A)$ is understood to be a constant. As $\kappa(A)$ increases, the conservative worst-case upper bound on $d$ needs to be adjusted correspondingly. Moreover, this adjustment is also crucial in order to ensure that the composite gradient algorithm converges in the sense of Theorem 2. We will illustrate such dependencies on $\kappa(A)$ in numerical examples in Section 8.
• The condition τ B = O(λ max (A)) puts an upper bound on how large the measurement error in W can be. We do not allow the measurement error to overwhelm the signal entirely. When τ B → 0, we recover the ordinary Lasso bound in [4], which we elaborate in the next two sections.

Discussions
Throughout our analysis, we set the parameter $b_0 \ge \|\beta^*\|_2$ and $d = |\mathrm{supp}(\beta^*)| := |\{j : \beta^*_j \neq 0\}|$ for the corrected Lasso estimator. In practice, both $b_0$ and $d$ are understood to be parameters chosen to provide an upper bound on the $\ell_2$ norm and the sparsity of the true $\beta^*$. The parameter $0 < \phi < 1$ describes the gap between $\|\beta^*\|_2^2$ and its upper bound $b_0^2$. Denote the signal-to-noise ratio by $S/N$, where $S := K^2\|\beta^*\|_2^2$ and $N := M^2$. The two conditions (3.3) and (3.7) on $b_0$ and $\phi$ imply that $N \le K^2\phi b_0^2 \le S$. Notice that this could be restrictive if $\phi$ is small. We will show in Section 6.2 that condition (3.3) is not needed in order for the $\ell_p$, $p = 1, 2$, error bounds stated in Theorem 3 to hold. It was indeed introduced so as to further simplify the expression for the condition on $d$ as shown in (3.4). Therefore we provide slightly more general conditions on $d$ in (6.9) in Lemma 17, where (3.3) is not required. We introduce the parameter $\phi$ so that the conditions on $d$ depend on $\phi$ and $b_0^2$ rather than on the true signal $\|\beta^*\|_2$ (cf. proofs of Lemmas 17 and 18). It will also become clear in the sequel, from the proof of Lemma 17 (cf. (H.4)), that we could use $\|\beta^*\|_2$ rather than its lower bound $b_0^2\phi$ in the expression for $d$. However, we choose to state the condition on $d$ as in Theorem 3 for clarity of our exposition. See also Theorem 6 and Lemma 18.
In fact, we prove that Theorem 3 holds with $N = M^2$ and $S = \phi K^2 b_0^2$ in arbitrary relative order, so long as conditions (3.2) and (3.4) or (6.9) hold. For both cases, we require $\lambda$ as expressed in (3.6). That is, when either the noise level $M$ or the signal strength $K\|\beta^*\|_2$ increases, we need to increase $\lambda$ correspondingly; moreover, when $N$ dominates the signal $K^2\|\beta^*\|_2^2$, the resulting condition on $d$ eventually becomes a vacuous bound when $N \gg S$. This bound appears a bit crude as it does not fully discriminate between the noise, the measurement error, and the signal strength. We further elaborate on the relationships among these three elements in Section 4 and then present an improved bound in Theorem 6.
1. The choice of λ for the Lasso estimator and parameters µ, ω for the DS-type estimator satisfy Corollaries 13 and 14, which are the key results in proving Theorems 3, 4, 6, and 7.
2. Throughout our analysis of Theorems 3 and 4, the error bounds are stated under the assumption that the errors in $W$ are sufficiently large, in the sense that these bounds are optimal only when $\tau_B$ is bounded from below by some absolute constant. For example, when $\|B\|_2$ is bounded away from 0, the lower bound on the effective rank $r(B) = \mathrm{tr}(B)/\|B\|_2$ implies that $\tau_B$ must also be bounded away from 0; more precisely, this follows from the condition on the effective rank in (3.2). Later, we will state our results with $\tau_B = \mathrm{tr}(B)/n > 0$ explicitly included in the error bounds as well as in the penalization parameters and sparsity constraints.
3. In view of the main Theorems 3 and 4, at this point we do not really consider one estimator preferable to the other. While the $\ell_q$ error bounds we obtain for the two estimators are of the same order for $q = 1, 2$, the conditions under which these error bounds are obtained are somewhat different. In Theorem 4, we only require that $RE(2d_0, 3k_0, A^{1/2})$ holds for $k_0 = 1 + \lambda$ where $\lambda \asymp 1$, while in Theorem 3 we need the minimal eigenvalue of $A$ to be bounded from below, namely, we need to assume that (A2) holds. As mentioned earlier, (A2) ensures that the Lower-RE condition as in Definition 2.2 is not vacuous, while (A3) ensures that (2.6) holds for some $s_0 \ge 1$. The condition (3.2) on the effective rank of the row-wise covariance matrix $B$ is also needed to establish the Lower and Upper RE conditions in Lemma 15 for the corrected Lasso estimator. Moreover, for the sparsity parameter $d_0$ in (3.8), we show in Lemma 34 that (A2) is a sufficient condition for a type of $RE(2d_0, 3k_0)$ condition to hold on the non-positive-definite $\widehat\Gamma$ as defined in (1.6). See also Theorem 26.
4. In some sense, the assumptions in Theorem 3 appear to be slightly stronger, while at the same time yielding correspondingly stronger results in the following sense: the corrected Lasso procedure can recover a sparse model using $O(\log m)$ measurements per nonzero component despite the measurement error in $X$ and the stochastic noise $\epsilon$, while the Conic programming estimator allows only $d \lesssim \sqrt{n/\log m}$ to achieve an error rate of the same order as the corrected Lasso estimator. Hence, while the Conic programming estimator is conceptually more adaptive by not fixing an upper bound on $\|\beta^*\|_2$ a priori, the price we pay seems to be a more stringent upper bound on the sparsity level.
5. We note that, following Theorem 2 as in [3], one can show that without the relatively restrictive sparsity condition (3.8), a bound similar to that in (3.11) holds, however with $\|\beta^*\|_2$ replaced by $\|\beta^*\|_1$, so long as the sample size satisfies the condition in (4.9). However, we show in Theorem 7 in Section 6.5 that this restriction on the sparsity can be relaxed for the Conic programming estimator (1.8) when we make a different choice for the parameter $\mu$ based on a more refined analysis.
6. Results similar to Theorems 3 and 4 have been derived in [30,3], however, under different assumptions on the distribution of the noise matrix $W$. When $W$ is a random matrix with i.i.d. subgaussian noise, our results in Theorems 3 and 4 essentially recover the results in [30] and [3]. We compare with their results in Section 4, in case $B = \tau_B I$, after we present our improved bounds in Theorems 6 and 7. We refer to the paper of [3] for a concise summary of these and some earlier results.
Finally, one reviewer asked about the dependence of the tuning parameters on properties of $A$ and $B$. We now state in Lemma 5 a sharp bound on estimating $\tau_B$ using $\widehat\tau_B$ as in (1.5), which provides a natural plug-in estimate for parameters such as $D_0$ that involve $\tau_B$. Lemma 5. Let $m \ge 2$. Let $X$ be defined as in (1.4) and $\widehat\tau_B$ be as defined in (1.5).
If we replace $\sqrt{\log m}$ with $\log m$ in the definition of event $\mathcal{B}_6$, then we can drop the condition on $n$ or on $r(A)r(B) = \mathrm{tr}(A)$ and achieve the same bound on event $\mathcal{B}_6$.
In an earlier version of the present work by the same authors [39], we presented the rate of convergence for using the corrected gram matrix $\widehat{B} := \frac{1}{m}XX^T - \frac{\mathrm{tr}(A)}{m}I_n$ to estimate $B$, and proved isometry properties in the operator norm once the effective rank of $A$ is sufficiently large compared to $n$; one can then use such an estimated $B$ and its operator norm in $D_2$ and $D_0$. See Theorem 21 and Corollary 22 therein. As mentioned, we use the estimated $\widehat\tau_B$ (cf. Lemma 5) in $D_0$. The dependencies on $A$, $\|\beta^*\|_2$ and $\epsilon$ are known problems in the Lasso and corrected Lasso literature; see [4,30]. For example, the RE condition as stated in Definition 2.1 and its subgaussian concentration properties as shown in [38] clearly depend on the unknown parameter $a_{\max}$ related to the covariance matrix $A$; see Theorem 27 in the present paper. We prove Lemma 5 in Section C.1. Lemma 5 provides a powerful technical insight and one of the key ingredients leading to the tight analysis in Theorems 6 and 7 for the corrected Lasso estimator (1.7) as well as the Conic programming estimator (1.8) in Section 4, where we also present theory for which the dependency on $\|A\|_2$ becomes extremely mild.

Improved bounds when the measurement errors are small
Although the conclusions of Theorems 3 and 4 apply to cases where $\|B\|_2 \to 0$, the error bounds are not as tight as the bounds we are about to derive in this section. So far, we have used rather crude approximations of the error bounds in terms of estimating $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$, for the sake of reducing the number of unknown parameters we need to consider. The bounds we derive in this section take the magnitudes of the measurement errors in $W$ into consideration. As such, we allow the error bounds to depend on the parameter $\tau_B$ explicitly, and they become much tighter as $\tau_B$ becomes smaller. For the extreme case when $\tau_B$ approaches 0, one hopes to recover a bound close to that of the regular Lasso or the Dantzig selector, as the effect of the noise on the procedure should become negligible. We show in Theorems 6 and 7 that this is indeed the case. We first state a more refined result for the Lasso-type estimator, for which we now require only a weaker condition: we replace $\sqrt{N + S}$ in $\lambda$ (3.6) with $\sqrt{N + \tau_B S}$, which leads to a significant improvement in the rates of convergence for estimating $\beta^*$ when $\tau_B \to 0$. Theorem 6. Suppose all conditions in Theorem 3 hold, except that we drop (3.3) and replace (3.6) with a choice of $\lambda$ involving $\widehat{D}_0$ and $\tau_B^+$. Then for any $d$-sparse vector $\beta^* \in \mathbb{R}^m$ such that $\phi b_0^2 \le \|\beta^*\|_2^2 \le b_0^2$, the stated bounds hold. We give an outline for the proof of Theorem 6 in Section 6.3, and show the actual proof in Section 12.
We next state in Theorem 7 an improved bound for the Conic programming estimator (1.8), which dramatically improves upon Theorem 4 when $\tau_B$ is small; there, an "oracle" rate for estimating $\beta^*$ with the Conic programming estimator $\widehat\beta$ (1.8) is defined, along with the predictive error involving $Xv$, where $v = \widehat\beta - \beta^*$. Let $C_0$ satisfy (H.6) for $c$ as defined in Theorem 31. Throughout the rest of the paper, let $\widehat{D}_0$ and $D_{\mathrm{oracle}}$ be as defined in (2.23). Let $C_6 \ge D_{\mathrm{oracle}}$. Let $\rho_n$ and $r_{m,m}$ be as defined in (4.6). Suppose all conditions in Theorem 4 hold, except that we replace the condition on $d$ in (3.8) with the following.
Suppose that the sample size $n$ and the size of the support of $\beta^*$ satisfy the requirements in (4.8). Let $\widehat\tau_B$ be as defined in (1.5). Let $\widehat\beta$ be an optimal solution to the Conic programming estimator as in (1.8), with $\mu$ and $\omega$ chosen using $\widehat\tau_B^{1/2}$ as specified in Theorem 7. Then, with probability at least $1 - c'/m^2 - 2\exp(-\delta^2 n/2000K^4)$, the following bounds hold. Under the same assumptions, the predictive risk admits the following bound with the same probability as above, where $c', C', C > 0$ are some absolute constants.
We give an outline for the proof of Theorem 7 in Section 6.5, and show the actual proof in Section 13.

Oracle results on the Lasso-type estimator
We now discuss the improvement being made in Theorem 6 and Theorem 7.
The signal-to-noise ratio. Let us redefine the signal-to-noise ratio accordingly. When either the noise level $M$ or the measurement error strength, in terms of $\tau_B$, increases, we need to increase the penalty parameter $\lambda$ correspondingly; moreover, the resulting condition on $d$ eventually becomes a vacuous bound when $M$ dominates $S$.
In this setting, we recover essentially the same $\ell_2$ error bound as that in Corollary 1 of [30], where $\sigma^2 \asymp M^2$ and $K^2 \asymp 1$. However, when $\|\beta^*\|_2 = \Omega(1)$, our statistical precision appears to be sharper, as we allow the term $\|\beta^*\|_2$ to be removed entirely from the right-hand side when $\sigma_w \to 0$, and hence we recover the regular Lasso rate of convergence.
The penalization parameter. We focus now on the penalization parameter $\lambda$ in (1.7). The effective rank condition in (3.2) implies a crude lower bound for $n = O(m\log m)$, with $C_B = \frac{1}{16 c' K^4\log(3 e M_A^3/2)}$, given that $\log(m\log m) - \log n > 0$. This bound is very crude given that in practice we focus on cases where $n \ll m\log m$. Note that under (A1), (A2) and (A3), we have a corresponding bound for $n = O(m\log m)$. Without knowing $\tau_B$, we will use $\widehat\tau_B$ as defined in (1.5). Notice that we know neither of these quantities exactly. However, assuming that we normalize the column norms of the design matrix $X$ to be roughly at the same scale, we have, for $\tau_B = O(1)$ and $m$ sufficiently large, the required bound for some large enough constant $M$. In summary, compared to Theorem 3, in $\psi$ we replace the leading term so that the dependency on $\|A\|_2$ becomes much weaker. As mentioned in Section 3.2, we may use the plug-in estimate $\widehat\tau_B$. Finally, the concentration of measure bound for the estimator $\widehat\tau_B$ as in (1.5) is stated in Lemma 5, which ensures that $\widehat\tau_B$ is indeed a good proxy for $\tau_B$ (cf. Lemma 23).
The sparsity parameter. The condition on $d$ (and $D_\phi$) for the Lasso estimator as defined in (4.3) suggests that, as $\tau_B \to 0$ and thus $\tau_B^+ \to 0$, the constraint on the sparsity parameter $d$ becomes slightly more stringent when $K^2 M^2/b_0^2 \asymp 1$, and much more restrictive when $M \to 0$ as well. If $M^2 \ge \tau_B^+\phi K^2 b_0^2$, that is, if the stochastic error in the response variable $y$ in (1.1a) does not converge to 0 as quickly as the measurement error $W$ in (1.1b) does, then the sparsity constraint becomes essentially unchanged as $\tau_B^+ \to 0$, as we show now.
Case 1. $M^2 \ge \tau_B^+\phi K^2 b_0^2$. In this case, essentially, we require a condition on $d$ involving absolute constants $c_0$ and $c'$; the sparsity constraint then remains essentially unchanged as $\tau_B^+ \to 0$. Case 2. Analogous to (3.4), when $M^2 \le \tau_B^+\phi K^2 b_0^2$, we could represent the condition on $d$ accordingly. This condition, however, seems to be unnecessarily strong when $\tau_B \to 0$ (and $M \to 0$ simultaneously). We focus on Case 2 in the present work.
For both cases, it is clear that the sample size needs to satisfy the stated lower bound, where the $\Omega(\cdot)$ notation hides parameters $K$, $M$, $\phi$ and $b_0$, which we treat as absolute constants that do not change as $\tau_B \to 0$. These tradeoffs are somewhat different from the behavior of the Conic programming estimator (cf. (4.17)). We will provide a more detailed analysis in Sections 6.1 and 6.3.

Oracle results on the Conic programming estimator
In order to exploit the oracle bound stated in Theorem 12 regarding $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$, we need to know the noise level $\tau_B := \mathrm{tr}(B)/n$ in $W$; we can then set the parameters $\mu$ and $\omega$ accordingly. This will in turn lead to the improved bounds in Theorems 6 and 7.
The penalization parameter. Without knowing the parameter $\tau_B$, we rely on the estimate $\widehat\tau_B$ as in (1.5), as discussed in Section 3. For a chosen parameter $C_6 \ge D_{\mathrm{oracle}}$, we use $\widehat\tau_B\sqrt{\log m/n}$ in view of Corollary 14, where an improved error bound on $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$ is stated. Without knowing $D_{\mathrm{oracle}}$, we could replace it with an upper bound, for example by assuming that $D_2$ is bounded. The sparsity parameter. Roughly speaking, for the Conic programming estimator (1.8), one can think of $d_0$ as being bounded in a way that depends on $\tau_B$. That is, when $\tau_B$ decreases, we allow larger values of $d_0$; however, when $\tau_B \to 0$, the sparsity level of $d = O(n/\log(m/d))$ starts to dominate, which enables the Conic programming estimator to achieve results similar to the Dantzig Selector when the design matrix $X_0$ is a subgaussian random matrix satisfying the Restricted Eigenvalue conditions; see for example [6,4,38].
In particular, when $\tau_B \to 0$, Theorem 7 allows us to recover a rate close to that of the Dantzig selector, with exact recovery if $\tau_B = 0$ is known a priori; see Section 16. Moreover, the constraint (3.8) on the sparsity parameter $d_0$ appearing in Theorem 4 can now be relaxed as in (4.8). In summary, our results in Theorem 7 are stronger than those in [3] (cf. Corollary 1), as the rates stated therein are of the same order as ours in Theorem 4. We illustrate this dependency on $\tau_B$ in Section 8 with numerical examples, where we clearly show the advantage of taking the noise level into consideration when choosing the penalty parameters for both the Lasso and the Conic programming estimators.

Optimization error on the gradient descent algorithm
We now present our computational convergence bounds.
Theorem 9. Suppose all conditions in Theorem 6 hold and let $\psi$ be defined therein. Let $g(\beta) = \frac{1}{\lambda}\rho_\lambda(\beta)$, where $\rho_\lambda(\beta) = \lambda\|\beta\|_1$. Consider the optimization program (1.10) for a radius $R$ such that $\beta^*$ is feasible, and a regularization parameter $\lambda$ chosen to satisfy (5.2). Suppose that the step size parameter satisfies $\zeta \ge \alpha_u \asymp \frac{3}{2}\lambda_{\max}(A)$. Suppose that the sparsity parameter and the sample size further satisfy the relationship (5.3). Then on event $\mathcal{A}_0 \cap \mathcal{B}_0$, the conclusions in Theorem 2 hold.

Corollary 10. Suppose all conditions stated in Theorem 9 hold and that the event $\mathcal{A}_0 \cap \mathcal{B}_0$ defined therein holds. Consider, for some constant $M \le 400\tau_0$ and $\bar\delta^2$ as defined in Theorem 2, the parameter choices given there. Then for all $t \ge T^*(\delta)$ as in (2.17), and in view of the upper bound $\bar{d}$ in (5.3), the stated bounds hold.

We prove Theorem 9 and Corollary 10 in Section 14.

Discussions
Throughout this section, we assume $\psi$ (4.2) is as defined in Theorem 6. Assume that $\zeta \ge \alpha_u \ge \bar\alpha$. In addition, suppose that the radius $R \asymp b_0\sqrt{d}$, as we set in (1.7). Let $\bar{d}_0 \le \frac{n}{160 M^2 + \log m}$ be as defined in (4.3), where we recall that we require the condition on $d$ stated there; the corresponding bound then follows by the proof of Lemma 18. In contrast, under (5.3), the following upper bound holds on $d$, which is slightly more restrictive in the sense that the maximum level of sparsity allowed on $\beta^*$ has decreased by a factor proportional to $\kappa(A)$ compared to the upper bound $\bar{d}_0$ in (4.3) in Theorem 6: we now require that $|\mathrm{supp}(\beta^*)| \le \bar{d}$, where $\bar{d}$ is the corresponding upper bound with $C_A = 1$.
To consider the general cases stated in Theorem 6, we consider the ideal choice of parameters. Following the derivation in Remark 14.1 and combining (5.6) and (5.8), it is clear that one can set $\lambda$ as in (5.9) in order to satisfy the condition (5.2) on $\lambda$ in Theorem 2 when we set $R$ as in (5.10). This choice is potentially too conservative because we are setting $R$ in (5.10) with respect to the upper sparsity level $\bar{d}_0$ chosen to guarantee statistical convergence, leading to a larger than necessary penalty parameter in (5.9). Similarly, when we choose the step size parameter $\zeta$ to be too large, we need to increase the penalty parameter $\lambda$ correspondingly, given the lower bound in (5.11). Suppose we set $\zeta = \frac{3}{2}\lambda_{\max}(A)$ and $\zeta/\alpha \approx 3\kappa(A)$ as in Theorem 9. It turns out that the less conservative choice of $\lambda$ as in (5.11) is sufficient, for example when $\tau_B = \Omega(1)$, for which we now choose the parameters as in Corollary 10. We will discuss the two scenarios considered in Section 4; see the detailed discussion in Section 14.

Proof of theorems
In Section 6.1, we develop in Theorem 12 the crucial large deviation bound on $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$. This quantity appears in the constraint set of the Conic programming estimator (1.8), and is directly related to the choice of $\lambda$ for the corrected Lasso estimator in view of Theorem 16. Its corollaries are stated in Corollary 13 and Corollary 14. In Section 6.2, we provide an outline and additional Lemmas 15 and 17 used to prove Theorem 3; the full proof of Theorem 3 appears in Section 10. In Section 6.3, we give an outline illustrating the improvement in the Lasso error bounds as stated in Theorem 6, emphasizing the impact of this improvement on the sparsity parameter $d$, which we restate in Lemma 18. In Section 6.4, we provide an outline as well as technical results for Theorem 4. In Section 6.5, we give an outline illuminating the improvement in error bounds for the Conic programming estimator as stated in Theorem 7.

Stochastic error terms
In this section, we first develop stochastic error bounds in Lemma 11, where we also define the events $\mathcal{B}_4$, $\mathcal{B}_5$, $\mathcal{B}_{10}$; recall that $\mathcal{B}_6$ was defined in Lemma 5. Putting the bounds in Lemma 11 together with that in Lemma 5 yields Theorem 12. Lemma 11. Assume that the stable rank of $B$ satisfies $\|B\|_F^2/\|B\|_2^2 \ge \log m$. Let $Z$, $X_0$ and $W$ be as defined in Theorem 3. Let $Z_0$, $Z_1$ and $Z_2$ be independent copies of $Z$. Then the stated bounds hold with high probability, defining the events $\mathcal{B}_4$ and $\mathcal{B}_5$; finally, denote by $\mathcal{B}_{10}$ the event on which the corresponding bound holds. We prove Lemma 11 in Section C.2. Denote by $\mathcal{B}_0 := \mathcal{B}_4 \cap \mathcal{B}_5 \cap \mathcal{B}_6$, which we use throughout this paper.
Let $\widehat\Gamma$ and $\widehat\gamma$ be as in (1.6). Let $D_0 = \sqrt{\tau_B} + \sqrt{a_{\max}}$, and let $\widehat{D}_0$ be as defined in (2.23).
We next state the first corollary, Corollary 13, of Theorem 12, which we use in proving Theorems 3 and 4. Here we state a somewhat simplified bound on $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$, for the sake of reducing the number of unknown parameters involved, at the price of a slight worsening of the statistical error bounds when $\tau_B \asymp 1$. On the other hand, the bound in (6.1) provides a significant improvement over the error bound in Corollary 13 in case $\tau_B = o(1)$. Corollary 13. Suppose all conditions in Theorem 12 hold. Let $\widehat\Gamma$ and $\widehat\gamma$ be as in (1.6). On event $\mathcal{B}_0$, the stated bound holds for $D_2 = 2(\|A\|_2 + \|B\|_2)$ and the absolute constant $C_0$ as defined in Theorem 3.
In particular, Corollary 13 ensures that, for the corrected Lasso estimator, (6.7) holds with high probability for $\lambda$ chosen as in (3.6). We prove Corollary 13 in Section D.
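A small Monte Carlo sketch (Gaussian case with $A = I_m$, $B = \tau_B I_n$; all parameter values hypothetical and the function name mine) can be used to compare the empirical size of $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$ against the $\sqrt{\log m/n}$ scaling suggested by Corollary 13:

```python
import numpy as np

def grad_sup_norm(n, m, d, tau_B, sigma_eps, n_rep=50, seed=0):
    """Monte Carlo estimate of ||gamma_hat - Gamma_hat beta*||_inf in the
    Gaussian case with A = I_m and B = tau_B * I_n."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_rep):
        X0 = rng.standard_normal((n, m))
        W = np.sqrt(tau_B) * rng.standard_normal((n, m))
        X = X0 + W
        beta = np.zeros(m)
        beta[rng.choice(m, d, replace=False)] = 1.0
        y = X0 @ beta + sigma_eps * rng.standard_normal(n)
        trB_hat = max(np.sum(X ** 2) - n * m, 0.0) / m        # plug-in with tr(A) = m
        Gamma_hat = X.T @ X / n - (trB_hat / n) * np.eye(m)
        gamma_hat = X.T @ y / n
        out.append(np.max(np.abs(gamma_hat - Gamma_hat @ beta)))
    return np.mean(out), np.sqrt(np.log(m) / n)               # compare the two scales

# e.g. grad_sup_norm(n=500, m=200, d=5, tau_B=0.2, sigma_eps=0.5)
```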
In this case, the error term involving $\|\beta^*\|_2$ in (4.2) vanishes, and we only need to set $\lambda$ accordingly (cf. Theorem 16), where the second term in $\psi$ defined immediately above comes from the estimation error in Lemma 5; this term vanishes if we were to assume, for example, that $\mathrm{tr}(B)$ is also known. We mention in passing that Corollaries 13 and 14 are crucial in proving Theorems 3, 4, 6 and 7. First, we replace (A3) with (A3'), which reveals some additional information regarding the constant hidden inside the $O(\cdot)$ notation. Lemma 15 then states that, on the event $\mathcal{A}_0$, the modified gram matrix $\widehat\Gamma$ satisfies the Lower and Upper RE conditions with tolerance $\asymp \frac{\log m}{n}$, for $\alpha$, $\bar\alpha$ and $\tau$ as defined in Definitions 2.2 and 2.3, and $C$, $s_0$ in (2.6), provided that for some $c' > 0$ with $c' K^4 < 1$ the stated condition holds. The main focus of the current section is then to apply Theorem 16 to show Theorem 3. Theorem 16 follows from Theorem 1 of [30]. Theorem 16. Consider the regression model in (1.1a) and (1.1b). Let $d \le n/2$. Let $\widehat\gamma$, $\widehat\Gamma$ be as constructed in (1.6). Suppose that the matrix $\widehat\Gamma$ satisfies the Lower-RE condition with curvature $\alpha > 0$ and tolerance $\tau > 0$, where $d$, $b_0$ and $\lambda$ are as defined in (1.7). Then for any $d$-sparse vector $\beta^* \in \mathbb{R}^m$ such that $\|\beta^*\|_2 \le b_0$, the following bounds hold, where $\widehat\beta$ is an optimal solution to the corrected Lasso estimator as in (1.7).
We include the proof of Theorem 16 for the sake of self-containment and defer it to Section G for clarity of presentation. Lemma 17. Let $c'$, $\phi$, $b_0$, $M$, $M_+$ and $K$ be as defined in Theorem 3, where we assume that $b_0^2 \ge \|\beta^*\|_2^2 \ge \phi b_0^2$ for some $0 < \phi \le 1$. Suppose all conditions in Lemma 15 hold. Suppose that $s_0 \ge 32$ and that the sparsity condition (6.9) holds. Then the condition (6.10) holds. In this regime, the conditions on $d$ as in (6.9) can be conveniently expressed as in (3.4) instead.

Improved bounds for the corrected Lasso estimator
The proof of Theorem 6 follows exactly the same line of argument as that of Theorem 3, except that we now use the improved bound on the error term $\|\widehat\gamma - \widehat\Gamma\beta^*\|_\infty$ given in Corollary 14, instead of that in Corollary 13. Moreover, we replace Lemma 17 with Lemma 18, whose proof follows from that of Lemma 17 with $d$ now being bounded as in (4.3). Then (6.10) holds with $\psi$ as defined in Theorem 6 and $\alpha = \frac{5}{8}\lambda_{\min}(A)$.

Outline for proof of Theorem 4
We provide an outline and state the technical lemmas needed for proving Theorem 4. Our first goal is to show that the bound below holds with high probability, where $\mu$, $\omega$ are chosen as in (6.12). This forms the basis for proving the $\ell_q$ convergence, $q \in [1,2]$, for the Conic programming estimator (1.8). This follows immediately from Theorem 12 and Corollary 13; more explicitly, we state it in Lemma 19. Before we proceed, we first need to introduce some notation and definitions. Let $X_0 = Z_1A^{1/2}$ be defined as in (1.4). Let $k_0 = 1 + \lambda$. First we need to define the $\ell_q$-sensitivity parameter for $\Psi := \frac{1}{n}X_0^TX_0$, following [3]; see also [21]. Let $(\widehat\beta, \widehat t)$ be the optimal solution to (1.8) and denote $v = \widehat\beta - \beta^*$. We will state the following auxiliary lemmas, the first of which is deterministic in nature. The two lemmas reflect the two geometric constraints on the optimal solution to (1.8).

Lower and Upper RE conditions
The goal of this section is to show that, for $\Delta$ defined in (7.4), the presumption in Lemmas 37 and 39 holds, so that the Lower and Upper RE conditions hold for all $\upsilon \in \mathbb{R}^m$. Theorem 26. Let $A_{m \times m}$, $B_{n \times n}$ be symmetric positive definite covariance matrices. Let $E = \bigcup_{|J| \le k} E_J$ for $1 \le k < m/2$. Let $Z$, $X$ be $n \times m$ random matrices defined as in Theorem 3. Let $\tau_B$ be defined as in (1.5). Suppose that for some absolute constant $c > 0$ and $0 < \varepsilon \le \frac{1}{C}$, the condition stated below holds.
Then, with probability at least $1 - 4\exp\!\left(-c_2\varepsilon^2\operatorname{tr}(B)\right)$, the bounds above hold. We prove Theorem 26 in Section M.

Numerical results
In this section, we present results from numerical simulations designed to validate the theoretical predictions presented in the previous sections. We implemented the composite gradient descent algorithm as described in [1,30,31] for solving the corrected Lasso objective function (1.7) with $(\widehat\Gamma, \widehat\gamma)$ as defined in (1.6). For the Conic programming estimator, we use the implementation provided by the authors of [3] with the same input $(\widehat\Gamma, \widehat\gamma)$ as in (1.6). Throughout our experiments, $A$ is a correlation matrix with $a_{\max} = 1$. We set the default parameters as follows, where $d$ is the sparsity parameter, the number of non-zero entries in $\beta^*$. In one set of simulations, we also vary $R$.
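For concreteness, the sketch below shows one way to form the surrogate pair $(\widehat\Gamma, \widehat\gamma)$ used as input to both estimators. The exact definition is (1.6); here we assume, as in the modified gram matrix appearing in the proof of Theorem 3, that $\widehat\Gamma = \frac{1}{n}(X^TX - \operatorname{tr}(B)I_m)$, and the choice $\widehat\gamma = \frac{1}{n}X^Ty$ as well as the function name are our own placeholders.

```python
import numpy as np

def surrogate_inputs(X, y, tr_B):
    """Form the corrected gram matrix and cross-covariance used by both estimators.

    Assumes (1.6) takes the form Gamma_hat = (X'X - tr(B) I_m)/n, as in the modified
    gram matrix of the proof of Theorem 3, and gamma_hat = X'y/n; tr_B is tr(B),
    assumed known (or estimated as in Lemma 5).
    """
    n, m = X.shape
    Gamma_hat = (X.T @ X - tr_B * np.eye(m)) / n
    gamma_hat = X.T @ y / n
    return Gamma_hat, gamma_hat
```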
In our simulations, we consider three different models from which $A$ and $B$ are chosen. Let $\Omega = A^{-1} = (\omega_{ij})$ and $\Pi = B^{-1} = (\pi_{ij})$. Let $E$ denote the edge set of $\Omega$ and $F$ the edge set of $\Pi$. We choose $A$ from one of these two models: • AR(1) model. In this model, the covariance matrix is of the form $A = \{\rho_A^{|i-j|}\}_{i,j}$.
The graph corresponding to the precision matrix $A^{-1}$ is a chain.
• Star-Block model. In this model the covariance matrix is block-diagonal with equal-sized blocks whose inverses correspond to star-structured graphs, where $A_{ii} = 1$ for all $i$. We have 32 subgraphs; in each subgraph, 16 nodes are connected to a central hub node with no other connections, and the rest of the nodes in the graph are singletons. The covariance matrix for each block $S$ in $A$ is generated by setting $S_{ij} = \rho_A$ if $(i,j) \in E$ and $S_{ij} = \rho_A^2$ otherwise. We choose $B$ from one of the following models. Recall that $\tau_B = \operatorname{tr}(B)/n$. • AR(1) model. As for $A$, the correlation matrix is $B^* = \{\rho_{B^*}^{|i-j|}\}_{i,j}$, and we set $B = \tau_B B^*$. • We also consider a second model based on $\Pi = B^{-1}$, where we use the random concentration matrix model in [57]. The graph is generated according to a type of Erdős-Rényi random graph model. Initially, we set $\Pi = cI_{n \times n}$, where $c$ is a constant. Then we randomly select $n\log n$ edges and update $\Pi$ as follows: for each new edge $(i,j)$, a weight $w > 0$ is chosen uniformly at random from $[w_{\min}, w_{\max}]$, where $w_{\max} > w_{\min} > 0$; we subtract $w$ from $\pi_{ij}$ and $\pi_{ji}$, and increase $\pi_{ii}$ and $\pi_{jj}$ by $w$. This keeps $\Pi$ positive definite. We then rescale $B$ to have the desired trace parameter $\tau_B$.
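The following sketch constructs the three covariance models under our reading of the descriptions above; the block size, the constant $c$ and the weight range $[w_{\min}, w_{\max}]$ are illustrative choices of ours, not values fixed by the paper.

```python
import numpy as np

def ar1_cov(p, rho):
    """AR(1) covariance: Sigma_ij = rho^|i-j|; its inverse is tridiagonal (a chain)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def star_block_cov(m, rho_A, block_size=17, n_blocks=32):
    """Block-diagonal covariance whose blocks' inverses are star graphs.

    Within each block, node 0 is the hub (1 hub + 16 leaves = 17 nodes); off-hub
    entries are rho_A^2, hub edges are rho_A, diagonal is 1. Leftover nodes are singletons.
    """
    A = np.eye(m)
    for b in range(n_blocks):
        start = b * block_size
        if start + block_size > m:
            break
        S = np.full((block_size, block_size), rho_A ** 2)
        S[0, :] = S[:, 0] = rho_A
        np.fill_diagonal(S, 1.0)
        A[start:start + block_size, start:start + block_size] = S
    return A

def random_graph_cov(n, tau_B, c=2.0, w_min=0.3, w_max=0.7, seed=0):
    """Random concentration matrix model for B: start from Pi = c*I, perturb n*log(n)
    random edges, invert, then rescale so that tr(B)/n = tau_B.

    Each edge update adds w*(e_i - e_j)(e_i - e_j)^T, which is PSD, so Pi stays PD.
    """
    rng = np.random.default_rng(seed)
    Pi = c * np.eye(n)
    for _ in range(int(n * np.log(n))):
        i, j = rng.choice(n, size=2, replace=False)
        w = rng.uniform(w_min, w_max)
        Pi[i, j] -= w
        Pi[j, i] -= w
        Pi[i, i] += w
        Pi[j, j] += w
    B = np.linalg.inv(Pi)
    return B * (tau_B * n / np.trace(B))
```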
For a given $\beta^*$, we first generate matrices $A$ and $B$, where $A$ is $m \times m$ and $B$ is $n \times n$. For the given covariance matrices $A$ and $B$, we repeat the following steps to estimate $\beta^*$ in the errors-in-variables model as in (1.1a) and (1.1b). 1. We first generate random matrices $X_0 \sim \mathcal{N}_{n,m}(0, A \otimes I_n)$ and $W \sim \mathcal{N}_{n,m}(0, I_m \otimes B)$ independently from the matrix variate normal distribution as follows. Let $Z \in \mathbb{R}^{n \times m}$ be a Gaussian random ensemble with independent entries $Z_{ij}$ satisfying $\mathbb{E}Z_{ij} = 0$ and $\mathbb{E}Z_{ij}^2 = 1$. Let $Z_1$, $Z_2$ be independent copies of $Z$. Let $X_0 = Z_1A^{1/2}$ and $W = B^{1/2}Z_2$, where $A^{1/2}$ and $B^{1/2}$ are the unique square roots of the positive definite matrices $A$ and $B = \tau_BB^*$, respectively.
The final relative error is the average over 100 runs for each set of tuning and step-size parameters; for the Conic programming estimator, we solve (1.8) instead of (1.7) to recover $\beta^*$.
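A minimal sketch of one simulation run, following step 1 above and averaging the relative $\ell_2$ error over repetitions; the noise level for $\epsilon$ and the generic `estimator` argument (standing in for either the corrected Lasso or the Conic programming solver) are our own placeholders.

```python
import numpy as np

def simulate_once(A, B, beta_star, sigma_eps=1.0, rng=None):
    """Draw (y, X) from the errors-in-variables model (1.1a)-(1.1b):
    X0 = Z1 A^{1/2}, W = B^{1/2} Z2, X = X0 + W, y = X0 beta* + eps."""
    rng = np.random.default_rng(rng)
    n, m = B.shape[0], A.shape[0]
    # Cholesky factors used for speed; for Gaussian Z this matches the symmetric
    # square root in distribution.
    A_half = np.linalg.cholesky(A).T
    B_half = np.linalg.cholesky(B)
    Z1 = rng.standard_normal((n, m))
    Z2 = rng.standard_normal((n, m))
    X0 = Z1 @ A_half
    X = X0 + B_half @ Z2
    y = X0 @ beta_star + sigma_eps * rng.standard_normal(n)
    return y, X

def average_relative_error(estimator, A, B, beta_star, n_runs=100):
    """Average relative l2 error over repeated draws; `estimator` maps (y, X) to beta_hat."""
    errs = []
    for r in range(n_runs):
        y, X = simulate_once(A, B, beta_star, rng=r)
        beta_hat = estimator(y, X)
        errs.append(np.linalg.norm(beta_hat - beta_star) / np.linalg.norm(beta_star))
    return float(np.mean(errs))
```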

Relative error
In the first experiment, $A$ and $B$ are generated using the AR(1) model with parameters $\rho_A, \rho_{B^*} \in \{0.3, 0.7\}$ and trace parameter $\tau_B \in \{0.3, 0.7, 0.9\}$. We see in Figures 1 and 2 that a larger sample size is required when $\rho_A$, $\rho_{B^*}$ or $\tau_B$ increases. To explain these results, we first recall the signal-to-noise ratio $S/M$, where $S := \|\beta^*\|_2^2$ and $M := 1 + \tau_B\|\beta^*\|_2^2$; this ratio clearly increases as $\|\beta^*\|_2^2$ increases or as the measurement error metric $\tau_B$ decreases. We keep $\|\beta^*\|_2 = 5$ throughout our simulations. The corrected Lasso recovery problem thus becomes more difficult as $\tau_B$ increases. Indeed, we observe that a larger sample size $n$ is needed when $\tau_B$ increases from 0.3 to 0.9 in order to keep the relative $\ell_2$ error at the same level. Moreover, in view of Theorem 6, we can express the relative error as in (8.1), for $\alpha \asymp \lambda_{\min}(A)$ and $K \asymp 1$. Note that when $\|\beta^*\|_2$ is large enough and $\tau_B = \Omega(1)$, the factor preceding $\sqrt{\frac{d\log m}{n}}$ on the RHS of (8.1) is essentially determined by $\tau_B$ and $\lambda_{\min}(A)$.
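As a concrete illustration of how $\tau_B$ degrades the signal-to-noise ratio, plugging $\|\beta^*\|_2 = 5$ into the definitions above gives
\[
\frac{S}{M} \;=\; \frac{\|\beta^*\|_2^2}{1 + \tau_B\|\beta^*\|_2^2}
\;=\;
\begin{cases}
25/8.5 \;\approx\; 2.94, & \tau_B = 0.3,\\
25/18.5 \;\approx\; 1.35, & \tau_B = 0.7,\\
25/23.5 \;\approx\; 1.06, & \tau_B = 0.9,
\end{cases}
\]
so moving from $\tau_B = 0.3$ to $\tau_B = 0.9$ shrinks $S/M$ by roughly a factor of three, consistent with the larger sample sizes needed in Figures 1 and 2.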
When we plot the relative $\ell_2$ error $\|\widehat\beta - \beta^*\|_2/\|\beta^*\|_2$ versus the rescaled sample size $n/(d\log m)$ under the same $S/M$ ratio, the two sets of curves corresponding to $\rho_A = 0.3$ and $\rho_A = 0.7$ indeed line up in Figure 1(b), as predicted by (8.1). We observe in Figure 1(b) that the rescaled curves overlap well for different values of $(m, d)$ for each $\rho_A$, when we keep $(\rho_{B^*}, \tau_B)$ and the length $\|\beta^*\|_2 = 5$ invariant. Moreover, the upper bound on the relative $\ell_2$ error (8.1) characterizes the relative positions of these two sets of curves, in that the ratio between the $\ell_2$ error corresponding to $\rho_A = 0.7$ and that for $\rho_A = 0.3$ along the y-axis roughly falls within the interval predicted by (8.1). (Caption of the corresponding figure: plot (b) shows the relative $\ell_2$ error versus the rescaled sample size $n/(d\log m)$; as $\tau_B$ increases from 0.3 to 0.7, the two sets of curves corresponding to $\rho_{B^*} = 0.3, 0.7$ become visibly more separated, and as $n$ increases, all curves converge to 0.)
In Figure 1(c) and (d), we also show the effect of $\tau_B$ when $\tau_B$ is chosen from $\{0.3, 0.7, 0.9\}$, while fixing the AR(1) parameters $\rho_A = 0.3$ and $\rho_{B^*} = 0.3$. As predicted by our theory, as the measurement error magnitude $\tau_B$ increases, $M$ increases, resulting in a larger relative $\ell_2$ error for a fixed sample size $n$.
While the effect of $\rho_A$, which enters (8.1) through the minimal eigenvalue of $A$, is directly visible in Figure 1(b), the effect of $\rho_{B^*}$ is more subtle, as it is modulated by $\tau_B$, as shown in Figure 2(a) and (b). When $\tau_B$ is fixed, our theory predicts that $\|B\|_2$ plays a role in determining the $\ell_p$ error, $p = 1, 2$, through the penalty parameter $\lambda$, in view of (8.1). The effect of $\rho_{B^*}$, which enters the parameter $D_0' = \|B\|_2^{1/2} + a_{\max}^{1/2}$, does not change the sample requirement or the rate of convergence as significantly as that of $\rho_A$ when $\tau_B = 0.3$. This is shown in the bottom set of curves in Figure 2(a) and (b). On the other hand, the trace parameter $\tau_B$ plays a dominating role in determining the sample size as well as the $\ell_p$ error for $p = 1, 2$, especially when the length of the signal $\beta^*$ is large: $\|\beta^*\|_2 = \Omega(1)$. In particular, the separation between the two sets of curves in Figure 2(b), which correspond to the two choices of $\rho_{B^*}$, is clearly modulated by $\tau_B$ and becomes more visible when $\tau_B = 0.7$.
These findings are also consistent with our theoretical prediction that, in order to guarantee statistical and computational convergence, the sample size needs to grow according to the relationship specified in (8.2). We will show in the proof of Theorem 9 that the condition on the sparsity $d$ as stated in (5.3) implies that as $\rho_A$, $\tau_B$, or the step size parameter $\zeta$ increases, we need to increase the sample size in order to guarantee computational convergence of the composite gradient descent algorithm, given the lower bound on $n$ in (8.2). We illustrate the effect of the penalty and step size parameters in Section 8.2.

Corrected Lasso via GD versus Conic programming estimator
In the second experiment, both $A$ and $B$ are generated using the AR(1) model with parameters $\rho_A = 0.3$, $\rho_{B^*} = 0.3$, and $\tau_B \in \{0.3, 0.7\}$. We set $m = 1024$, $d = 10$ and $\|\beta^*\|_2 = 5$. We then compare the performance of the corrected Lasso estimator (1.7), computed with the composite gradient descent algorithm, with that of the Conic programming estimator, which is a convex program designed and implemented by the authors of [3].
We consider three choices of the step size parameter for the composite gradient descent algorithm: $\zeta_1 = \lambda_{\max}(A) + \frac{1}{2}\lambda_{\min}(A)$, $\zeta_2 = \frac{3}{2}\lambda_{\max}(A)$ and $\zeta_3 = 2\lambda_{\max}(A)$. We observe that the gradient descent algorithm consistently produces an output whose statistical error in the $\ell_2$ norm is lower than that of the best solution produced by the Conic programming estimator, when both methods are subject to optimal tuning after we fix the radius $R$. The factor $f$ is chosen to reflect the fact that, in practice, we do not know the exact value of $\|\beta^*\|_2$ or $\|\beta^*\|_1$, $D_0$ or $D_0'$, or other parameters related to the spectral properties of $A$ and $B$; moreover, in practice, we wish to understand the whole-path behavior of both estimators.
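The composite gradient update we use for (1.7) follows the form in [30,31]: a gradient step on the quadratic loss, followed by soft-thresholding at level $\lambda/\zeta$ and, if needed, projection onto the $\ell_1$ ball of radius $R$. The sketch below is our own rendering of that update under these assumptions; the stopping rule and exact projection routine in our experiments may differ.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_l1_ball(v, radius):
    """Euclidean projection onto {beta : ||beta||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def composite_gd(Gamma_hat, gamma_hat, lam, R, zeta, n_iter=500):
    """Composite gradient descent for the corrected Lasso objective
    0.5 * beta' Gamma_hat beta - gamma_hat' beta + lam * ||beta||_1, s.t. ||beta||_1 <= R."""
    beta = np.zeros(len(gamma_hat))
    for _ in range(n_iter):
        grad = Gamma_hat @ beta - gamma_hat
        beta = soft_threshold(beta - grad / zeta, lam / zeta)
        beta = project_l1_ball(beta, R)
    return beta
```

For example, the three step sizes above can be computed directly from the spectrum of $A$ via `np.linalg.eigvalsh(A)` and passed as `zeta`.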
In Figures 3 and 4, we plot the relative error in the $\ell_1$ and $\ell_2$ norm as $n$ increases from 100 to 2500, while sweeping over the penalty factor $f \in [0.05, 0.8]$, for $\tau_B = 0.3$ and $\tau_B = 0.7$ respectively. For both estimators, the relative $\ell_2$ and $\ell_1$ error versus the rescaled sample size $n/(d\log m)$ are also plotted. In these figures, green dashed lines are for the corrected Lasso estimator via the gradient descent algorithm, and blue dotted lines are for the Conic programming estimator. These plots allow us to observe the behavior of the two estimators across a set of tuning parameters. Overall, we see that both methods are able to achieve low relative $\ell_p$ error, $p = 1, 2$, when $\lambda$ and $\mu$ are chosen from a suitable range.
For the corrected Lasso estimator, we display results where the step size parameter $\zeta$ is set to $\zeta_2 = \frac{3}{2}\lambda_{\max}(A)$ and $\zeta_3 = 2\lambda_{\max}(A)$ in the left and right columns respectively. We mention in passing that the algorithm starts to converge even when we set $\zeta = \zeta_1 = \lambda_{\max}(A) + \frac{1}{2}\lambda_{\min}(A)$, as we observe quantitatively similar behavior to the displayed cases. For both estimators, we observe that a larger sample size $n$ is needed in the case $\tau_B = 0.7$ in order to control the error at the same level as in the case $\tau_B = 0.3$.
In Figure 5, we plot the $\ell_2$ and $\ell_1$ error versus the penalty factor $f \in [0.05, 0.8]$ for sample sizes $n \in \{300, 600, 1200\}$. We plot results for $\tau_B = 0.3$ and $\tau_B = 0.7$ in the left and right columns respectively. For these plots, we focus on cases where $n > d\kappa(A)\log m$ by choosing $n \in \{300, 600, 1200\}$; otherwise, the gradient descent algorithm does not yet meet the sample size requirement (8.2) that guarantees computational convergence. In Figure 5, we observe that the Conic programming estimator is relatively stable over the choices of $\mu$ once $f \ge 0.2$. The composite gradient algorithm favors smaller penalties such as $f \in [0.05, 0.2]$, leading to smaller relative error in the $\ell_1$ and $\ell_2$ norm, consistent with our theoretical predictions. These results also confirm our theoretical prediction that the Lasso and Conic programming penalty parameters $\lambda$ and $\mu$ need to be chosen adaptively based on the noise level $\tau_B$, because a larger than necessary penalty causes larger relative error in both the $\ell_1$ and $\ell_2$ norm.

Sensitivity to tuning parameters
In the third experiment, we change the $\ell_1$-ball radius $R \in \{R^*, 5R^*, 9R^*\}$ in (1.10), where $R^* = \|\beta^*\|_2\sqrt{d}$, while running through different penalties for the composite gradient descent algorithm. In the left column of Figure 6, $A$ and $B$ are generated using the AR(1) model with $\rho_A = 0.3$, $\rho_{B^*} = 0.3$ and $\tau_B = 0.7$. In the right column, we set $\tau_B = 0.3$, while keeping the other parameters invariant.
As predicted by our theory, a larger radius demands a correspondingly larger penalty to ensure consistent estimation using the composite gradient descent algorithm; this in turn increases the relative error when $R$ is too large, for example when $R = \Omega(\sqrt{n/\log m})$, where the $\Omega(\cdot)$ notation hides parameters involving $\tau_B$ and $\kappa(A)$. This is observed in Figure 6, once $n$ is sufficiently large relative to $\tau_B$ and $\kappa(A)$. (Figure caption: in the top row, we plot the relative $\ell_2$ error for the Conic programming estimator (blue dotted lines) and the corrected Lasso (green dashed lines) via the composite gradient descent algorithm with step size parameter $\zeta_2 = \frac{3}{2}\lambda_{\max}(A)$ and $\zeta_3 = 2\lambda_{\max}(A)$; in the bottom row, we plot the relative $\ell_1$ error under the same settings. The composite gradient descent algorithm starts to converge even when the step size parameter is set to $\zeta_1 = \lambda_{\max}(A) + \frac{1}{2}\lambda_{\min}(A)$.)

Statistical and optimization error in Gradient Descent
In the last set of experiments, we study the statistical error and the optimization error at each iteration of the composite gradient descent algorithm. We observe geometric convergence of the optimization error $\|\beta^t - \widehat\beta\|_2$.
For each experiment, we repeat the following procedure 10 times: we start from a random initialization point $\beta^0$ and apply the composite gradient descent algorithm to compute an estimate $\widehat\beta$; we then compute the optimization error $\log(\|\beta^t - \widehat\beta\|_2)$, which records the distance between the iterate $\beta^t$ at time $t$ and the final solution $\widehat\beta$, as well as the statistical error $\log(\|\beta^t - \beta^*\|_2)$, the distance between $\beta^t$ and $\beta^*$ at time $t$. Each curve plots the results averaged over the ten random instances.
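A compact sketch of this error-tracking experiment follows; the step size, penalty and iteration budget are placeholders, the $\ell_1$-ball projection is omitted for brevity, and the final iterate is used as $\widehat\beta$, as in the description above.

```python
import numpy as np

def track_errors(Gamma_hat, gamma_hat, beta_star, lam, zeta, n_iter=200, seed=0):
    """Run a plain gradient + soft-threshold iteration and record, at each step t,
    the optimization error log||beta_t - beta_hat||_2 (beta_hat = final iterate)
    and the statistical error log||beta_t - beta_star||_2."""
    rng = np.random.default_rng(seed)
    m = len(gamma_hat)
    beta = rng.standard_normal(m)                # random initialization beta^0
    iterates = [beta.copy()]
    for _ in range(n_iter):
        grad = Gamma_hat @ beta - gamma_hat
        beta = beta - grad / zeta
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lam / zeta, 0.0)
        iterates.append(beta.copy())
    beta_hat = iterates[-1]
    opt_err = [np.log(np.linalg.norm(b - beta_hat) + 1e-12) for b in iterates[:-1]]
    stat_err = [np.log(np.linalg.norm(b - beta_star)) for b in iterates]
    return opt_err, stat_err
```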
In the first experiment, both $A$ and $B$ are generated using the AR(1) model with parameters $\rho_A = 0.3$ and $\rho_{B^*} = 0.3$. We set $m = 1024$, $d = 10$ and $\tau_B \in \{0.3, 0.7\}$. These results are shown in Figure 7. Within each plot, the red curves show the statistical error and the blue curves show the optimization error. We can see that the optimization error $\|\beta^t - \widehat\beta\|_2$ decreases exponentially with each iteration, obeying geometric convergence. To illuminate the dependence of the convergence rate on the sample size $n$, we study the optimization error $\log(\|\beta^t - \widehat\beta\|_2)$ when $n = \rho\, d\log m$, where we vary $\rho \in \{1, 2, 3, 6, 12, 25\}$. When $n = d\log m$, the composite gradient algorithm fails to converge, since the sample size is too small for the RSC/RSM conditions to hold, resulting in oscillatory behavior of the algorithm for a constant step size. As the factor $\rho$ increases, the lower and upper RE curvature $\alpha$ and smoothness parameter $\alpha_u$ become more concentrated around $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ respectively, and the tolerance parameter $\tau$ decreases at the rate $\frac{\log m}{n}$. Hence we observe faster rates of convergence for $\rho = 25, 12, 6$ compared to $\rho = 2, 3$. This is well aligned with our theoretical prediction that once $n = \Omega\!\big(\kappa(A)\frac{\tau_0}{\lambda_{\min}(A)}\, d\log m\big)$ (cf. (8.2)), we expect to observe geometric convergence of the computational error $\|\beta^t - \widehat\beta\|_2$.
For the statistical error, we first observe the geometric contraction, and then the curves flatten out after a certain number of iterations, confirming the claim that $\beta^t$ converges to $\beta^*$ only up to a neighborhood whose radius is defined through the statistical error bound $\varepsilon^2_{\mathrm{stat}}$; that is, the geometric convergence is not guaranteed to an arbitrary precision, but only to an accuracy related to the statistical precision of the problem, measured by the $\ell_2$ error $\|\widehat\beta - \beta^*\|_2^2 =: \varepsilon^2_{\mathrm{stat}}$ between the global optimizer $\widehat\beta$ and the true parameter $\beta^*$.

In the second experiment, $A$ is generated from the Star-Block model, where we have 32 subgraphs and each subgraph has 16 edges; $B$ is generated using the random graph model with $n\log n$ edges and rescaled to have $\tau_B = 0.3$. We set $m = 1024$, $n = 2500$ and $d = 10$. We then choose $\rho_A \in \{0.3, 0.5, 0.7, 0.9\}$. The results are shown in Figure 8(b). As we increase $\rho_A$, we need a larger sample size to control the statistical error. Hence, for a fixed $n$, the statistical error is larger for $\rho_A = 0.7$, for which $\kappa(A) = 169.4$, than for $\rho_A = 0.5$ and $\rho_A = 0.3$, for which $\kappa(A) = 42.06$ and $\kappa(A) = 10.2$ respectively; moreover, the rates of convergence are faster for the latter two compared to $\rho_A = 0.7$. When $\rho_A = 0.9$, the composite gradient descent algorithm fails to converge, as $\kappa(A)$ is too large with respect to the sample size we fixed (hence this case is not plotted). In Figure 8(a), we show results where $A$ is generated using the AR(1) model with four choices of $\rho_A \in \{0.3, 0.5, 0.7, 0.9\}$ and $B$ is generated using the AR(1) model with $\rho_{B^*} = 0.7$ and $\tau_B = 0.3$. We observe quantitatively similar behavior to Figure 8(b).

Proof of Lemma 1
Proof of Lemma 1. Part I: Suppose that the Lower-RE condition holds for $\Gamma := A^TA$. Let $x \in \operatorname{Cone}(s_0, k_0)$. Then, for $x \in \operatorname{Cone}(s_0, k_0) \cap S^{p-1}$ and $\tau(1 + k_0)^2s_0 \le \alpha/2$, we obtain the bound below, where we use the fact that for any $J \subseteq \{1, \ldots, p\}$ such that $|J| \le s_0$, $\|x_J\|_2 \le \|x_{T_0}\|_2$. We now show the other direction.
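The omitted display is a standard restricted-eigenvalue calculation; a plausible reconstruction, assuming the Lower-RE condition takes the form $x^T\Gamma x \ge \alpha\|x\|_2^2 - \tau\|x\|_1^2$ as in Definition 2.2 and that $T_0$ indexes the $s_0$ largest coordinates of $x$ in absolute value, is
\begin{align*}
\|Ax\|_2^2 = x^T\Gamma x
  &\ge \alpha\|x\|_2^2 - \tau\|x\|_1^2
   \ge \alpha\|x\|_2^2 - \tau(1+k_0)^2\|x_{T_0}\|_1^2 \\
  &\ge \alpha\|x\|_2^2 - \tau(1+k_0)^2 s_0\|x_{T_0}\|_2^2
   \ge \big(\alpha - \tau(1+k_0)^2 s_0\big)\|x\|_2^2
   \ge \tfrac{\alpha}{2}\,\|x_{T_0}\|_2^2 ,
\end{align*}
where the second inequality uses the cone condition $\|x_{T_0^c}\|_1 \le k_0\|x_{T_0}\|_1$, the third uses $\|x_{T_0}\|_1^2 \le s_0\|x_{T_0}\|_2^2$, and the last two use $\|x_{T_0}\|_2 \le \|x\|_2 = 1$.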
Part II. Assume that $RE(4R^2, 2R - 1, A)$ holds for some integer $R > 1$. Let $(x^*_i)_{i=1}^p$ be the non-increasing rearrangement of $(|x_i|)_{i=1}^p$, and let $J := \{1, \ldots, s\}$ with $s = 4R^2$. Then, for all $x \in S^{p-1}$ such that $\|x\|_1 \le R\|x\|_2$, we have, for $k_0 = 2R - 1$ and $s_0 := 4R^2$, the desired bound, where we use the fact that $(1 + k_0)\|x_{T_0}\|_2^2 \ge \|x\|_2^2$ by Lemma 33, with $x_{T_0}$ as defined therein. Otherwise, suppose that $\|x\|_1 \ge R\|x\|_2$. Then, for a given $\tau > 0$, the claim follows by the choice of $\tau$ as in (2.2) and (9.3). The lemma thus holds.

Proof of Theorem 3
Throughout this proof, we assume that $\mathcal{A}_0 \cap \mathcal{B}_0$ holds. First we note that it is sufficient to have (3.2) in order for (6.5) to hold. Condition (3.2) guarantees the required bound, where the last inequality holds given that $k\log(cm/k)$ on the RHS of (10.1) is a monotonically increasing function of $k$. Next we check that the choice of $d$ as in (3.4) ensures that (6.9) holds for $D_\phi$ defined there; indeed, this follows for $c'K^4 \le 1$. By Lemma 15, we have on event $\mathcal{A}_0$ that the modified gram matrix $\widehat\Gamma_A := \frac{1}{n}(X^TX - \operatorname{tr}(B)I_m)$ satisfies the Lower-RE condition with $\alpha$ and $\tau$ as in (10.2). Theorem 3 then follows from Theorem 16, so long as we can show that condition (6.6) holds for $\lambda \ge 4\psi\sqrt{\frac{\log m}{n}}$, where the parameter $\psi$ is as defined in (3.6). Combining (10.2) and (6.6), we need to show that (6.10) holds; this is precisely the content of Lemma 17. This concludes the proof of Theorem 3.

Proof of Theorem 4
For the set $\operatorname{Cone}_J(k_0)$ as in (F.3), recall the following Theorem 27 from [38]. Theorem 27 ([38]). Set $0 < \delta < 1$, $k_0 > 0$, and $0 < d_0 < m$. Let $A^{1/2}$ be an $m \times m$ matrix satisfying the $RE(d_0, 3k_0, A^{1/2})$ condition as in Definition 2.1. Let $\Psi$ be an $n \times m$ matrix whose rows are independent isotropic $\psi_2$ random vectors in $\mathbb{R}^m$ with constant $\alpha$. Suppose the sample size satisfies the lower bound stated there; then the conclusion holds with the stated probability. Proof of Theorem 4. Suppose $RE(2d_0, 3k_0, A^{1/2})$ holds. Then for $d$ as defined in (3.10) and $n = \Omega(dK^4\log(m/d))$, we have with probability at least $1 - 2\exp(-\delta^2n/2000K^4)$ the desired bound. The rest of the proof follows from Theorem 1 of [3], and thus we only provide a sketch. In more detail, in view of the lemmas shown in Section 6, we need the $\ell_q$-sensitivity condition to hold, for some constant $c$, for $\Psi := \frac{1}{n}X_0^TX_0$. It is shown in Appendix C of [3] that under the $RE(2d_0, k_0, \frac{1}{\sqrt{n}}Z_1A^{1/2})$ condition, for any $d_0 \le m/2$ and $1 \le q \le 2$, the $\ell_q$-sensitivity bound holds with a constant $c(q) > 0$ depending on $k_0$ and $q$. The theorem is thus proved following exactly the same line of arguments as in the proof of Theorem 1 in [3], in view of the $\ell_q$-sensitivity condition derived immediately above and Lemmas 19, 20 and 21. Indeed, for $v := \widehat\beta - \beta^*$, we have by the definition of $\ell_q$-sensitivity as in (6.13) a bound on $\|v\|_q$. Thus, for $d_0 = c_0n/\log m$ with $c_0$ sufficiently small, the right-hand side is sufficiently small and (3.11) holds. The prediction error bound follows exactly the same line of arguments as in [3], which we omit here; see the proof of Theorem 7 in Section 6.5 for details.

Proof of Theorem 6
Throughout this proof, we assume that $\mathcal{A}_0 \cap \mathcal{B}_0$ holds. The proof is identical to that of Theorem 3 up to (10.2), except that we replace the condition on $d$ in the theorem statement with (4.3). Theorem 6 follows from Theorem 16, so long as we can show that condition (6.6) holds for $\alpha$ and $\tau = \frac{\lambda_{\min}(A) - \alpha}{s_0}$ as defined in (10.2), and $\lambda \ge 2\psi\sqrt{\frac{\log m}{n}}$, where the parameter $\psi$ is as defined in (6.3). Combining (10.2) and (6.6), we need to show that (6.10) holds; this is precisely the content of Lemma 18. This concludes the proof of Theorem 6.

Proof of Theorem 9
Suppose that event $\mathcal{A}_0 \cap \mathcal{B}_0$ holds. The condition on $d$ in (5.3) implies the bound we need. To see this, note that the following holds by the first bound in (5.3), where $\alpha = \frac{5}{8}\lambda_{\min}(A)$ by Lemma 8, and hence $\bar\alpha \ge \frac{59\alpha}{60}$. Thus, by the definition of $\nu(d, m, n)$ and the second bound on $n$ in (14.1), we obtain the next inequality. That is, we actually need the displayed condition involving $\frac{\bar\alpha}{8\zeta}$ to hold, where we use the second bound in (14.1), and hence the required bound on $\bar\alpha$ follows. Finally, putting all the bounds into (2.11), we have $0 < \kappa < 1$. Thus the conclusions of Theorem 2 hold.

Proof of Corollary 10
Suppose that event $\mathcal{A}_0 \cap \mathcal{B}_0$ holds. Recall that $\xi \ge 10\tau_\ell(\mathcal{L}_n)$ by the definition of $\xi$ in (2.12), and note what the condition (5.2) on $\lambda$ as stated in Theorem 2 implies. We first show that, for the choice of $\lambda$ and $R$ as in (14.4), the displayed bound holds; then (5.4) holds.
For the second term on the RHS of (2.16), we use (14.1). Consider the choice $\bar\eta = \delta^2$, where $M\delta^2 \ge \bar\eta = \delta^2$ is bounded below as stated. Thus, for the choice in (14.4), we have $\bar\varepsilon = 4\delta^2/\lambda$. Then, for the last term on the RHS of (2.16), we use the bound relating $\tau_\ell(\mathcal{L}_n)$ and $\tau$. Finally, with the stated choice fixed, (5.5) holds, given that the last term on the RHS of (2.16) is now bounded as required.

Now we obtain an upper bound using the display below, where we use (5.3) and the fact that $\frac{\tau_0}{M_+} = 12.5\,C\,\ell(s_0+1)$. We now discuss the implications of this bound on the choice of $\lambda$ in Section 5.1. We consider two cases.
• When $\tau_B = \Omega(1)$: it is sufficient to have the stated bound for $\|\beta^*\|_2 \le b_0$ and $\tau_B = \Omega(1)$. Combining this with the condition on $d$ as in (5.3) implies that it is sufficient to set $R$ as stated; hence it is sufficient to have the displayed bound, with $\psi$, $D_0$, $K$, $M_+$ and $\tau_B$ as above.

Proof of Theorem 12
Clearly the condition on the stable rank of $B$ guarantees that the conditions in Lemmas 11 and 5 hold. By Lemma 11, we have on $\mathcal{B}_4$, for $D_0 := \sqrt{\tau_B} + a_{\max}^{1/2}$, the first bound, and on event $\mathcal{B}_5$, for $D_0'$ as defined in (2.23), the second. We then have on $\mathcal{B}_0$ and under (A1), by Lemmas 11 and 5 and $D_1$ defined therein, the stated bound. Finally, we have by the union bound $P(\mathcal{B}_0) \ge 1 - 16/m^3$. This concludes the proof of Theorem 12.

Conclusion
In this paper, we provide a unified analysis of the rates of convergence for both the corrected Lasso estimator (1.7) and the Conic programming estimator (1.8). As $n$ increases or as the measurement error metric $\tau_B$ decreases, we see performance gains over the entire paths in both the $\ell_1$ and $\ell_2$ error for both estimators, as expected. When we focus on the lowest $\ell_2$ error along the paths as we vary the penalty factor $f \in [0.05, 0.8]$, the corrected Lasso via the composite gradient descent algorithm performs slightly better than the Conic programming estimator, as shown in Figure 5.
For the Lasso estimator, when we require that the stochastic error in the response variable $y$ in (1.1a) does not approach 0 as quickly as the measurement error $W$ in (1.1b) does, the sparsity constraint remains essentially unchanged as $\tau_B \to 0$. These tradeoffs are somewhat different from the behavior of the Conic programming estimator relative to the Lasso estimator; however, we believe the differences are minor. Eventually, as $\tau_B \to 0$, the relaxation on $d$ as in (4.17) enables the Conic programming estimator to achieve bounds which are essentially identical to those of the Dantzig Selector when the design matrix $X_0$ is a subgaussian random matrix satisfying the Restricted Eigenvalue conditions; see for example [6,4,38].
This allows us to recover the regular Lasso bounds in $\ell_q$ loss, $q = 1, 2$, as in (4.5) of Theorem 6. Moreover, if $\operatorname{tr}(B)$ is given, then one can drop the second term in $\psi$ as in (4.2), which involves $\|\beta^*\|_2$, entirely, and hence recover the Lasso bound as well.
Finally, we note that the bounds corresponding to the Upper RE condition as stated in Corollary 25, Theorem 26 and Lemma 15 are not needed for Theorems 3 and 6. They are useful for ensuring algorithmic convergence and for bounding the optimization error of gradient descent-type algorithms as considered in [1,30], when one is interested in approximately solving the nonconvex optimization problem (1.7). Our Theorem 9 illustrates this result. Our theory in Theorem 9 predicts the dependence of the computational and statistical rates of convergence of the corrected Lasso via gradient descent on the condition number $\kappa(A)$, the trace parameter $\tau_B$, and the radius $R$, which in turn depends on $\tau_B$ and the sparse and minimal eigenvalues of $A$. Therefore, we need to increase the penalty when we increase the $\ell_1$-ball radius $R$ in (1.10) in order to ensure algorithmic and statistical convergence, as predicted by Theorem 9. This is well aligned with the observation in Figure 6. Our numerical results validate these algorithmic and statistical convergence properties.
where the distance between $\beta^t$ and the global optimizer $\widehat\beta$ is measured in terms of the objective function $\phi$, namely $\delta_t = \phi(\beta^t) - \phi(\widehat\beta)$.
We first show Lemma 28, which ensures that the vector $\Delta^t := \beta^t - \widehat\beta$ satisfies a certain cone-type condition. The proof is omitted, as it is a shortened version of the proof of Lemma 1 in [31]. Lemma 28 (Iterated Cone Bound). Under the conditions of Theorem 2, suppose there exists a pair $(\eta, T)$ such that (B.1) holds. Then for any iteration $t \ge T$, the stated cone bound holds. We next state the following auxiliary result on the loss function; we use Lemma 29 in the proofs of Lemma 28 and Corollary 10. Lemma 29. Suppose the RSC and RSM conditions as stated in (2.8) and (2.9) hold with parameters $(\alpha_\ell, \tau_\ell(\mathcal{L}_n))$ and $(\alpha_u, \tau_u(\mathcal{L}_n))$ respectively. Under the conditions of Theorem 2, suppose there exists a pair $(\eta, T)$ such that (B.1) holds. Then for any iteration $t \ge T$ and $0 < \kappa < 1$, the stated bound holds.
Proof of Theorem 2. We are now ready to put together the final argument for the theorem. First notice that (2.16) follows from (2.15) directly, in view of (B.3) and Lemma 28, where we set $\bar\eta = \delta^2$, $\bar\varepsilon_{\mathrm{stat}} = 8\sqrt{d}\,\varepsilon_{\mathrm{stat}}$ and $\bar\varepsilon = 2\min\{2\delta^2/\lambda, R\}$.
Thus the claim follows. The remainder of the proof follows an argument in [1]. We first prove the following inequality. We divide the iterations $t \ge 0$ into a series of epochs $[T_\ell, T_{\ell+1}]$ and define the tolerances accordingly. In the first epoch, we apply Lemma 30 for any iteration $t \ge 0$; then we obtain the stated bound for any iteration $t \ge T_1$. The same argument can now be applied in a recursive manner. Suppose that for some $\ell \ge 1$ we are given a pair $(\eta_\ell, T_\ell)$ such that the induction hypothesis holds. We then define the next pair and apply Lemma 30 to obtain, for any iteration $t \ge T_\ell$ and $\varepsilon_\ell := 2\min\{\eta_\ell\lambda, R\}$, a bound which implies that the claim holds for all $t \ge T_{\ell+1}$, by our choice of $\{\eta_\ell, T_\ell\}_{\ell \ge 1}$. Finally, we use the recursion to establish that the error drops below $\delta^2$ after at most $\ell_\delta \asymp \log\log(R\lambda/\delta^2)$ epochs. Combining the above bound on $\ell_\delta$ with the recursion, (B.6) is guaranteed to hold for all subsequent iterations. To establish (B.7), we start with $\ell = 0$ and show that, for $\bar\varepsilon_{\mathrm{stat}} = 8\sqrt{d}\,\varepsilon_{\mathrm{stat}}$ and $\varepsilon_1 := 2\min\{\eta_1\lambda, R\}$, the base case holds. Assume that $\bar\varepsilon_{\mathrm{stat}} \le \varepsilon_1$ (otherwise, we are done at the first epoch). First, we obtain the bound for $\ell = 1$, where in the last three steps we use the fact that $\lambda \ge \frac{16R\xi}{1-\kappa}$ together with (B.8). Thus (B.6) holds for $\ell = 1$.
Now assume that (B.7) holds for all epochs up to $\ell$. In the induction step, we again use the assumption that $\varepsilon_\ell := 2\eta_\ell\lambda \ge \bar\varepsilon_{\mathrm{stat}}$ together with (B.6). Finally, by the induction assumption, we use the bound immediately above to complete the induction. The rest of the proof follows from that of Corollary 10. This concludes the proof of Theorem 2.
It remains to prove Lemma 29.

Appendix C: Some auxiliary results
We first need to state the following form of the Hanson-Wright inequality, as recently derived in Rudelson and Vershynin [37], and an auxiliary result, Lemma 32, which may be of independent interest.
Theorem 31. Let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ be a random vector with independent components $X_i$ which satisfy $\mathbb{E}(X_i) = 0$ and $\|X_i\|_{\psi_2} \le K$. Let $A$ be an $m \times m$ matrix. Then, for every $t > 0$, the Hanson-Wright tail bound holds. We note that, following the proof of Theorem 31, it is clear that the following decoupled version also holds: let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ be a random vector as defined in Theorem 31, let $Y, Y'$ be independent copies of $X$, and let $A$ be an $m \times m$ matrix; then, for every $t > 0$, the analogous bound holds. We next state Lemma 32, which we prove in Section N. Lemma 32. Let $u, w \in S^{n-1}$. Let $A_0$ be an $m \times m$ symmetric positive definite matrix. Let $Z$ be an $n \times m$ random matrix with independent entries $Z_{ij}$ satisfying $\mathbb{E}Z_{ij} = 0$ and $\|Z_{ij}\|_{\psi_2} \le K$. Let $Z_1, Z_2$ be independent copies of $Z$. Then, for every $t > 0$, the stated bound holds, where $c$ is the same constant as defined in Theorem 31.
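In its standard form, the Hanson-Wright inequality bounds $P(|X^TAX - \mathbb{E}X^TAX| > t)$ by $2\exp\{-c\min(t^2/(K^4\|A\|_F^2),\, t/(K^2\|A\|_2))\}$. The quick Monte Carlo check below (with Rademacher entries, so $K$ is an absolute constant, and a test matrix of our choosing) simply compares the empirical fluctuation of $X^TAX$ with $\|A\|_F$, which sets the subgaussian part of the tail.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 100, 5000

# A fixed test matrix and i.i.d. Rademacher vectors (mean 0, bounded psi_2 norm).
A = rng.standard_normal((m, m)) / np.sqrt(m)
X = rng.choice([-1.0, 1.0], size=(n_samples, m))

# Quadratic forms X^T A X and their centered fluctuations.
q = np.einsum('ij,jk,ik->i', X, A, X)
fluct = q - q.mean()

# Hanson-Wright predicts subgaussian behaviour at scale ||A||_F for moderate t.
fro = np.linalg.norm(A, 'fro')
print("empirical std of X^T A X :", fluct.std())
print("||A||_F                  :", fro)
print("P(|X^T A X - E| > 2||A||_F) ~", np.mean(np.abs(fluct) > 2 * fro))
```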

C.1. Proof of Lemma 5
First we write the decomposition below. By constructing a new matrix $A_n = I_n \otimes A$, which is block diagonal with $n$ identical submatrices $A$ along its diagonal, we prove the following large deviation bound for $t_1 = C_0K^2\|A\|_F\sqrt{n\log m}$ and $n > \log m$, where the first inequality holds by Theorem 31 and the second inequality holds given that $\|A_n\|_F^2 = n\|A\|_F^2$ and $\|A_n\|_2 = \|A\|_2$. Similarly, by constructing a new matrix $B_m = I_m \otimes B$, which is block diagonal with $m$ identical submatrices $B$ along its diagonal, we prove the analogous large deviation bound. Finally, we have by (C.1), for $t_0 = C_0K^2\sqrt{\operatorname{tr}(A)\operatorname{tr}(B)\log m}$, the third bound. To see this, recall the definitions above; then we can also guarantee the stated inequality. The lemma is thus proved.
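The two identities used above for the block-diagonal lifting, $\|I_n \otimes A\|_F^2 = n\|A\|_F^2$ and $\|I_n \otimes A\|_2 = \|A\|_2$, are easy to verify numerically; the small check below uses arbitrary dimensions of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 7, 5

# A small symmetric positive definite A, and its block-diagonal lifting I_n (x) A.
G = rng.standard_normal((m, m))
A = G @ G.T + m * np.eye(m)
A_n = np.kron(np.eye(n), A)

assert np.isclose(np.linalg.norm(A_n, 'fro') ** 2, n * np.linalg.norm(A, 'fro') ** 2)
assert np.isclose(np.linalg.norm(A_n, 2), np.linalg.norm(A, 2))
print("Frobenius and operator norm identities verified.")
```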

C.2. Proof of Lemma 11
Following Lemma 32, we have for all $t > 0$, $B_0$ being an $n \times n$ symmetric positive definite matrix and $v, w \in \mathbb{R}^m$, the two tail bounds stated there. Proof of Lemma 11. Let $e_1, \ldots, e_m \in \mathbb{R}^m$ be the canonical basis of $\mathbb{R}^m$. Let $x_1, \ldots, x_m$ and $x_1', \ldots, x_m' \in \mathbb{R}^n$ be the column vectors of $Z_1$ and $Z_2$ respectively. Clearly the condition on the stable rank of $B$ is satisfied for all $i$. By (C.1), we obtain, for $t = C_0MK\sqrt{\operatorname{tr}(B)\log m}$, a uniform bound over $j$ for the quantities involving $B^{1/2}Z_2e_j$, where the last inequality holds by the union bound, given that, for all $j$ and $t = C_0K^2\sqrt{\log m}\operatorname{tr}(B)^{1/2}$, the individual bound holds. Let $v, w \in S^{m-1}$. Thus, by Lemma 32, for $t_0 = C_0MK\sqrt{n\log m}$ and $n \ge \log m$, the stated bound holds; therefore, with probability at least $1 - 4/m^3$, the desired inequality holds. The "moreover" part follows exactly the same arguments as above. Denote by $\bar\beta^* := \beta^*/\|\beta^*\|_2 \in E \cap S^{m-1}$ and $w_i := A^{1/2}e_i/\|A^{1/2}e_i\|_2$. By (C.3) and the two inequalities immediately above, we have with probability at least $1 - 4/m^3$ the two stated bounds. The last two bounds follow exactly the same arguments as above, except that we replace $\beta^*$ with $e_j$, $j = 1, \ldots, m$, and apply the union bound to $m^2$ instead of $m$ events. The corollary is thus proved.
and hence the claim follows. F.1. Comparing the two types of RE conditions in Theorems 3 and 4. We define $\operatorname{Cone}(d_0, k_0)$, where $0 < d_0 < m$ and $k_0$ is a positive number, as the set of vectors in $\mathbb{R}^m$ which satisfy the cone constraint below. For each vector $x \in \mathbb{R}^m$, let $T_0$ denote the locations of the $d_0$ largest coefficients of $x$ in absolute value. The following elementary estimate from [38] will be used in conjunction with the RE condition.
We first show Lemma 35. Proof. By the optimality of $\widehat\beta$, we have the basic inequality. Hence, for $\lambda \ge 4\psi\sqrt{\frac{\log m}{n}}$, by the triangle inequality and $\beta_{S^c} = 0$, we obtain the stated bound. We now give a lower bound on the LHS of (G.1), applying the Lower-RE condition as in Definition 2.2, where we use the assumption that both $\widehat\beta$ and $\beta^*$ have $\ell_1$ norm bounded by $b_0\sqrt{d}$, which holds by the triangle inequality. Hence, by (G.3) and (G.5), we obtain the desired bound, and the lemma holds.
Proof of Theorem 16. Following the conclusion of Lemma 35, we obtain the first bound. Moreover, by the Lower-RE condition as in Definition 2.2, we obtain the second bound, where the last inequality follows from the assumption that $16d\tau \le \alpha/2$.

H.2. Proof of Lemma 18
The proof for $d \le \frac{\alpha}{32\tau} = \frac{5s_0}{96}$ follows from (H.2). In order to show the second inequality, we follow the same line of arguments, except that we need to replace inequality (H.4) with (H.5). By the definition of $D_0'$, we have $\|B\|_2 + a_{\max} \le (D_0')^2 \le 2(\|B\|_2 + a_{\max})$. Let $D_\phi = \ell(s_0+1)$.
By (6.11), (H.1) and (H.3), we have, for $c'$ small enough, the stated bound, where, assuming that $s_0 \ge 32$, the following inequality holds by the definition of $s_0$ and $\alpha = \frac{5}{8}\lambda_{\min}(A)$. We now replace (H.4) with (H.5) and take $\psi = C_0\,D_0'\,K\cdot K\tau_B$. The lemma is thus proved.
Remark H.1. Throughout this paper, we assume that $C_0$ is a large enough constant such that, for $c$ as defined in Theorem 31, the stated inequality holds. Now, by the triangle inequality, the lemma thus holds given the preceding bound.

I.3. Proof of Lemma 21
Recall the following shorthand notation. First we derive an upper bound for $v = \widehat\beta - \beta^*$, with $D = \operatorname{tr}(B)$ and $\widehat D = \widehat{\operatorname{tr}}(B)$. On event $\mathcal{B}_0$, we have, by Lemma 20 and the fact that $\widehat\beta \in \Upsilon$, the first bound, and on event $\mathcal{B}_4$, the second. Thus on event $\mathcal{B}_0$ we obtain the combined bound. Now on event $\mathcal{B}_6$, we have, for $2D_1 \le D_2$, the next estimate, and on event $\mathcal{B}_5 \cap \mathcal{B}_{10}$, the corresponding bound. Thus we have on $\mathcal{B}_0 \cap \mathcal{B}_{10}$, where $D_0 \le D_2$ and $\tau_A = 1$, the final estimates. The lemma thus holds.
Appendix J: Proof of Theorem 7. We prove Lemmas 22 to 24 in this section.

J.3. Proof of Lemma 24
For the rest of the proof, we follow the notation in the proof of Lemma 21. Notice that the bounds as stated in Lemma 20 remain true with $\omega$, $\mu$ chosen as in (6.16), so long as $(\beta^*, \|\beta^*\|_2) \in \Upsilon$. This indeed holds by Lemma 22: for $\omega$ and $\mu$ chosen as in (4.11) in Theorem 7, we have (J.2), which ensures that $(\beta^*, \|\beta^*\|_2) \in \Upsilon$ by Lemma 22.

Appendix K: Some geometric analysis results
Let us define the following set of vectors in $\mathbb{R}^m$: $\operatorname{Cone}(s) := \{\upsilon : \|\upsilon\|_1 \le \sqrt{s}\|\upsilon\|_2\}$. For each vector $x \in \mathbb{R}^m$, let $T_0$ denote the locations of the $s$ largest coefficients of $x$ in absolute value. Any vector $x \in S^{m-1}$ satisfies the elementary estimate below. We need to state the following result from [33]. Let $S^{m-1}$ be the unit sphere in $\mathbb{R}^m$ and, for $1 \le s \le m$, let $U_s := \{x \in \mathbb{R}^m : |\operatorname{supp}(x)| \le s\}$ (K.2); the set $U_s$ is the union of the $s$-sparse vectors. The following three lemmas are well known and mostly standard; see [33] and [30]. Proof. Fix $x \in \mathbb{R}^m$. Let $x_{T_0}$ denote the subvector of $x$ confined to the locations of its $s$ largest coefficients in absolute value; moreover, we use the same symbol for its zero-extended version in $\mathbb{R}^m$, which vanishes on $T_0^c$ and agrees with $x$ on $T_0$. Throughout this proof, $T_0$ is understood to be the locations of the $s$ largest coefficients of $x$ in absolute value.
Moreover, let $(x^*_i)_{i=1}^m$ be the non-increasing rearrangement of $(|x_i|)_{i=1}^m$. Any vector $x \in \mathbb{R}^m$ satisfies the bound below. It follows that for any $\rho > 0$, $s \ge 1$ and all $z \in L$, the $i$th largest coordinate of $z$ in absolute value is at most $\sqrt{s}/i$, and $\sup_{z \in L}\langle x, z\rangle$ is bounded by the sum of the two maxima displayed, where clearly $\max_{\|z\|_2 \le \rho}\langle x_{T_0}, z\rangle = \rho$, given that for the convex function $\langle x, z\rangle$ the maximum is attained at an extreme point; in this case, it is attained at a $z$ supported on $T_0$, with $z_{T_0}$ as specified and $z_{T_0^c} = 0$. Lemma 37. Let $1/5 > \delta > 0$. Let $E = \cup_{|J| \le s}E_J$ for $0 < s < m/2$ and $k_0 > 0$. Let $\Delta$ be an $m \times m$ matrix such that the condition below holds. Then for all $\upsilon \in \sqrt{s}B_1^m \cap B_2^m$, $|\upsilon^T\Delta\upsilon| \le 4\delta$. (K.9) Proof. It is sufficient to show that (K.9) holds for all $\upsilon \in \operatorname{Cone}(s) \cap S^{m-1}$. Denote by $\operatorname{Cone} := \operatorname{Cone}(s)$. Clearly this set of vectors satisfies the inclusion below, and thus (K.9) follows from (K.5). Clearly $\operatorname{Cone}(s, 1) \subset \operatorname{Cone}(s)$, given that for all $u \in \operatorname{Cone}(s, 1)$ the defining inequality holds. Lemma 39. Suppose all conditions in Lemma 37 hold. Then for all $\upsilon \in \mathbb{R}^m$, $|\upsilon^T\Delta\upsilon| \le 4\delta\big(\|\upsilon\|_2^2 + \tfrac{1}{s}\|\upsilon\|_1^2\big)$. Proof. The lemma follows given that, for any $\upsilon \in \mathbb{R}^m$, one of the following must hold: if $\upsilon \in \operatorname{Cone}(s)$, then $|\upsilon^T\Delta\upsilon| \le 4\delta\|\upsilon\|_2^2$; otherwise the second bound applies. In fact, the same conclusion holds for all $y, w \in F \cap S^{m-1}$; in particular, for $B = I$, we have the following, where recall $X_0 = Z_1A^{1/2}$. Notice the decomposition below. We first bound the middle term as follows. Fix $u, \upsilon \in E \cap S^{m-1}$. Then on event $\mathcal{B}_2$, for $\Upsilon = Z_1^TB^{1/2}Z_2$, the stated bound holds. We now use Lemma 40 to bound both I and III. We have, for $C$ as defined in Lemma 40, on event $\mathcal{B}_1 \cap \mathcal{B}_3$, the corresponding bounds. Thus, on event $\mathcal{B}_1 \cap \mathcal{B}_2 \cap \mathcal{B}_3$ and for $\tau_B := \operatorname{tr}(B)/n$, the combined bound holds.
On event $\mathcal{B}_6$, we have, for $D_1$ as defined in Lemma 5, the remaining bound. The theorem thus holds by the union bound.