An $\{l_1,l_2,l_{\infty}\}$-Regularization Approach to High-Dimensional Errors-in-variables Models

Several new estimation methods have recently been proposed for the linear regression model with observation error in the design. Different assumptions on the data generating process have motivated different estimators and analyses. In particular, the literature has considered (1) observation errors in the design uniformly bounded by some $\bar \delta$, and (2) zero-mean independent observation errors. Under the first assumption, the rates of convergence of the proposed estimators depend explicitly on $\bar \delta$, while the second assumption has been applied when an estimator for the second moment of the observation error is available. This work proposes and studies two new estimators which, compared to other procedures for regression models with errors in the design, exploit an additional $l_{\infty}$-norm regularization. The first estimator is applicable when both (1) and (2) hold but does not require an estimator for the second moment of the observation error. The second estimator is applicable under (2) and requires an estimator for the second moment of the observation error. Importantly, we impose no assumption on the accuracy of this pilot estimator, in contrast to previously known procedures. As in recent proposals, we allow the number of covariates to be much larger than the sample size. We establish the rates of convergence of the estimators and compare them with the bounds obtained for related estimators in the literature. These comparisons yield insights into the interplay between the assumptions and the achievable rates of convergence.


Introduction
Several new estimation methods have recently been proposed for the linear regression model with observation error in the design. Such problems arise in a variety of applications, see [7,6,9,10]. In this work we consider the following regression model with observation error in the design:
$$y = X\theta^* + \xi, \qquad Z = X + W.$$
Here the random vector $y \in \mathbb{R}^n$ and the random $n \times p$ matrix $Z$ are observed, the $n \times p$ matrix $X$ is unknown, $W$ is an $n \times p$ random noise matrix, and $\xi \in \mathbb{R}^n$ is a random noise vector. The vector of unknown parameters of interest is $\theta^*$, which is assumed to belong to a given convex subset $\Theta$ of $\mathbb{R}^p$ characterizing some prior knowledge about $\theta^*$ (potentially $\Theta = \mathbb{R}^p$). Similarly to the recent literature on this topic, we consider the setting where the dimension $p$ can be much larger than the sample size $n$ and the vector $\theta^*$ is $s$-sparse, which means that it has no more than $s$ non-zero components.
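To make the setting concrete, the following is a minimal simulation sketch, assuming the usual additive form of the errors-in-variables model ($y = X\theta^* + \xi$, $Z = X + W$) with Gaussian noise. The function name `simulate_eiv`, the default sizes, and the noise levels are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def simulate_eiv(n=100, p=200, s=5, sigma=0.5, sigma_w=0.3, seed=0):
    """Draw one sample from y = X theta* + xi, Z = X + W (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))             # true design, not observed
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0                        # s-sparse parameter vector
    xi = sigma * rng.standard_normal(n)         # noise in the regression equation
    W = sigma_w * rng.standard_normal((n, p))   # noise in the design
    y = X @ theta_star + xi
    Z = X + W                                   # observed, contaminated design
    return y, Z, X, theta_star

y, Z, X, theta_star = simulate_eiv()
print(y.shape, Z.shape, int(np.count_nonzero(theta_star)))
```

Only `y` and `Z` would be available to an estimator; `X` and `theta_star` are returned solely so that estimation errors can be evaluated in a simulation.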
The need for new estimators under errors in the design arises from the fact that standard estimators (e.g. Lasso and Dantzig selector) might become unstable, see [7]. To deal with this framework, various assumptions have been considered, leading to different estimators.
A classical assumption in the literature is a uniform boundedness condition on the errors in the design, namely, $|W|_\infty \le \bar\delta$ almost surely, where $|\cdot|_q$ denotes the $\ell_q$-norm for $1 \le q \le \infty$. Note that this assumption allows for various dependences between the errors in the design. In this setting, the Matrix Uncertainty selector (MU selector), which is robust to the presence of errors in the design, is proposed in [7]. The MU selector $\hat\theta^{MU}$ is defined as a solution of the minimization problem
$$\min\Big\{|\theta|_1 : \theta \in \Theta,\ \Big|\tfrac{1}{n} Z^T(y - Z\theta)\Big|_\infty \le \mu|\theta|_1 + \tau\Big\}, \tag{2}$$
where the parameters $\mu$ and $\tau$ depend on the level of the noises of $W$ and $\xi$ respectively. Under appropriate choices of these parameters and suitable assumptions on $X$, it was shown in [7] that with probability close to 1,
$$|\hat\theta^{MU} - \theta^*|_q \le C s^{1/q}\{\bar\delta + \bar\delta^2\}|\theta^*|_1 + C s^{1/q}\sqrt{\frac{\log p}{n}}, \qquad 1 \le q \le \infty. \tag{3}$$
Here and in what follows we denote by the same symbol $C$ different positive constants that do not depend on $\theta^*$, $s$, $n$, $p$, $\bar\delta$. The result (3) implies consistency as the sample size $n$ tends to infinity provided that the error in the design goes to zero sufficiently fast to offset $s^{1/q}|\theta^*|_1$, and the number of variables $p$ and the sparsity $s$ of $\theta^*$ do not grow too fast relative to the sample size $n$.
An alternative assumption considered in the literature is that the entries of the random matrix $W$ are independent with zero mean, the values $\sigma_j^2 = E(W_{ij}^2)$, $j = 1, \dots, p$, are finite, and data-driven estimators $\hat\sigma_j^2$ of $\sigma_j^2$ are available, converging at an appropriate rate. This assumption motivated the idea of compensating for the bias incurred by using the observable $Z^T Z$ instead of the unobservable $X^T X$ in (2), thanks to the estimates of $\sigma_j^2$. This compensated MU selector, introduced in [8] and denoted by $\hat\theta^{cMU}$, is defined as a solution of the minimization problem
$$\min\Big\{|\theta|_1 : \theta \in \Theta,\ \Big|\tfrac{1}{n} Z^T(y - Z\theta) + \hat D\theta\Big|_\infty \le \mu|\theta|_1 + \tau\Big\},$$
where $\hat D$ is the diagonal matrix with entries $\hat\sigma_j^2$, and $\mu > 0$ and $\tau > 0$ are constants chosen according to the level of the noises and the accuracy of the $\hat\sigma_j^2$.
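The bias that the compensation is meant to remove can be checked numerically. The sketch below is only an illustration under assumed Gaussian data: it compares the Gram matrix of the observed design $Z$ with that of the unobserved design $X$, before and after subtracting a diagonal matrix of noise variances. For simplicity the true variances are used in place of estimates, and all sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma_w = 20000, 5, 0.5

X = rng.standard_normal((n, p))              # unobserved design
W = sigma_w * rng.standard_normal((n, p))    # independent zero-mean design noise
Z = X + W                                    # observed design

gram_X = X.T @ X / n
gram_Z = Z.T @ Z / n
D = sigma_w**2 * np.eye(p)                   # assumption: true variances stand in for estimates

# Largest entrywise gap between the observable and the unobservable Gram matrix
bias_raw = np.max(np.abs(gram_Z - gram_X))
# Same gap after subtracting the diagonal compensation term
bias_comp = np.max(np.abs(gram_Z - D - gram_X))
print(bias_raw, bias_comp)
```

Without compensation the gap stays close to the noise variance on the diagonal regardless of $n$, whereas the compensated gap shrinks as $n$ grows.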
Rates of convergence of the compensated MU selector were established in [8]. Importantly, the compensated MU selector can be consistent as the sample size $n$ increases even if the error in the design does not vanish. This is in contrast to the case of the MU selector, where the bounds are small only if the bound $\bar\delta$ on the design error is small. In particular, under regularity conditions, when $\theta^*$ is $s$-sparse, it is shown in [8] that the bound (4) holds with probability close to 1.
Under the same alternative assumption, a conic programming based estimator $\hat\theta^C$ has been recently proposed and analyzed in [1]. The estimator $\hat\theta^C$ is defined as the first component of any solution of the optimization problem
$$\min_{(\theta,t)\in\mathbb{R}^p\times\mathbb{R}_+}\Big\{|\theta|_1 + \lambda t : \Big|\tfrac{1}{n} Z^T(y - Z\theta) + \hat D\theta\Big|_\infty \le \mu t + \tau,\ |\theta|_2 \le t\Big\}, \tag{5}$$
where $\lambda$, $\mu$ and $\tau$ are some positive tuning constants. Akin to $\hat\theta^{cMU}$, this estimator compensates for the bias by using the estimators $\hat\sigma_j^2$ of $\sigma_j^2$. However, it exploits a combination of $\ell_1$- and $\ell_2$-norm regularization to be more adaptive. It was shown to attain a bound as in (4) and to be computationally feasible since it is cast as a tractable convex optimization problem (a second-order cone programming problem). Moreover, under mild additional conditions, with probability close to 1, the estimator (5) achieves the improved bound (6), provided that $\hat D$ converges to $D$ in sup-norm at the rate $\sqrt{(\log p)/n}$. It is shown in [1] that the rate of convergence in (6) is minimax optimal in the considered model.
There have been other approaches to the errors-in-variables model, usually exploiting some knowledge about the vector $\theta^*$, see [6,9,2,3]. Assuming that $|\theta^*|_1$ is known, [6] proposed an estimator $\hat\theta'$ defined as the solution of a non-convex program which can be well approximated by an iterative relaxation procedure. In the case where the entries of the regression matrix $X$ are zero-mean subgaussian and $\theta^*$ is $s$-sparse, under appropriate assumptions, a bound on the error in the $\ell_2$-norm ($q = 2$) is shown in [6] to hold with probability close to 1. In that bound, the value $C(\theta^*)$ depends on $\theta^*$, so that there is no guarantee that the estimator attains the optimal bound as in (6). Assuming that the sparsity $s$ of $\theta^*$ is known and that the non-zero components of $\theta^*$ are suitably separated from zero, an orthogonal matching pursuit algorithm to estimate $\theta^*$ is introduced in [2,3]. Focusing as in [6] on the particular case where the entries of the regression matrix $X$ are zero-mean subgaussian, it is shown in [2,3] that this last estimator satisfies a bound analogous to (6), as well as a consistent support recovery result.
The main purpose of this work is to show that an additional regularization term based on the $\ell_\infty$-norm leads to improved rates of convergence in several situations. We propose two new estimators for $\theta^*$. The first proposal is applicable under a new combination of the assumptions mentioned above. Namely, we assume that the components of the errors in the design are uniformly bounded by $\bar\delta$ as in (1), and that the rows of $W$ are independent and with zero mean. However, we will neither assume that a data-driven estimator $\hat D$ is available, nor that specific features of $\theta^*$ are known (e.g. $s$ or $|\theta^*|_1$). The estimator is defined as a solution of a regularized optimization problem which uses simultaneously $\ell_1$, $\ell_2$, and $\ell_\infty$ regularization functions. It can be cast as a convex optimization problem and the solution can be easily computed. We study its rates of convergence in various norms in Section 3. One of the conclusions is that for $\bar\delta \gg \sqrt{(\log p)/n}$ the new estimator has improved rates of convergence compared to the MU selector. Furthermore, note that the conic estimator $\hat\theta^C$ studied in [1] can also be applied. Indeed, our setting can be embedded into that of [1] with $\hat D$ being the identically zero $p \times p$ matrix, which means that we have an estimator of each $\sigma_j^2$ with an error bounded by $\bar\delta^2$. Comparing the bounds yields that the conic estimator $\hat\theta^C$ achieves the same rate as our new estimator if $\bar\delta$ is smaller than or of the order $((\log p)/n)^{1/4}$. However, there is no bound for $\hat\theta^C$ available when $\bar\delta \gg ((\log p)/n)^{1/4}$.
The second estimator we propose applies to the same setting as in [1]. The idea of taking advantage of an additional $\ell_\infty$-norm regularization can be used to improve the conic estimator $\hat\theta^C$ of [1] whenever the rate of convergence of the estimator $\hat D$ for $\sigma_j^2$, $j = 1, \dots, p$, is slower than $\sqrt{(\log p)/n}$. This motivates us to propose and analyze a modification of the conic estimator. We derive new rates of convergence that can lead to improvements. However, we acknowledge that in the case considered in [1], where the rate of convergence of $\hat D$ is $\sqrt{(\log p)/n}$, there is no gain in the rates of convergence when using the additional $\ell_\infty$-norm regularization.
The paper is organized as follows. Section 2 contains the notation, main assumptions and some preliminary lemmas needed to determine threshold constants in the algorithms. The definition and properties of our first estimator are given in Section 3 whereas those of our second procedure can be found in Section 4. Section 5 contains simulation results. Some auxiliary lemmas are relegated to an appendix.

Notation, assumptions, and preliminary lemmas
In this section, we introduce the assumptions which will be required to derive the rates of convergence of the proposed estimators. One set of conditions pertains to the design matrix and the second to the errors in the model. We also state preliminary lemmas related to the stochastic error terms. We start by introducing some notation.
A random vector $\zeta \in \mathbb{R}^p$ is said to be sub-gaussian with variance parameter $\gamma^2$ if the inner products $(\zeta, v)$ are $\gamma$-sub-gaussian for any $v \in \mathbb{R}^p$ with $|v|_2 = 1$.

Design matrix
The performance of the estimators that we consider below is influenced by the properties of the Gram matrix $\Psi = \frac{1}{n} X^T X$.
We will assume that:
In order to characterize the behavior of the design matrix, we set
where the $X_{ij}$ are the elements of the matrix $X$, and we consider the sensitivity characteristics related to the Gram matrix $\Psi$. For $u > 0$, define the cone
and, for an integer $s \in [1, p]$, the $\ell_q$-sensitivity (cf. [4]) is defined as follows:
As in [4], we use here the sensitivities to derive the rates of convergence of estimators under sparsity. Importantly, as shown in [4], the approach based on sensitivities is more general than that based on the restricted eigenvalue or the coherence conditions, see also [8,5,1]. In particular, under those conditions, we have $\kappa_q(s, u) \ge c\, s^{-1/q}$ for some constant $c > 0$, which implies the usual optimal bounds for the errors.

Disturbances
Next we turn to the error W in the design and the error ξ in the regression equation. We will make the following assumptions.
(A2) The elements of the random vector $\xi$ are independent zero-mean sub-gaussian random variables with variance parameter $\sigma^2$.
(A3) The rows $w_i$, $i = 1, \dots, n$, of the noise matrix $W$ are independent zero-mean sub-gaussian random vectors with variance parameter $\sigma_*^2$. Furthermore, $W$ is independent of $\xi$.

Bounds on the stochastic error terms
We now state some useful lemmas from [1] and [8] that provide bounds on various stochastic error terms that play a role in our analysis. We state them here because they introduce the thresholds $\delta_i$, $\delta_i'$ that will be used in the definition of the estimators. In what follows, $D$ is the diagonal matrix with diagonal elements $\sigma_j^2$, $j = 1, \dots, p$, and for a square matrix $A$, we denote by $\mathrm{Diag}\{A\}$ the matrix with the same dimensions as $A$, the same diagonal elements, and all off-diagonal elements equal to zero.
Lemma 1. Let $0 < \varepsilon < 1$ and assume (A1)-(A3). Then, with probability at least $1 - \varepsilon$ (for each event), and for an integer $N$, δ where $\gamma_0$, $t_0$ are positive constants depending only on $\sigma$, $\sigma_*$.
Lemma 2. Let $0 < \varepsilon < 1$, $\theta^* \in \mathbb{R}^p$ and assume (A1)-(A3). Then, with probability at least $1 - \varepsilon$, In addition, with probability at least $1 - \varepsilon$, and $\gamma_2$, $t_2$ are positive constants depending only on $\sigma_*$.
The proofs of Lemmas 1 and 2 can be found in [8] and [1] respectively.

$\{\ell_1, \ell_2, \ell_\infty\}$-MU selector
In this section, we define and analyze our first estimator. It can be seen as a compromise between the MU selector (2) and the conic estimator (5), achieved thanks to an additional $\ell_\infty$-norm regularization.
In the setting that we consider now, the estimate $\hat D$ is not available but the rows of the design error matrix $W$ are independent with mean 0, and its entries are uniformly bounded. Formally, in this section, we make the following assumption.
(A4) Almost surely, $|W|_\infty \le \bar\delta$.
Thus, Assumptions (A1)-(A4) imply the assumptions in [7]. However, they neither imply nor are implied by the assumptions in [8]. That is, they form an intermediate set of conditions relative to the original assumptions for the MU selector in [7] and to those for the compensated MU selector in [8]. Importantly, we do not assume that accurate estimators of the $\sigma_j^2$ are available.
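We consider the estimator $\hat\theta$ such that $(\hat\theta, \hat t, \hat u) \in \mathbb{R}^p \times \mathbb{R}_+ \times \mathbb{R}_+$ is a solution of the following minimization problem
$$\min_{\theta,t,u}\Big\{|\theta|_1 + \lambda t + \nu u : \Big|\tfrac{1}{n} Z^T(y - Z\theta)\Big|_\infty \le \mu t + \bar\delta^2 u + \tau,\ |\theta|_2 \le t,\ |\theta|_\infty \le u\Big\}, \tag{8}$$
where $\lambda > 0$ and $\nu > 0$ are tuning constants and the minimum is taken over $(\theta, t, u) \in \mathbb{R}^p \times \mathbb{R}_+ \times \mathbb{R}_+$. This estimator $\hat\theta$ will be further referred to as the $\{\ell_1, \ell_2, \ell_\infty\}$-MU selector.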
The estimator above attempts to mimic the conic estimator (5) without an estimator $\hat D$ for $\sigma_j^2$, $j = 1, \dots, p$. In order to make $\theta^*$ feasible for (8), the contribution of the unknown term $\frac{1}{n}\mathrm{Diag}(W^T W)\theta^*$ needs to be bounded. This is precisely the role of the extra term $\bar\delta^2 u$ in the constraint, since $|\theta|_\infty \le u$ and $|\frac{1}{n}\mathrm{Diag}(W^T W)|_\infty \le \bar\delta^2$ almost surely. Note that the use of $u$ and $t$ instead of $|\theta|_\infty$ and $|\theta|_2$ in the constraint makes (8) a convex programming problem.
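The role of the term $\bar\delta^2 u$ can be illustrated with a short numerical check. The sketch below assumes design errors drawn uniformly from $[-\bar\delta, \bar\delta]$, which is just one convenient way of producing entries bounded by $\bar\delta$. It verifies that the diagonal contribution $\frac{1}{n}\mathrm{Diag}(W^T W)\theta$ is bounded in sup-norm by $\bar\delta^2 u$ with $u = |\theta|_\infty$. The sizes, the seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, delta_bar = 50, 8, 0.2

# Design errors with |W_ij| <= delta_bar (uniform noise is an assumption made here)
W = rng.uniform(-delta_bar, delta_bar, size=(n, p))
theta = rng.standard_normal(p)

diag_term = np.diag(W.T @ W) / n           # entries (1/n) * sum_i W_ij^2
lhs = np.max(np.abs(diag_term * theta))    # sup-norm of (1/n) Diag(W^T W) theta
u = np.max(np.abs(theta))                  # smallest admissible u, namely |theta|_inf
rhs = delta_bar**2 * u

print(lhs, rhs, lhs <= rhs)
```

The inequality holds for any realization of `W` with bounded entries, since every diagonal entry is an average of squares that are each at most $\bar\delta^2$.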
This new estimator exploits Assumptions (A2)-(A4) to achieve a rate of convergence that is intermediary relative to the rate of the MU selector and to that of the conic estimator.
Note that $\mu$ and $\tau$ are of order $\sqrt{(\log p)/n}$. The next theorem summarizes the performance of the estimator defined by solving (8).
Under the same assumptions with $q = 1$, the prediction error admits the following bound, with the same probability:
Proof. We proceed in three steps.
Step 2 provides a bound on $|\frac{1}{n} X^T X \Delta|_\infty$.
Step 3 establishes the rates of convergence stated in the theorem. We work on the event of probability at least $1 - 7\varepsilon$ where all the inequalities in Lemmas 1 and 2 are realized. Throughout the proof, $J = \{j : \theta_j^* \neq 0\}$. We often make use of the inequalities $|\theta|_\infty \le |\theta|_2 \le |\theta|_1$, $\forall \theta \in \mathbb{R}^p$.
Remark 1. We have stated Theorem 1 under Assumption (A4) to keep the analysis in line with the previous literature, see [7]. However, inspection of the proofs shows that a more general condition can be used. The results of Theorem 1 hold with probability at least $1 - 7\varepsilon - \varepsilon'$ if, instead of Assumption (A4), we require $W$ to satisfy $|\frac{1}{n}\mathrm{Diag}(W^T W)|_\infty \le \bar\delta^2$ with probability at least $1 - \varepsilon'$, for some $\varepsilon' > 0$.
Compared to [7], the results in Theorem 1 exploit the zero mean condition on the noise matrix $W$. As in [7], the estimator is consistent as $\bar\delta$ goes to zero. In order to compare the rates in Theorem 1 with those for the MU selector, we recall that, by Theorem 3 in [7], the MU selector satisfies
$$|\hat\theta^{MU} - \theta^*|_q \le C s^{1/q}\sqrt{\frac{\log(c'p/\varepsilon)}{n}} + C s^{1/q}(\bar\delta + \bar\delta^2)|\theta^*|_1$$
with probability close to 1. While both rates share some terms, a term of order $s^{1/q}\bar\delta|\theta^*|_1$ appears only in the rate for the MU selector, whereas a term of order $s^{1/q}\sqrt{\log(c'p/\varepsilon)/n}\,|\theta^*|_1$ appears only for the $\{\ell_1, \ell_2, \ell_\infty\}$-MU selector. Therefore, the improvement upon the original MU selector is achieved whenever $\bar\delta \gg \sqrt{\log(c'p/\varepsilon)/n}$.
If the additional condition $\bar\delta^2 + \sqrt{\log(c'p/\varepsilon)/n} \le c_1 \kappa_1(s, 1 + \lambda + \nu)$ holds, we can use the bound (10) and a better accuracy is achieved by the proposed estimator. In particular, $|\theta^*|_1$ no longer drives the rate of convergence. The impact of $\bar\delta$ on this rate is in the term $s^{1/q}\bar\delta^2|\theta^*|_\infty$, as opposed to the term $s^{1/q}(\bar\delta + \bar\delta^2)|\theta^*|_1$ for the MU selector. Furthermore, the rate of convergence of the new estimator also has a term of the form $|\theta^*|_2\, s^{1/q}\sqrt{\log(c'p/\varepsilon)/n}$. Thus the new estimator obtains a better accuracy by exploiting additional assumptions together with the fact that $\bar\delta|\theta^*|_1$ is of larger order than $\sqrt{\log(c'p/\varepsilon)/n}\,|\theta^*|_2$, which holds whenever $\bar\delta \gg \sqrt{\log(c'p/\varepsilon)/n}$. Finally, the impact of going down from the $\ell_1$-norm to the $\ell_2$- or $\ell_\infty$-norms is not negligible either. For example, if all non-zero components of $\theta^*$ are equal to the same constant $a > 0$, we have $|\theta^*|_1 = sa$ while $|\theta^*|_2 = a\sqrt{s}$, and $|\theta^*|_\infty = a$. Then, the comparison in (17) reduces to comparing $s^{1/q}\bar\delta^2$ versus $s^{1+1/q}(\bar\delta + \bar\delta^2)$, featuring the maximum contrast between the two rates.
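The gap between the three norms in this equal-magnitude example is easy to verify numerically. The following sketch uses arbitrary illustrative values ($s = 16$, $p = 100$, $a = 0.5$) that are not taken from the paper.

```python
import numpy as np

s, p, a = 16, 100, 0.5        # illustrative values only
theta = np.zeros(p)
theta[:s] = a                 # s non-zero components, all equal to a

l1 = np.sum(np.abs(theta))    # should equal s * a
l2 = np.linalg.norm(theta)    # should equal a * sqrt(s)
linf = np.max(np.abs(theta))  # should equal a

print(l1, l2, linf)
print(l1 / linf, l1 / l2)     # ratios s and sqrt(s)
```

The ratio $|\theta^*|_1 / |\theta^*|_\infty$ grows linearly in $s$, which is the source of the extra factor $s$ on the MU selector side of the comparison.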
Finally, note that the conic estimator $\hat\theta^C$ studied in [1] can also be applied under the assumptions of this section. Indeed, our setting can be embedded into that of [1] with $\hat D$ being the identically zero $p \times p$ matrix, which means that we have an estimator of each $\sigma_j^2$ with an error bounded by $b = \bar\delta^2$. The results in [1] assume $b = C\sqrt{(\log p)/n}$ and do not apply to designs with $b$ of larger order. Comparing the bound (10) in Theorem 1 to the bound (6) yields that the conic estimator $\hat\theta^C$ achieves the same rate as our new estimator whenever $\bar\delta$ is smaller than or of the order $((\log p)/n)^{1/4}$. However, there is no bound for $\hat\theta^C$ available when $\bar\delta \gg ((\log p)/n)^{1/4}$.

$\{\ell_1, \ell_2, \ell_\infty\}$-compensated MU selector
In this section, we discuss a modification of the conic estimator proposed in [1]. We introduce an additional $\ell_\infty$-norm regularization to better adapt to the estimation error in $\hat D$. As discussed in the introduction, this is beneficial when the rate of convergence of $\hat D$ to $D$ is slower than $\sqrt{(\log p)/n}$, which is not covered by [1]. Here we consider the same assumptions as in [1] with the only difference that now we allow for any rate of convergence of $\hat D$ to $D$. Thus, we replace Assumption (A4) by the following assumption on the availability of estimators for $\sigma_j^2$, $j = 1, \dots, p$.
(A5) There exist statistics $\hat\sigma_j^2$ and positive numbers $b(\varepsilon)$ such that for any $0 < \varepsilon < 1$, we have
$$P\Big[\max_{j=1,\dots,p}|\hat\sigma_j^2 - \sigma_j^2| \ge b(\varepsilon)\Big] \le \varepsilon.$$
In what follows, we fix $\varepsilon$ and set
We are particularly interested in cases where $\beta$ is of larger order than $\sqrt{(\log p)/n}$. To define the estimator, we consider the following minimization problem:
$$\min_{\theta,t,u}\Big\{|\theta|_1 + \lambda t + \nu u : \Big|\tfrac{1}{n} Z^T(y - Z\theta) + \hat D\theta\Big|_\infty \le \mu t + \beta u + \tau,\ |\theta|_2 \le t,\ |\theta|_\infty \le u\Big\}. \tag{18}$$
Here, $\lambda > 0$ and $\nu > 0$ are tuning constants and the minimum is taken over $(\theta, t, u) \in \mathbb{R}^p \times \mathbb{R}_+ \times \mathbb{R}_+$.
Let $(\hat\theta, \hat t, \hat u)$ be a solution of (18). We take $\hat\theta$ as the estimator of $\theta^*$ and call it the $\{\ell_1, \ell_2, \ell_\infty\}$-compensated MU selector. The rates of convergence of this estimator are given in the next theorem.
Theorem 2 generalizes the results in [1] to estimators $\hat D$ that converge with a rate $b(\varepsilon)$ of larger order than $\sqrt{(\log p)/n}$. At the same time, if $b(\varepsilon)$ is smaller than $\sqrt{(\log p)/n}$, both the conic estimator $\hat\theta^C$ of [1] and the $\{\ell_1, \ell_2, \ell_\infty\}$-compensated MU selector achieve the same rate of convergence.
For designs for which condition (20) does not hold, the conclusions of Theorem 2 need to be slightly modified, as shown in the next theorem.

Simulations
In our first set of simulations, we illustrate the finite sample performance of the proposed estimator by setting $\lambda = \nu \in \{0.25, 0.5, 0.75, 1\}$. The $\{\ell_1, \ell_2, \ell_\infty\}$-compensated MU selector will be denoted by $\{\ell_1, \ell_2, \ell_\infty\}$. We compare its performance with other recent proposals in the literature, namely the conic estimator (denoted as Conic ($\lambda$) for $\lambda = 0.25, 0.5, 0.75, 1$), and the compensated MU selector (cMU). We also provide the (infeasible) Dantzig selector which knows $X$ (Dantzig X) and the Dantzig selector that uses only $Z$ (Dantzig Z) as additional benchmarks for the performance.
Tables 1 and 2 provide the performance of the proposed estimator when $\lambda = \nu$ and the performance of the various benchmarks. As discussed in the literature, ignoring the errors-in-variables issue can lead to worse performance, as seen from the performance of Dantzig Z compared to the (infeasible) Dantzig X. The conic estimator performs better than the compensated MU selector (cMU) when $\lambda \in \{0.5, 0.75, 1\}$. The comparison of the proposed estimator and the conic estimator is easier to establish as we can parametrize them by $\lambda$ (as we set $\lambda = \nu$). In this case the conic estimator penalizes more aggressively the uncertainty of not knowing $\sigma_j^2$. In essentially all cases the proposed estimator yields improvements. The introduction of $\ell_\infty$-norm regularization seems to alleviate regularization bias. Nonetheless, when setting $\lambda = 0.25$ both the conic estimator and the proposed estimator fail in the experiment. This failure occurs because the penalty is not strong enough to control $t - |\theta|_2$ and $u - |\theta|_\infty$, which leads to a large right-hand side $\mu t + \beta u + \tau$ in the constraint $|\frac{1}{n} Z^T(y - Z\theta) + \hat D\theta|_\infty \le \mu t + \beta u + \tau$ in (18), and similarly to a large right-hand side $\mu t + \tau$ in (5). In turn, this leads to substantial regularization bias and therefore underfitting. In fact, detailed inspection of the estimators in that case reveals that the coefficients are very close to zero for both the conic and the proposed estimator.
In the second set of simulations, we explore the performance of the proposed estimator for the case $\lambda \neq \nu$. Moreover, we also study a modified estimator that contains safeguard constraints. These constraints aim to mitigate the problem discussed above. The safeguard constraints are described in Remark 2 below. We denote by $\{\ell_1, \ell_2, \ell_\infty\}^*$ the estimator computed with the safeguards.
Remark 2 (Safeguard Constraints). In order to further bound $t$ and $u$, we can add constraints that exploit the fact that $|\cdot|_q \le |\cdot|_1$ for $q \ge 1$. Therefore, the constraints $w = \sum_{j=1}^p \{\theta_j^+ + \theta_j^-\}$, $t \le w$, and $u \le w$ preserve the convexity of the optimization problem and can potentially yield additional performance.
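The safeguards rest on the norm ordering $|\theta|_\infty \le |\theta|_2 \le |\theta|_1$. The sketch below is a hypothetical helper, not the implementation used in the simulations. It assumes that $w$ stands for $|\theta|_1$ and checks whether a candidate triple $(\theta, t, u)$ satisfies $t \le w$ and $u \le w$. The function name `satisfies_safeguards` and the test vector are illustrative.

```python
import numpy as np

def satisfies_safeguards(theta, t, u):
    """Check t <= w and u <= w, with w taken to be |theta|_1 (an assumption)."""
    w = np.sum(np.abs(theta))
    return bool(t <= w and u <= w)

theta = np.array([0.0, 1.5, -2.0, 0.0, 0.5])

# Tight choice t = |theta|_2, u = |theta|_inf always passes by the norm ordering
tight = satisfies_safeguards(theta, np.linalg.norm(theta), np.max(np.abs(theta)))

# Slack values of t and u far above |theta|_1 are ruled out by the safeguards
slack = satisfies_safeguards(theta, 10.0, 10.0)

print(tight, slack)
```

In this reading, the safeguards never exclude the tight values of $t$ and $u$, but they cut off solutions in which $t$ or $u$ drift far above the norms they are meant to track.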
We consider the same design as before and explore some combinations of values for both proposed estimators (with and without the safeguard constraints). Tables 3 and 4 show the performance for different values of $\lambda$ and $\nu$. We note that these parameters seem to have a different impact on the finite sample performance even if $\lambda + \nu$ is kept constant. Importantly, we observe that the addition of safeguard constraints virtually always leads to improvements, albeit small ones (sometimes even zero) for most of the tested parameter values. In the case $\lambda < \nu$, using safeguard constraints makes almost no difference and the overall performance of both estimators is better. In contrast, the estimators perform worse when $\lambda > \nu$, and the safeguard constraints lead to improvements. Finally, as expected, the safeguard constraints substantially improve the performance when $\lambda = \nu = 0.25$. In that case, the performance becomes comparable to that of the cMU estimator.
Table 4. Simulation results for 100 replications. For each estimator we provide average bias (Bias), average root-mean squared error (RMSE), and average prediction risk (PR).
Essentially, the safeguard constraints help to avoid severe underfitting. They are very helpful when the performance is below what can be achieved. Nonetheless, we recommend keeping them in all cases, as they do not negatively affect the estimator and the additional computational burden seems minimal.