Preconditioning the Lasso for sign consistency

Sign consistency of the Lasso requires the stringent irrepresentable condition. This paper examines whether preconditioning can circumvent this condition. Let $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^n$ satisfy the standard linear regression equation. Instead of computing the Lasso with $(X, Y)$, preconditioning first left multiplies by $F \in \mathbb{R}^{n \times n}$ and then computes the Lasso with $(FX, FY)$. While others have proposed preconditioning for other purposes, we provide the first results that show $FX$ can satisfy the irrepresentable condition even when $X$ fails to satisfy the condition. Preconditioning the Lasso creates a new estimator that is sign consistent in a wider variety of settings. Importantly, left multiplying the regression equation by $F$ does not change $\beta$, the vector of unknown coefficients. However, left multiplying this equation by $F$ often inflates the variance of the errors. We propose a class of preconditioners to balance these costs and benefits.


Introduction
Recent breakthroughs in information technology have provided new experimental capabilities in astronomy, biology, chemistry, neuroscience, and several other disciplines. Many of these new measurement devices create data sets with many more "measurements" than units of observation. For example, due to experimental constraints, both fMRI and microarray experiments often include tens or hundreds of people. However, the fMRI and microarray technologies can simultaneously measure thousands to millions of different pieces of information for each individual. Sparse high dimensional regression aims to select a small set of measurements that relate to an outcome of interest.
The Lasso (Tibshirani, 1996) is one of the most popular techniques for sparse high dimensional regression because it is the solution to a convex optimization problem, allowing for fast algorithms and assurances of global optimality. A rich theoretical literature describes the conditions for the Lasso to consistently estimate the regression coefficients. Because of the Lasso's ability to select a sparse solution, it is of particular interest to understand when the Lasso can select the true nonzero coefficients in the linear regression model. Stated loosely, the Lasso performs well in this respect when the columns of $X$ are weakly correlated. This concept is formalized with sign consistency and the irrepresentable condition (see Section 2).
It is well known that the Ordinary Least Squares (OLS) estimator performs poorly when the columns of the design matrix are highly correlated. However, more samples overcome this problem; OLS is still consistent. With the Lasso, the detrimental effects of correlation are more severe. If the columns of the design matrix are correlated in a way that violates the irrepresentable condition, then the Lasso will fail to estimate the correct signs and the estimation performance will not improve by increasing the number of samples or increasing the signal to noise ratio. This paper demonstrates that, for the purposes of the Lasso, the correlation in the design matrix is malleable and can be diminished (at the expense of marginally more variance) by preconditioning, a classical technique to accelerate solvers of systems of equations. This paper demonstrates that in many sparse regression settings, preconditioning the Lasso produces a better estimator of the sparsity pattern. The next section gives a surprising simulation that shows how correcting for heteroskedasticity with generalized least squares (GLS) can act as a bad preconditioner and degrade the estimation performance of the Lasso. This contrasts with the classical results that demonstrate how GLS improves upon the estimation performance of ordinary least squares (OLS).

Some notation and a surprising simulation
Suppose the regression model

$$Y = X\beta^* + \epsilon, \tag{1}$$

where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta^* \in \mathbb{R}^p$, and $\epsilon \sim N(0, \Sigma)$. By observing $Y$ and $X$, we are interested in estimating the support of $\beta^*$,

$$S = \{j : \beta^*_j \neq 0\}. \tag{2}$$

The rest of the paper assumes $\Sigma = \sigma^2 I_n$ (where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix). However, this motivating simulation uses a heteroskedastic model where $\Sigma$ is a diagonal matrix with diagonal entries $\sigma_i^2$. Each $\sigma_i$ is an independent draw from the Gamma distribution. In all simulations $E(\sigma_i) = 1$. The horizontal axis in Figure 1 represents the standard deviation of $\sigma_i$ (i.e. the amount of heteroskedasticity).

Figure 1 compares two techniques under this heteroskedastic model. The first estimator does not correct for heteroskedasticity. It is the standard Lasso estimator that we study in the rest of the paper,

$$\hat\beta(\lambda) = \arg\min_b \tfrac{1}{2} \|Y - Xb\|_2^2 + \lambda \|b\|_1. \tag{3}$$

To correct for the heteroskedastic or correlated errors, GLS left multiplies the regression equation (1) by $\Sigma^{-1/2}$,

$$\Sigma^{-1/2} Y = (\Sigma^{-1/2} X)\beta^* + \Sigma^{-1/2}\epsilon. \tag{4}$$

Then, the error term becomes a vector of iid normal variables, $\Sigma^{-1/2}\epsilon \sim N(0, I)$.
Instead of computing the Lasso with $(X, Y)$ as in Equation (3), use $(\Sigma^{-1/2}X, \Sigma^{-1/2}Y)$ and define the resulting estimator as $\hat\beta_{GLS+Lasso}(\lambda)$. The vertical axis of Figure 1 reports the proportion of fifty simulations in which there exists a tuning parameter $\lambda$ such that the support of the estimator aligns perfectly with the true support $S$. At the very left, the model is homoskedastic and the estimators are equivalent. As the heteroskedasticity increases, one expects $\hat\beta_{GLS+Lasso}(\lambda)$ to outperform $\hat\beta(\lambda)$. However, the performance of $\hat\beta_{GLS+Lasso}(\lambda)$ quickly degrades, failing to estimate the correct sparsity pattern. In this simulation, correcting for heteroskedasticity degrades estimation. This surprising result happens because $\Sigma^{-1/2}$ in Equation (4) acts as a preconditioner, and a bad one: it makes the design matrix ill-conditioned. In least squares regression, an ill-conditioned design matrix does not create bias. As such, GLS can improve the estimation performance in the classical setting. However, in penalized least squares regression, an ill-conditioned design matrix prevents support recovery; any decrease in the variance is offset by an increase in the support recovery bias.
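The mechanism described above can be sketched numerically. The following is a minimal illustration, not the simulation behind Figure 1: it assumes a Gaussian design, a Gamma distribution with shape 1 and scale 1 for $\sigma_i$ (so that $E(\sigma_i) = 1$), and hypothetical sizes `n` and `p`. It shows that dividing each row by $\sigma_i$ restores iid errors while changing the condition number of the design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50                                    # illustrative sizes (assumption)

X = rng.standard_normal((n, p))                   # well-conditioned Gaussian design
sigma = rng.gamma(shape=1.0, scale=1.0, size=n)   # E(sigma_i) = 1, sd(sigma_i) = 1
z = rng.standard_normal(n)
eps = sigma * z                                   # heteroskedastic errors

# GLS with diagonal Sigma: left multiply by Sigma^{-1/2} = diag(1 / sigma_i).
w = 1.0 / sigma
X_gls = w[:, None] * X                            # rows of X rescaled by 1 / sigma_i
eps_gls = w * eps                                 # errors are iid N(0, 1) again

cond_raw = np.linalg.cond(X)
cond_gls = np.linalg.cond(X_gls)
print(f"condition number of X:              {cond_raw:.1f}")
print(f"condition number of Sigma^(-1/2) X: {cond_gls:.1f}")
```

Because the weights $1/\sigma_i$ are unrelated to the rows of $X$ in this sketch, a few rows with small $\sigma_i$ dominate the rescaled design, which is the sense in which GLS acts as a preconditioner here.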
Just as there are several settings where the original data has heteroskedastic or correlated errors, there are several settings where the original data contains an ill-conditioned design matrix. The techniques described in this paper correct for ill-conditioned design matrices. For example, perhaps some rows of $X$ have a much larger $\ell_2$ length than some other rows. Where GLS "whitens" the errors (with the side effect of making the design ill-conditioned), preconditioning "whitens" the design matrix (with the side effect of making the errors heteroskedastic and correlated). In this sense, GLS and preconditioning are opposite transformations. Figure 1 gives a simulation setting where penalized regression (1) can accommodate heteroskedastic errors and (2) cannot accommodate an ill-conditioned design matrix. This suggests that the classical intuition, which ignores the conditioning of the design and focuses exclusively on the distribution of the errors, does not extend to the modern settings. The rest of this paper focuses on preconditioning matrices that make the design matrix well conditioned, thus improving the estimation performance of the Lasso.
In this simulation, $\Sigma$ does not describe heteroskedastic behavior in $X$. This is why $\Sigma^{-1/2}$ acts as a poor preconditioner. However, in many applications, the underlying structure which leads to covariance or heteroskedasticity in $\epsilon$ (e.g. spatial dependence) may create covariance or heteroskedasticity in the rows of $X$. In these situations, $\Sigma^{-1/2}$ will act as a good preconditioner because it will decorrelate and normalize the rows of $X$.

Preconditioning to circumvent the irrepresentable condition
This section defines sign consistency and the irrepresentable condition, a necessary and almost sufficient condition for the Lasso to be sign consistent.
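Definition 1. The Lasso is sign consistent if there exists a sequence $\lambda_n$ such that $P\big(\mathrm{sign}(\hat\beta(\lambda_n)) = \mathrm{sign}(\beta^*)\big) \to 1$ as $n \to \infty$, where $\mathrm{sign}(\cdot)$ maps positive entries to $1$, negative entries to $-1$, and zero to $0$.
This implies that $\hat\beta(\lambda)$ can asymptotically identify the relevant and irrelevant variables when it is sign consistent. Several authors, including Meinshausen and Bühlmann (2006), Zou (2006), and Zhao and Yu (2006), have studied the sign consistency property and found a sufficient condition for sign consistency. Zhao and Yu (2006) call this assumption the irrepresentable condition. For a vector $x$, denote $\|x\|_\infty = \max_i |x_i|$.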
Definition 2. The design matrix $X$ satisfies the irrepresentable condition for $\beta^*$ if, for some constant $\eta \in (0, 1]$,

$$\left\| X_{S^c}^T X_S \left(X_S^T X_S\right)^{-1} \mathrm{sign}(\beta^*_S) \right\|_\infty \le 1 - \eta,$$

where $X_S$ and $X_{S^c}$ contain the columns of $X$ indexed by $S$ and by its complement.

In the above sufficient condition, $\eta > 0$. If this is replaced with $\eta \ge 0$, then it is a necessary condition for sign consistency (Zhao and Yu, 2006; Zou, 2006). This condition is difficult to check because it relies on the unknown set $S$. Section 2 of Zhao and Yu (2006) gives several sufficient conditions. For example, if $|\mathrm{cor}(X_i, X_j)| \le c/(2s - 1)$ for a constant $0 \le c < 1$, where $s = |S|$, then the irrepresentable condition holds for any $S$.
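To make the condition concrete, the sketch below evaluates the left-hand side of the irrepresentable condition for a known support. The function name `irrepresentable_value` and the example design are illustrative assumptions: a three-column Gaussian design whose third column has correlation 0.6 with each of the first two, with $\beta^* = (1, 1, 0)$. In the population this value is $0.6 + 0.6 = 1.2 > 1$, so the condition fails; for any design with orthonormal columns the value is zero.

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """Return || X_Sc^T X_S (X_S^T X_S)^{-1} signs ||_inf for the given support."""
    support = np.asarray(support)
    complement = np.setdiff1d(np.arange(X.shape[1]), support)
    X_S, X_Sc = X[:, support], X[:, complement]
    # Solve the normal equations instead of forming the inverse explicitly.
    w = np.linalg.solve(X_S.T @ X_S, np.asarray(signs, dtype=float))
    return float(np.max(np.abs(X_Sc.T @ (X_S @ w))))

rng = np.random.default_rng(0)
n = 10_000
Z = rng.standard_normal((n, 3))
# Third column: unit variance, correlation 0.6 with each of the first two columns.
x3 = 0.6 * Z[:, 0] + 0.6 * Z[:, 1] + np.sqrt(1 - 2 * 0.6**2) * Z[:, 2]
X = np.column_stack([Z[:, 0], Z[:, 1], x3])

ic_raw = irrepresentable_value(X, support=[0, 1], signs=[1, 1])

# Q is just some matrix with orthonormal columns (from a QR factorization),
# used only to illustrate the orthonormal case.
Q, _ = np.linalg.qr(X)
ic_orth = irrepresentable_value(Q, support=[0, 1], signs=[1, 1])

print(ic_raw, ic_orth)
```

In practice $S$ and the signs are unknown, so a check like this is only available in simulations.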
The extant literature has proposed several ways of circumventing the irrepresentable condition. The two methods that have received the most attention both focus on refining the $\ell_1$ penalty term in the Lasso objective function (3). Fan and Li (2001) and Zhang (2010) proposed making the penalty function concave. The adaptive Lasso is another popular approach (Zou, 2006); it applies a different $\ell_1$ penalty to each element of the coefficient vector, with penalty weights that come from an initial run of OLS. This can be extended to the high dimensional setting by using an initial run of the Lasso instead of OLS. While these previous approaches alter the penalty function, preconditioning instead changes the shape of the least squares contours in the data fidelity term $\|Y - Xb\|_2^2$. Similar to the work presented here, Xiong et al. (2011) also propose adjusting the data fidelity term to avoid the irrepresentable condition. However, instead of preconditioning, they propose a procedure that (1) makes the design matrix orthogonal by adding rows, and (2) applies an EM algorithm, with the concave SCAD penalty, to estimate the outcomes corresponding to the additional rows in the design matrix.

The Puffer transformation
In this paper, we always assume that the design matrix $X \in \mathbb{R}^{n \times p}$ has rank $d = \min\{n, p\}$. From the singular value decomposition, there exist matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{p \times d}$ with $U^T U = V^T V = I_d$ and a diagonal matrix $D \in \mathbb{R}^{d \times d}$ such that $X = UDV^T$. Define the Puffer transformation $F = U D^{-1} U^T$. The preconditioned design matrix $FX$ has the same singular vectors as $X$. However, all of the nonzero singular values of $FX$ are set to unity: $FX = UV^T$. When $n \ge p$, the columns of $FX$ are orthonormal. When $n \le p$, the rows of $FX$ are orthonormal.
After left multiplying the regression equation by the matrix $F$, the transformed regression equation becomes

$$FY = (FX)\beta^* + F\epsilon.$$

The parentheses around $(FX)$ emphasize that preconditioning is transforming $X$, not $\beta^*$. Just as in GLS (Equation 4), $\beta^*$ remains unchanged after left multiplying the regression equation. When $\epsilon \sim N(0, \sigma^2 I_n)$, the transformed noise satisfies $F\epsilon \sim N(0, \tilde\Sigma)$ with $\tilde\Sigma = \sigma^2 F F^T = \sigma^2 U D^{-2} U^T$.
The scale of $\tilde\Sigma$ depends on the diagonal matrix $D$, which contains the $d$ singular values of $X$. If any singular values of $X$ approach zero, the corresponding elements of $D^{-2}$ grow, amplifying the noise $F\epsilon$. This increased noise can quickly overwhelm the benefits of a well conditioned design matrix. For this reason, Section 3.3 proposes a slightly modified preconditioner that bounds the spectral norm of $\tilde\Sigma$.
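A minimal NumPy sketch of the Puffer transformation follows, written directly from the definition $F = UD^{-1}U^T$. The helper names `puffer` and `summarize` and the matrix sizes are illustrative assumptions. It checks the two facts stated above: the nonzero singular values of $FX$ all equal one, and the spectral norm of the noise covariance $\sigma^2 F F^T$ equals $\sigma^2$ divided by the squared smallest singular value of $X$.

```python
import numpy as np

def puffer(X):
    """Puffer transformation F = U D^{-1} U^T from the thin SVD X = U D V^T."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    return (U / d) @ U.T            # U / d scales column j of U by 1 / d_j

def summarize(n, p, sigma2=1.0, seed=1):
    """Singular values of F X and the noise inflation for a Gaussian n x p design."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    F = puffer(X)
    svals_FX = np.linalg.svd(F @ X, compute_uv=False)
    d = np.linalg.svd(X, compute_uv=False)
    noise_cov = sigma2 * F @ F.T    # covariance of F eps when eps ~ N(0, sigma2 I)
    return svals_FX, np.linalg.norm(noise_cov, 2), sigma2 / d.min() ** 2

for n, p in [(50, 5), (5, 50)]:     # one low dimensional and one high dimensional case
    svals_FX, cov_norm, expected = summarize(n, p)
    print(n, p, np.round(svals_FX, 6), cov_norm, expected)
```

The second and third printed numbers agree in each case, which is the variance cost of preconditioning in its simplest form.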
In numerical linear algebra, the objective is speed, and there is a trade off between the time spent computing the preconditioner vs. solving the system of equations. Better preconditioners make the resulting system of equations easier to solve. However, these preconditioners themselves can be time consuming to compute. In our setting, the objective is inference, not speed per se, and the tradeoff is between a well behaved design matrix and a well behaved error term. Preconditioning can aid statistical inference if it can balance these two constraints.

Previous literature on preconditioning for sparse inference
This paper contributes to the existing literature by studying when preconditioning can circumvent the irrepresentable condition. For other reasons, preconditioning the Lasso has been proposed elsewhere. Paul et al. (2008) estimate a type of latent factor model; theoretical and simulation results suggest that their preconditioning technique improved estimation in settings with a low signal to noise ratio. However, Paul et al. (2008) do not study the relationship between preconditioning and the irrepresentable condition.
More recently, Huang and Jojic (2011) use preconditioning to remove the effects of confounding in high-throughput biological experiments and are motivated by empirical observations in genome wide association studies. In such biological studies, Alter et al. (2000) show how the leading singular vectors from $X$ often "represent additive or multiplicative noise, experimental artifacts, or even irrelevant biological processes." As such, several papers have studied techniques that screen out the (typically large) singular vectors of $X$; see Yang et al. (2014) for further references and discussion. These empirical observations, not the irrepresentable condition, motivated Huang and Jojic (2011) to emphasize the bottom singular values in $X$. Although they accomplish this through preconditioning, their motivation is focused on biological experiments. With data analysis and biologically motivated simulations, they show that preconditioning improves the model selection performance of the Lasso.
Most recently, Rauhut and Ward (2011) study interpolation with orthogonal polynomials. They precondition the polynomials with a diagonal preconditioner to satisfy the restricted isometry property with high probability. In the current paper, we employ non-diagonal preconditioning, which drastically increases the class of design matrices that benefit from preconditioning. Moreover, we demonstrate how preconditioning alters the error term, creating a statistical tradeoff between a well conditioned design matrix and a well behaved error term.
The technical report for this paper introduced the Puffer transformation. This led to two pieces of follow up research. First, Qian and Jia (2012) demonstrate the benefits of the Puffer transformation for the fused Lasso, a sparse high dimensional regression problem that is particularly plagued by correlation in the design matrix. Second, Wauthier et al. (2013) compare the Puffer transformation to the two previous techniques in Paul et al. (2008) and Huang and Jojic (2011). Their analysis assumes that there exists a range $\lambda \in [\lambda_l, \lambda_u]$ for which the standard Lasso estimates the correct sign. They study when preconditioning increases the ratio $\lambda_u/\lambda_l$, thus making sign estimation more robust to the choice of tuning parameter $\lambda$. Their results highlight the fact that the preconditioners in Paul et al. (2008) and Huang and Jojic (2011) project onto rank deficient subspaces. Wauthier et al. (2013) go on to present a specific model for the design matrix $X$ under which the Puffer transformation deterministically scales $\lambda_u/\lambda_l$. Under their model, if the largest singular vectors have small values in the positions of $S$, then the Puffer transformation will increase $\lambda_u/\lambda_l$. Otherwise, the Puffer transformation will decrease $\lambda_u/\lambda_l$. In this paper, we are particularly interested in situations where the design matrix fails to satisfy the irrepresentable condition before preconditioning (i.e. $\lambda_u < \lambda_l$). Section 4.3 gives two brief simulations that compare the Puffer transformation to the preconditioners defined in Paul et al. (2008) and Huang and Jojic (2011).
In the sparse regression literature, other papers have considered left multiplying the regression equation for alternative reasons. Bootstrapping techniques such as Chatterjee and Lahiri (2011) left multiply the regression equation by a random diagonal matrix. Penalized generalized linear models are fit by iteratively reweighted least squares, which is equivalent to left multiplying by a diagonal matrix at each iteration; van de Geer (2008) highlights how it is the conditioning of the final iteration that matters for sign consistency. The subbagging technique in Bradic (2013) concludes by solving the Lasso with a random diagonal weighting matrix on each sub-Lasso problem. These weights are chosen to obtain a random approximation for the solution of the original unweighted problem, not to adjust for the irrepresentable condition.

Geometrical representation of preconditioning and the irrepresentable condition
Figure 2 displays the geometry of the Lasso before and after the Puffer transformation. This figure (i) demonstrates what happens when the irrepresentable condition is not satisfied, (ii) reveals how the Puffer transformation circumvents the irrepresentable condition, and (iii) illustrates why we call $F$ the Puffer transformation. The figures in this section are derived from the following optimization problem, which is equivalent to the Lasso,

$$\hat\beta(c) = \arg\min_{b :\, \|b\|_1 \le c} \|Y - Xb\|_2^2.$$

The definition of $\hat\beta(c)$ abuses notation. In fact, there is a one-to-one function $\phi(c) = \lambda$ that makes the Lagrangian form of the Lasso (Equation 3) equivalent to the constrained form of the Lasso denoted by $\hat\beta(c)$. Given the constraint set $\{b : \|b\|_1 \le c\}$ and a continuum of sets $\{b : \|Y - Xb\|_2^2 \le x\}$ for $x \ge 0$, define

$$I(c, x) = \{b : \|b\|_1 \le c\} \cap \{b : \|Y - Xb\|_2^2 \le x\},$$

and let $x^*$ be the smallest $x$ for which $I(c, x)$ is nonempty. Under certain conditions on $X$ (e.g. full column rank), the solution is unique and $\hat\beta(c) = I(c, x^*)$. In Figure 2, the constraint set $\{b : \|b\|_1 \le c\}$ appears as a diamond shaped polyhedron and the level set of the loss function $\{b : \|Y - Xb\|_2^2 \le x^*\}$ appears as an ellipse. The rows of $X$ are sampled as three dimensional Gaussian vectors with mean zero. The first two elements are independent Gaussians and the third element has correlation .6 with both the first and second elements. To highlight the effects of preconditioning, the noise is very small and $n = 10{,}000$.
In panel A, the design matrix is not preconditioned. In panel B, the problem has been preconditioned, and the ellipse represents the set $\{b : \|FY - FXb\|_2^2 \le x^*\}$; preconditioning turns the oblong ellipse in panel A into the sphere in panel B.
In this simulation, $\beta^* = (1, 1, 0)$ and in all illustrations, the third dimension is represented by the axis that points up and down. Thus, the Lasso estimates the correct sign if the ellipse intersects the constraint set in the (horizontal) plane formed by the first two dimensions. The design matrix in panel A fails the irrepresentable condition because the elongated ellipse forces $\hat\beta(c)$ off of the true plane. This is shown in the bottom illustration in panel A.
In panel B, the design matrix $FX$ satisfies the irrepresentable condition because the elongated direction of the ellipse shrinks down and the ellipse is puffed out into a sphere. Because of this, $\hat\beta(\lambda)$ lies in the true plane. When $n > p$ (as in these figures), preconditioning with $F$ makes the ellipse a sphere. When $p > n$, preconditioning with $F$ can make low dimensional projections of the ellipse more spherical. The name Puffer transformation comes from the pufferfish. As Figure 2 illustrates, the Puffer transformation inflates the smallest singular values of the design matrix, making the contours $\{b : \|FY - FXb\|_2^2 \le x^*\}$ more spherical.

Low dimensional results
If $n \ge p$ and $X$ is full rank, then the preconditioned design matrix has orthonormal columns. Such matrices trivially satisfy the irrepresentable condition and other conditions such as the restricted eigenvalue condition for $\ell_2$ consistency (Bickel et al., 2009). Theorem 1 proves that the preconditioned Lasso is sign consistent, so long as the smallest eigenvalue of $\frac{1}{n} X^T X$ is bounded away from zero.

Theorem 1. Suppose that data $(X, Y)$ follows the linear model described in Equation (1) with iid Gaussian noise $\epsilon \sim N(0, \sigma^2 I_n)$. Define the singular value decomposition of $X$ as $X = UDV^T$. Suppose that $n \ge p$ and $X$ has rank $p$. Further assume that $\Lambda_{\min}(\frac{1}{n} X^T X) \ge C_{\min} > 0$. Define the Puffer transformation [...]

A proof can be found in Appendix A in the supplementary material (Jia and Rohe, 2015) on page 5. The proof is very similar to the standard result for the homogeneous linear model. The difference here is that after the Puffer transformation, the noise vector $F\epsilon$ contains correlated entries. To overcome this difficulty, the proof relies on a Gaussian comparison result that does not need any assumptions on the correlations of Gaussian random variables. Instead, it depends on the maximum among a set of Gaussian (or sub-Gaussian) random variables.
Remark 1. The loss function defined in the Lasso estimator, $\frac{1}{2} \|\tilde Y - \tilde X b\|_2^2$ with $\tilde Y = FY$ and $\tilde X = FX$, is slightly different from the classical definition, which uses $\frac{1}{2n} \|\tilde Y - \tilde X b\|_2^2$. In fact, the Puffer transformation accounts for this change of scale because it depends on the SVD of $X$, instead of $\frac{1}{n} X^T X$. As such, it changes the scale of the loss function.
Remark 2. Suppose that $C_{\min} > 0$ is a constant. If $p$, $\min_{j \in S} |\beta^*_j|$ and $\sigma^2$ do not change with $n$, then choosing $\lambda$ such that $\lambda \to 0$ and $\lambda^2 n \to \infty$ ensures that $\hat\beta(\lambda)$ is sign consistent. One possible choice is $\lambda = \sqrt{\log n / n}$.

In classical linear regression, increasing the correlation between columns of $X$ amplifies the variance of the standard OLS estimator; correlated predictors make estimation more difficult. Without preconditioning, this intuition does not hold for the standard Lasso; increased correlation in $X$ creates an increasingly biased estimator. Theorem 1 shows that after preconditioning, the intuition from OLS again translates; increasing the correlation between the columns of $X$ decreases the smallest singular value of $X$, increasing the spectral norm of $F$ and the variance of the noise terms. Importantly, a large sample size $n$ can overcome the additional noise induced by preconditioning.
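The rate requirement in Remark 2 can be checked in one line. The display below reads the suggested choice as $\lambda_n = \sqrt{\log n / n}$, which is an interpretation of the extracted text; the alternative reading $\lambda_n = \log n / n$ would violate the second requirement, since then $\lambda_n^2 n = (\log n)^2 / n \to 0$.

```latex
\lambda_n = \sqrt{\frac{\log n}{n}} \;\longrightarrow\; 0,
\qquad
\lambda_n^2 \, n = \log n \;\longrightarrow\; \infty
\qquad (n \to \infty).
```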
Theorem 1 applies to the more general class of penalized least squares methods

$$\arg\min_b \tfrac{1}{2} \|FY - FXb\|_2^2 + \mathrm{pen}(b, \lambda)$$

for some type of penalty function $\mathrm{pen}(b, \lambda) : \mathbb{R}^p \to \mathbb{R}$, e.g. Lasso, SCAD, and MCP (Fan and Li, 2001; Zhang, 2010). After preconditioning, the design matrix $FX$ is orthogonal and several convenient facts follow. First, if the penalty decomposes, $\mathrm{pen}(b, \lambda) = \sum_{j=1}^p \mathrm{pen}_j(b_j, \lambda)$, so that $\mathrm{pen}_j$ does not rely on $b_k$ for $k \ne j$, then the penalized least squares methods admit closed form solutions. If it is also true that all the $\mathrm{pen}_j$'s are identical functions that have a cusp at zero (e.g. Lasso, SCAD, MCP), then the solution to the preconditioned penalized least squares problem selects the same sequence of models as preconditioned correlation screening (i.e. select $X_j$ if $|\mathrm{cor}(FY, FX_j)| \ge \lambda$) (Fan and Lv, 2008). Theorem 1 implies that all such methods are sign consistent. These observations rely on the fact that $FX$ is an orthogonal matrix. In high dimensions, $FX$ is no longer orthogonal. So, the various methods could potentially estimate different models.

Fig 3. Preconditioning (in black) reduces the average IC value to less than one. Horizontal axis: $\rho$; vertical axis: IC. As the correlation $\rho$ increases, most values of $IC_{\beta^*}(FX)$ (in black) remain below the dashed line corresponding to the critical threshold at 1. Without preconditioning, $IC_{\beta^*}(X)$ (in grey) quickly surpasses the critical threshold. Each point corresponds to one design matrix. The thick black and grey lines pass through the average IC value for each setting of $\rho$. The thin solid lines correspond to +/− one standard deviation.
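For the Lasso penalty, the closed form mentioned in the low dimensional discussion is coordinate-wise soft thresholding of $(FX)^T FY$. The sketch below illustrates this under assumed values of `n`, `p`, `lam`, and $\beta^*$ (none of which come from the paper) and verifies the Karush-Kuhn-Tucker conditions of the preconditioned Lasso numerically instead of calling an external solver.

```python
import numpy as np

def soft_threshold(z, lam):
    """Coordinate-wise minimizer of 0.5 * (z - b)^2 + lam * |b|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(2)
n, p, lam = 200, 10, 0.5                 # illustrative values (assumption)
beta_star = np.zeros(p)
beta_star[:2] = 3.0

X = rng.standard_normal((n, p))
Y = X @ beta_star + rng.standard_normal(n)

# Puffer transformation F = U D^{-1} U^T; for n >= p, F X has orthonormal columns.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
F = (U / d) @ U.T
X_tilde, Y_tilde = F @ X, F @ Y

z = X_tilde.T @ Y_tilde                  # one inner product per coefficient
beta_hat = soft_threshold(z, lam)        # closed-form preconditioned Lasso solution

# KKT conditions for 0.5 * ||Y_tilde - X_tilde b||^2 + lam * ||b||_1.
grad = X_tilde.T @ (Y_tilde - X_tilde @ beta_hat)
active = beta_hat != 0
kkt_ok = bool(
    np.allclose(grad[active], lam * np.sign(beta_hat[active]))
    and np.all(np.abs(grad[~active]) <= lam + 1e-12)
)
print(np.round(beta_hat, 3), kkt_ok)
```

Since a coefficient is nonzero exactly when $|z_j| > \lambda$, decreasing $\lambda$ adds variables in order of $|z_j|$, which matches the screening interpretation given above.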

High dimensional results
Subsection 3.1 gives a motivating simulation that illustrates the benefits of preconditioning in the high dimensional setting. Theorem 2 in Subsection 3.2 shows that $FX$ satisfies the irrepresentable condition for many design matrices $X$. Subsection 3.3 proposes a class of generalized Puffer transformations and Theorem 3 proves that the Lasso with a specific preconditioner in this class can be sign consistent with arbitrarily small singular values in $X$.

Motivating simulation
Figure 3 presents an illustrative numerical simulation to prime our intuition on preconditioning in high dimensions. In this simulation, $n = 200$, $p = 10{,}000$, and each row of $X$ is an independent Gaussian vector with mean zero and covariance matrix $\Sigma$. The diagonal of $\Sigma$ is all ones and the off diagonal elements are all $\rho$; $\rho$ varies on the horizontal axis of Figure 3. The vertical axis plots the values

$$IC_{\beta^*}(X) = \left\| X_{S^c}^T X_S \left(X_S^T X_S\right)^{-1} \mathrm{sign}(\beta^*_S) \right\|_\infty,$$

where $X_S$ and $X_{S^c}$ contain the columns of $X$ indexed by $S$ and by its complement, $S = \{1, \ldots, 10\}$, and the nonzero elements of $\beta^*$ are all positive. Along with $IC_{\beta^*}(X)$, the figure also contains $IC_{\beta^*}(FX)$ and a horizontal line at 1. Recall that if $IC_{\beta^*}(X) < 1$, then $X$ satisfies the irrepresentable condition. The figure shows that $IC_{\beta^*}(X)$ quickly exceeds 1, while $IC_{\beta^*}(FX) < 1$ for all values of $\rho$. The reason that this happens is that preconditioning drastically reduces the correlation between the columns. For example, for $\rho = .9$, the pairwise correlations between the columns of $X$ have an average of .90 with a standard deviation of .01. After the transformation, the average correlation is .005, and the standard deviation is .07. By reducing the pairwise correlations, preconditioning helps the design matrix satisfy the irrepresentable condition.
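The drop in pairwise correlation can be reproduced at a smaller scale. The sketch below is an illustration under assumptions, not the code behind Figure 3: it uses $p = 500$ instead of $10{,}000$ to stay fast, and the helper `mean_offdiag_corr` is a hypothetical name. It draws an equicorrelated Gaussian design with $\rho = 0.9$ and compares the average pairwise column correlation before and after the Puffer transformation.

```python
import numpy as np

def mean_offdiag_corr(M):
    """Average pairwise sample correlation between the columns of M."""
    C = np.corrcoef(M, rowvar=False)
    off_diagonal = ~np.eye(C.shape[0], dtype=bool)
    return float(C[off_diagonal].mean())

rng = np.random.default_rng(3)
n, p, rho = 200, 500, 0.9      # p is reduced from the paper's 10,000 (assumption)

# Rows are N(0, Sigma) with unit diagonal and off-diagonal rho:
# x_ij = sqrt(rho) * g_i + sqrt(1 - rho) * z_ij with g_i, z_ij iid N(0, 1).
g = rng.standard_normal((n, 1))
X = np.sqrt(rho) * g + np.sqrt(1 - rho) * rng.standard_normal((n, p))

# F X = U V^T, obtained directly from the thin SVD X = U D V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
FX = U @ Vt

corr_before = mean_offdiag_corr(X)
corr_after = mean_offdiag_corr(FX)
print(f"average correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

The shared factor $g_i$ that drives the correlation is concentrated in the leading singular direction of $X$; setting all singular values to one removes its dominance.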

Uniform distribution on the Stiefel manifold
When $p \ge n$ and $X$ is full rank, the rows of $FX$ are orthonormal, so $FX$ lies in the Stiefel manifold,

$$V(n, p) = \{V \in \mathbb{R}^{n \times p} : V V^T = I_n\}.$$

Moreover, when $p \ge n$, $F$ can be computed as $(XX^T)^{-1/2}$ and $FX = (XX^T)^{-1/2} X$ is the projection of $X$ onto $V(n, p)$ under any unitarily invariant norm (Fan and Hoffman, 1955). Denote the orthogonal group of matrices as $O(p, \mathbb{R}) = V(p, p)$.

Definition 3 (Chikuse (2003)). A random matrix $V$ is uniformly distributed on $V(n, p)$, written $V \sim \mathrm{uniform}(V(n, p))$, if the distribution of $V$ is equal to the distribution of $VO$ for any fixed $O$ in the orthogonal group of matrices $O(p, \mathbb{R})$.
Theorem 2 shows that if $FX \sim \mathrm{uniform}(V(n, p))$, then the matrix satisfies the irrepresentable condition with high probability. Propositions 1 and 2 give two examples of random design matrices $X$ where $FX$ is uniformly distributed on $V(n, p)$.
Theorem 2. Suppose that $V \sim \mathrm{uniform}(V(n, p))$ and let $X = UDV^T$ for any $U \in O(p, \mathbb{R})$ and diagonal matrix $D$. If $p - s \ge n$, $p > 9n$ and $n > 400(s + 1)^2$, then [...]
Section C in the supplementary material contains a proof for this theorem. The proof is on page 17 and restated as Theorem C.2 in the supplementary material. The first step of the proof is to relate the matrix $FX$, drawn uniformly from the Stiefel manifold, to matrices that contain iid $N(0, 1)$ elements. Then, results from random matrix theory control the spectral norm of the Gaussian random matrix and provide the result (Davidson and Szarek, 2001). A similar argument obtains a similar result for a non-preconditioned design matrix $X$ with iid $N(0, 1)$ entries; this is included in Theorem B.2 in the supplementary material. Propositions 1 and 2 give two models for $X$ that make $FX \sim \mathrm{uniform}(V(n, p))$.
Proposition 1. If the elements of $X$ are independent $N(0, 1)$ random variables, then $FX \sim \mathrm{uniform}(V(n, p))$.
Proposition 2. Suppose that $U_\Sigma \in \mathbb{R}^{p \times p}$ is drawn uniformly from $O(p, \mathbb{R})$ and $D_\Sigma \in \mathbb{R}^{p \times p}$ is a diagonal matrix with positive entries. Define $\Sigma = U_\Sigma D_\Sigma U_\Sigma^T$ and suppose the rows of $X$ are drawn independently from $N(0, \Sigma)$. Then $FX \sim \mathrm{uniform}(V(n, p))$.
The proofs for these propositions are in the supplementary material, Section C.
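As a small numerical companion to Proposition 1, the sketch below builds $F = (XX^T)^{-1/2}$ from an eigendecomposition for a Gaussian $X$ with $p \ge n$ and confirms that $FX$ has orthonormal rows and coincides with $UV^T$ from the SVD. The sizes are illustrative assumptions, and the code only checks membership in $V(n, p)$; it does not test uniformity.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 100                       # illustrative sizes with p >= n (assumption)
X = rng.standard_normal((n, p))      # iid N(0, 1) entries, as in Proposition 1

# F = (X X^T)^{-1/2} via the eigendecomposition of the n x n matrix X X^T.
evals, Q = np.linalg.eigh(X @ X.T)
F = (Q / np.sqrt(evals)) @ Q.T       # Q diag(evals^{-1/2}) Q^T
FX = F @ X

# The same matrix written with the thin SVD X = U D V^T: F X = U V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

rows_orthonormal = bool(np.allclose(FX @ FX.T, np.eye(n)))
matches_svd = bool(np.allclose(FX, U @ Vt))
print(rows_orthonormal, matches_svd)
```

The eigendecomposition route only needs the $n \times n$ matrix $XX^T$, which can be convenient when $p$ is much larger than $n$.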

Generalized Puffer transformation
In the preconditioned regression equation, the noise $\epsilon$ becomes $F\epsilon$. Since the spectral norm of $F$ is unbounded as the smallest nonzero singular value of $X$ approaches zero, the preconditioned noise $F\epsilon$ has unbounded variance. To diminish the increase in variance from preconditioning, this section studies a generalized form of the Puffer transformation.
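Definition 4. Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with singular value decomposition $X = UDV^T$. For the matrix $X$, the generalized Puffer transformation with function $g$ and tuning parameter $\tau$ is

$$F_{g,\tau} = U G U^T,$$

where $G$ is a diagonal matrix with $G_{ii} = g(D_{ii}, \tau) / D_{ii}$.

This definition implies that $F_{g,\tau} X = U \tilde D V^T$ where $\tilde D_{ii} = g(D_{ii}, \tau)$. Here, $g$ is a function of the singular values of $X$ and a tuning parameter $\tau$. The Puffer transformation is $F = F_{1,\tau}$ where $1(D_{ii}, \tau) = 1$. To illustrate the potential benefits from this generalized preconditioner, define the hard thresholding function as

$$h(x, \tau) = \begin{cases} 1 & \text{if } x \ge \tau, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$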
The spectral norm of $F_{h,\tau}$ is bounded by $1/\tau$, limiting the amount that the preconditioner amplifies the noise. This next theorem studies this preconditioner under a model where the singular values of $X$ are potentially very small and assumes that $V \sim \mathrm{uniform}(V(n, p))$, where $V$ contains the right singular vectors of $X$. This highlights the tradeoff between (a) satisfying the irrepresentable condition and (b) limiting the amount of additional noise created by preconditioning.
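The bound on the spectral norm can be seen in a small example. The sketch below implements one construction consistent with the property $F_{h,\tau} X = U \tilde D V^T$, namely $F_{h,\tau} = U G U^T$ with $G_{ii} = 1/D_{ii}$ when $D_{ii} \ge \tau$ and zero otherwise. The function name `generalized_puffer_hard`, the synthetic singular values, and the threshold are assumptions made for illustration.

```python
import numpy as np

def generalized_puffer_hard(X, tau):
    """F_{h,tau} = U G U^T with G_ii = 1 / D_ii if D_ii >= tau, else 0."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    g = np.zeros_like(d)
    keep = d >= tau
    g[keep] = 1.0 / d[keep]
    return (U * g) @ U.T, int(keep.sum())     # U * g scales column j of U by g_j

rng = np.random.default_rng(5)
n, p, tau = 5, 40, 1.0
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((p, n)))
d0 = np.array([5.0, 4.0, 3.0, 1e-3, 1e-4])    # two nearly degenerate directions
X = (U0 * d0) @ V0.T                          # X = U0 diag(d0) V0^T

F_full, _ = generalized_puffer_hard(X, tau=0.0)   # tau = 0 keeps every direction
F_hard, n_kept = generalized_puffer_hard(X, tau)

norm_full = np.linalg.norm(F_full, 2)         # 1 / (smallest singular value of X)
norm_hard = np.linalg.norm(F_hard, 2)         # at most 1 / tau
svals_hard = np.linalg.svd(F_hard @ X, compute_uv=False)
print(norm_full, norm_hard, n_kept, np.round(svals_hard, 6))
```

With $\tau = 0$ the function reduces to the Puffer transformation, so the two norms show the noise amplification with and without thresholding on the same design.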
Theorem 3. Suppose that $V \sim \mathrm{uniform}(V(n, p))$ and let $X = UDV^T$ for any $U \in O(p, \mathbb{R})$ and diagonal matrix $D$. Suppose $Y = X\beta^* + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I_n)$ is independent of $X$. For $\tau_n > 0$, let $\tilde n$ be the number of $D_{ii}$'s greater than or equal to $\tau_n$. Define the hard thresholding function $h(x, \tau_n)$ as in Equation (8) and the generalized preconditioner $F_{h,\tau_n}$ as in Definition 4. Define $\tilde Y = F_{h,\tau_n} Y$, $\tilde X = F_{h,\tau_n} X$, and $\hat\beta(\lambda) = \arg\min_b \frac{1}{2} \|\tilde Y - \tilde X b\|_2^2 + \lambda \|b\|_1$. Suppose that $p - s \ge \tilde n$, $p > 9\tilde n$ and $\tilde n > 400(s + 1)^2$. If $\min_{j \in S} |\beta^*_j| \ge$ 2λ 9sp/(5ñ), then [...]

A proof can be found in Section C in the supplementary material. The proof is on page 20 and restated as Theorem C.4 in the supplementary material. The proof for this result relies on the previous result in Theorem 2, which says that the irrepresentable condition holds with high probability.
The assumption on $\min_{j \in S} |\beta^*_j|$ appears restrictive. However, the scale of $\lambda$, $\tau_n$, and $D$ play an essential role. After accounting for these terms, this condition is comparable to previous results. For the probability bound to converge to one, $\lambda^2 \tau_n^2$ must grow faster than $\log p$. So, it is necessary to consider how $\tau_n$ grows in a standard setting. In the situation where the elements of $X$ contain iid random variables with constant variance, the average element of $D$ is $O(\sqrt{p})$. If $\tau_n$ grows at this rate, then choosing $\lambda^2 = p^{-1} \log n \log p$ ensures the last term in the probability bound converges to one. This yields a condition that is comparable to previous results. If $\tau_n$ is smaller, then $\tilde n$ is larger. However, $\lambda^2$ must also be larger. As a result, the lower bound on $\min_{j \in S} |\beta^*_j|$ becomes more strict. To ensure that this lower bound is not growing, $\tau_n$ should grow faster than p/n.

This theorem does not assume that $X$ satisfies the irrepresentable condition. Instead, it supposes that $V \sim \mathrm{uniform}(V(n, p))$ and only presumes that $D$ has sufficiently many values greater than $\tau$. Several previous papers have also studied the Lasso (without preconditioning) under generative models for $X$ (e.g. Rudelson and Vershynin, 2006; Candes and Romberg, 2007). The previous literature has constructed these random designs in a few different ways: for example, by filling $X$ with independent and identically distributed elements (e.g. binary or Gaussian), or by taking an orthonormal basis $O \in V(p, p)$ (e.g. the Fourier transform) and sampling $n$ elements of this basis uniformly at random; these $n$ elements are then concatenated to form an $n \times p$ design matrix. In all previous cases, these matrices will be well conditioned (i.e. the smallest nonzero singular value of $X$ has the same order of magnitude as the largest singular value).
However, if the experimental design or physical constraints restrict the sampling mechanism in $X$, then $X$ will likely be ill conditioned and thus fail the irrepresentable condition. Theorem 3 allows for such design matrices by not making any assumptions on the smallest elements in $D$, showing that the Lasso can still be sign consistent with a generalized preconditioner.

Simulations
This section contains two simulations that study the performance of the Puffer transformation and the generalized Puffer transformation. The first simulation compares the model selection and ℓ 2 estimation performance of the Puffer transformed Lasso with the standard Lasso, Elastic Net, SCAD, and MC+ (Zou and Hastie, 2005;Fan and Li, 2001;Zhang, 2010). The second simulation illustrates a situation where the generalized preconditioner improves upon the Puffer transformation.

Preconditioning with F
After preconditioning, the noise vector $F\epsilon$ contains statistically dependent terms that are no longer exchangeable. This complicates many of the standard methods of tuning parameter selection (e.g. CV, AIC, BIC). We use the following OLS-BIC procedure. Appendix D gives an additional simulation that ensures this procedure does not differentially favor the preconditioned Lasso.
OLS-BIC: choosing a model from a path of models. Starting from the null model, for each nz = 1, . . . , 40, select the first model along the solution path with nz nonzero elements. For each value of nz, use the selected nz features to fit an OLS model with the un-preconditioned data. Compute the BIC for the resulting OLS model. Finally, select the tuning parameter that corresponds to the model with the lowest OLS-BIC score. The OLS models were fit with the R function lm and the BIC was computed with the R function BIC.
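The OLS-BIC procedure can be sketched in a few lines. The following is a minimal Python/NumPy illustration, not the R code used for the simulations. The function names `ols_bic` and `select_by_ols_bic` are hypothetical, the BIC counts the regression coefficients plus the error variance as parameters (intended to mirror what `BIC` reports for an `lm` fit), and the "solution path" in the demo is a stand-in built from marginal correlations rather than an actual Lasso path.

```python
import numpy as np

def ols_bic(X, y, support):
    """BIC of an OLS fit (with intercept) on the columns listed in `support`."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X[:, support]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    # Gaussian log-likelihood evaluated at the maximum likelihood estimates.
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    n_params = A.shape[1] + 1  # coefficients plus the error variance
    return -2.0 * loglik + np.log(n) * n_params

def select_by_ols_bic(X, y, path_supports, max_nz=40):
    """Keep the first model of each size nz = 1..max_nz along the path,
    score each with OLS-BIC on the un-preconditioned (X, y), return the best."""
    first_with_size = {}
    for support in path_supports:
        nz = len(support)
        if 1 <= nz <= max_nz and nz not in first_with_size:
            first_with_size[nz] = np.asarray(support)
    scores = {nz: ols_bic(X, y, sup) for nz, sup in first_with_size.items()}
    best_nz = min(scores, key=scores.get)
    return first_with_size[best_nz], scores

# Demo on synthetic data (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
true_support = [0, 1, 2]
beta_star = np.zeros(p)
beta_star[true_support] = 3.0
y = X @ beta_star + rng.standard_normal(n)

# Stand-in for a solution path: nested supports ordered by |X^T y|.
order = np.argsort(-np.abs(X.T @ y))
path_supports = [order[:k] for k in range(0, 11)]

selected, scores = select_by_ols_bic(X, y, path_supports)
print(sorted(selected.tolist()))
```

In practice, `path_supports` would come from the supports visited along the (preconditioned or standard) Lasso path, while `X` and `y` passed to the scoring step remain the original, un-preconditioned data.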
In this simulation, n = 250, s = 20, and p grows along the horizontal axis of the figures (from $2^5 = 32$ to $2^{15} = 32{,}768$). All nonzero elements in $\beta^*$ equal three and $\sigma^2 = 1$. The rows of X are mean zero Gaussian vectors with constant correlation ρ. In the top row of plots in Figures 4 and 5, ρ = .1. In the middle and bottom rows, ρ = .5 and .85 respectively.
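A design of this kind can be generated with a shared-factor construction. The sketch below is an illustration in Python/NumPy and rests on two assumptions not stated in the text: the columns have unit variance, and the nonzero coefficients occupy the first s positions. The helper name `equicorrelated_design` is hypothetical.

```python
import numpy as np

def equicorrelated_design(n, p, rho, rng):
    """Rows are N(0, Sigma) with Sigma_jj = 1 and Sigma_jk = rho for j != k.

    Uses X_ij = sqrt(1 - rho) * Z_ij + sqrt(rho) * W_i, where Z and W are
    independent standard normals and W_i is shared by every entry of row i.
    """
    z = rng.standard_normal((n, p))
    w = rng.standard_normal((n, 1))
    return np.sqrt(1.0 - rho) * z + np.sqrt(rho) * w

rng = np.random.default_rng(0)
n, s, p, rho = 250, 20, 2**7, 0.5  # p = 2^7 is one point on the horizontal axis
X = equicorrelated_design(n, p, rho, rng)

beta_star = np.zeros(p)
beta_star[:s] = 3.0                      # placement of nonzeros is an assumption
y = X @ beta_star + rng.standard_normal(n)  # sigma^2 = 1

# Average empirical correlation between distinct columns; should be near rho.
emp_corr = np.corrcoef(X, rowvar=False)
off_diag_mean = (emp_corr.sum() - p) / (p * (p - 1))
print(round(float(off_diag_mean), 3))
```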
The first column of plots in Figure 4 corresponds to the number of false negatives. The second column corresponds to the number of false positives. Figure 5 plots the ℓ2 error $\|\hat{\beta}(\lambda) - \beta^*\|_2$ on the right. Each data point in every plot comes from an average of ten simulation runs.
In many settings, across both p and ρ, the preconditioned Lasso simultaneously admits fewer false positives and fewer false negatives than the competing methods. The number of false negatives when ρ = .85 (displayed in the bottom left plot of Figure 4) gives the starkest example. In particular, as the correlation increases or the number of predictors grows, the preconditioned Lasso has the best relative performance. When p ≈ n = 250, the preconditioned Lasso performs poorly; this is because the singular values of X follow the Marchenko-Pastur law and when p ≈ n, this distribution has mass around zero. As a result, F has large spectral norm leading to excessive noise. This would be an appropriate regime to explore the use of a generalized preconditioner.
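The effect of p ≈ n on the preconditioner can be checked numerically. The following sketch (Python/NumPy, with iid Gaussian designs as an illustrative assumption) forms $F = U D^{-1} U^T$ from the SVD of X and reports its spectral norm, which equals the reciprocal of the smallest singular value of X when X has full row rank.

```python
import numpy as np

def puffer(X):
    """Puffer transformation F = U D^{-1} U^T from the thin SVD X = U D V^T."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    return (U / d) @ U.T  # dividing column j of U by d[j] gives U D^{-1}

rng = np.random.default_rng(1)
n = 250
spectral_norms = {}
for p in (250, 1000, 4000):
    X = rng.standard_normal((n, p))
    F = puffer(X)
    spectral_norms[p] = float(np.linalg.norm(F, 2))

for p, value in spectral_norms.items():
    print(p, value)
```

The norm is largest when p = n, where the smallest singular value of X sits close to zero, and it shrinks as p moves away from n. Since the preconditioned noise is $F\epsilon$, a large spectral norm of F translates directly into inflated noise.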
All simulations in this section were run in R with the packages lars (for the Lasso), plus (for SCAD and MC+), and glmnet (for the elastic net) (Efron et al., 2004; Zhang, 2010; Friedman et al., 2010).
In the second simulation, the design matrix is simulated as $X_{ij} = (G_i/\alpha) Z_{ij}$, where the $Z_{ij}$ are iid $N(0, 1)$ random variables and the $G_i$ are independent Gamma random variables with shape $\alpha$ and rate one. The Gamma random variables make the rows of X have heterogeneous lengths. The horizontal axis of Figure 6 represents the standard deviation of $G_i/\alpha$. As $\alpha \to \infty$, $G_i/\alpha$ concentrates around one. So, large values of $\alpha$ are on the left. As $\alpha \to 0$, the standard deviation of $G_i/\alpha$ grows; these values are plotted on the right.
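As a concrete illustration of this design, the sketch below draws X for a large and a small value of α and compares the empirical standard deviation of $G_i/\alpha$ with its theoretical value $1/\sqrt{\alpha}$ (a Gamma variable with shape α and rate one has variance α). The code is Python/NumPy; the helper name and the two values of α are illustrative choices.

```python
import numpy as np

def heterogeneous_rows_design(n, p, alpha, rng):
    """X_ij = (G_i / alpha) * Z_ij with G_i ~ Gamma(shape=alpha, rate=1)."""
    g = rng.gamma(shape=alpha, scale=1.0, size=n)  # rate one means scale one
    z = rng.standard_normal((n, p))
    return (g / alpha)[:, None] * z, g

rng = np.random.default_rng(2)
n, p = 200, 1000
sd_by_alpha = {}
for alpha in (100.0, 1.0):
    X, g = heterogeneous_rows_design(n, p, alpha, rng)
    sd_by_alpha[alpha] = float(np.std(g / alpha))
    print(alpha, sd_by_alpha[alpha], 1.0 / np.sqrt(alpha))
```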
The top plot in Figure 6 shows that IC(X) quickly surpasses the critical threshold of one. As such, X is much less likely to satisfy the irrepresentable condition when the standard deviation of $G_i/\alpha$ is large. The middle plot in Figure 6 shows that as the standard deviation of row length increases, the Puffer transformation drastically reduces the signal to noise ratio, $SNR_{dB}$ (defined in equation 9). After preconditioning with $F$, the $SNR_{dB}$ becomes $SNR_{dB}(FX\beta^*, F\epsilon)$. The top plot in Figure 6 also shows that $F_{g,\tau}$ retains many of the advantages of preconditioning by drastically expanding the region of design matrices that can satisfy the irrepresentable condition. At the same time, it drastically increases the signal to noise ratio (compared to the Puffer transformation). As a result, $F_{g,\tau}$ yields a better sign estimator than both the standard Lasso and the Puffer preconditioned Lasso (bottom plot).

In all simulations in Figure 6, there are s = 10 nonzero elements in $\beta^*$ and each nonzero element is 30. The error terms are iid $N(0, 1)$, n = 200, and p = 1000. The tuning parameter is $\tau = .05\sqrt{p}$.
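The trade-off between the two preconditioners can be explored with a rough numerical sketch. Two assumptions are made here and neither is taken from the text: the signal to noise ratio uses the common decibel convention $10 \log_{10}(\|\text{signal}\|_2^2 / \|\text{noise}\|_2^2)$, and the generalized preconditioner is replaced by a hypothetical hard-threshold variant that inverts only the singular values above τ and zeroes the rest. The definition of $F_{g,\tau}$ in Section 3.3 may differ.

```python
import numpy as np

def snr_db(signal, noise):
    # Assumed convention: 10 * log10 of the ratio of squared l2 norms.
    return float(10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2)))

rng = np.random.default_rng(3)
n, p, s, alpha = 200, 1000, 10, 0.5  # alpha = 0.5 is an illustrative choice
g = rng.gamma(shape=alpha, scale=1.0, size=n)
X = (g / alpha)[:, None] * rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 30.0
eps = rng.standard_normal(n)

U, d, _ = np.linalg.svd(X, full_matrices=False)

# Puffer transformation F = U D^{-1} U^T.
F_puffer = (U / d) @ U.T

# Hypothetical hard-threshold variant (an assumption, not the paper's F_{g,tau}):
# invert singular values above tau, drop the directions below it.
tau = 0.05 * np.sqrt(p)
g_inv = np.where(d > tau, 1.0 / d, 0.0)
F_thresh = (U * g_inv) @ U.T

signal = X @ beta_star
snr_raw = snr_db(signal, eps)
snr_puffer = snr_db(F_puffer @ signal, F_puffer @ eps)
snr_thresh = snr_db(F_thresh @ signal, F_thresh @ eps)
print(snr_raw, snr_puffer, snr_thresh)
```

With strongly heterogeneous row lengths, a few very small singular values dominate $\|F\epsilon\|_2$ under the Puffer transformation, which is why its signal to noise ratio collapses; capping the inversion at τ avoids amplifying those directions.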

The generalized preconditioner improves sign estimation
Fig 6. The horizontal axis (heterogeneity in row length, SD of $G_i$) controls the amount of heterogeneity in the row lengths of X; the vertical axis of the bottom panel is P(correct sign estimation). As the heterogeneity increases, the irrepresentable condition evaluated with X quickly fails by surpassing the red line. Simultaneously, the signal to noise ratio (defined in equation 9) for $FX$ converges to $-\infty$ because the spectrum of X decays faster as the row heterogeneity increases. The generalized preconditioner $F_{g,\tau}$ balances these trade-offs and improves sign estimation.

Comparing Lasso preconditioners
This simulation compares four different preconditioning methods, investigating their ability to (1) satisfy the irrepresentable condition and (2) select the correct model. In addition to the Puffer transformation, this simulation investigates the following three techniques:
• Row Normalization. This preconditioner is a diagonal matrix $D$, with $D_{ii}$ equal to the inverse of the ℓ2 length of the ith row of X. Preconditioning with $D$ creates a design matrix with equal row lengths.
• Correlation Sifting. Huang and Jojic (2011) suggest a preconditioner that projects X and Y onto the n − K smallest principal components of X. To define the preconditioner, take the SVD $X = U D V^T$ and define $U_A \in R^{n \times (n-K)}$ to contain the left singular vectors of X that correspond to the n − K smallest singular values. The preconditioner is $U_A U_A^T$.
• Latent Model. To estimate a Gaussian latent variable model, Paul et al. (2008) propose the following preconditioning technique: first, identify the q columns of X that are maximally correlated with Y and place these columns into a matrix $X_S$. Then, project Y onto the K largest principal components of $X_S$. In this routine, X is not preconditioned.
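The two preconditioners that act on X can be written compactly. The sketch below is a Python/NumPy illustration under stated assumptions: n ≤ p so that the thin SVD returns an n × n matrix U, the diagonal entries of the Row Normalization matrix are taken as inverse row lengths so that the transformed rows have equal (unit) length, and the function names are hypothetical. Latent Model preconditioning is omitted because it modifies only Y.

```python
import numpy as np

def row_normalization(X):
    """Diagonal preconditioner that rescales every row of X to unit l2 length."""
    lengths = np.linalg.norm(X, axis=1)
    return np.diag(1.0 / lengths)

def correlation_sifting(X, K):
    """Projection U_A U_A^T onto the left singular vectors belonging to the
    n - K smallest singular values (assumes n <= p)."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)  # d is in decreasing order
    U_A = U[:, K:]                                   # drop the K leading vectors
    return U_A @ U_A.T

rng = np.random.default_rng(4)
n, p, K = 50, 200, 1
X = rng.standard_normal((n, p)) + 2.0 * rng.standard_normal((n, 1))

D_rows = row_normalization(X)
row_lengths_after = np.linalg.norm(D_rows @ X, axis=1)

P_sift = correlation_sifting(X, K)
U, d, _ = np.linalg.svd(X, full_matrices=False)
top_direction = U[:, 0]

print(row_lengths_after[:3])
print(np.linalg.norm(P_sift @ top_direction))  # close to zero
```

Both preconditioners are applied in the same way as the Puffer transformation: the Lasso is computed with the transformed pair, for example `(P_sift @ X, P_sift @ Y)`.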
Latent Model preconditioning differs from the others in two important respects. First, it only preconditions Y. So, it does not alter the IC value of the design matrix (see Equation 7). Second, both Correlation Sifting and the Puffer transformation remove the effect of the largest singular vectors in X, whereas Latent Model emphasizes these directions.

The simulation in Figure 7 samples each row of $X \in R^{300 \times p}$ independently from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$, which depends on a correlation parameter ρ. The value of ρ is represented on the horizontal axis in each of the four plots. The first simulation is "very-sparse," with p = 10,000 columns in X and q = 10 nonzero elements in $\beta^*$. The second simulation is "semi-sparse," with p = 500 and q = 50. Here, we use q to denote both the true number of nonzeros in $\beta^*$ and also the number of variables screened for the Latent Model preconditioner. The nonzero elements of $\beta^*$ are all 10 and the noise variance is 1. Because $\Sigma$ is a rank one perturbation of the identity, this simulation uses K = 1 for both Correlation Sifting and Latent Model preconditioning.
The top left panel in Figure 7 displays the IC values (Equation 7) under the very-sparse setting; recall that $FX$ satisfies the irrepresentable condition when $IC(FX) < 1$. In this plot, as the correlation increases from 0 to .09, the IC values of both Correlation Sifting and Puffer preconditioning are unchanged. In fact, both techniques are insensitive to ρ all the way through ρ = .95 (not shown). However, as ρ exceeds .08, X begins to fail the irrepresentable condition. In the top plots, Latent Model is overplotted by "no preconditioning" because it does not precondition X. The lower left plot shows that both Correlation Sifting and Puffer preconditioning estimate the correct model for ρ ∈ [0, .09]. They select the correct model all the way through ρ = .95 (not shown). The same plot shows that for all levels of ρ ∈ [0, .09], Latent Model preconditioning has worse model selection performance than the Lasso without any preconditioning. In both the very- and semi-sparse settings, Puffer preconditioning creates a well conditioned design matrix that allows for good model selection performance.
The top right panel in Figure 7 shows the conditioning performance under the semi-sparse setting. In this setting, Puffer preconditioning is again insensitive to ρ and satisfies the irrepresentable condition for all values of ρ ∈ [0, .09]. This performance translates into far superior model selection performance (shown in the bottom right panel). These simulations were created with ρ taking the values 0, .01, .02, . . . , .09. For each of these values, both X and Y were sampled 100 different times. The lines connect the average of 100 points. A technique is deemed to "select the correct model" if there exists a value of λ such that $\hat{\beta}(\lambda)$ has the same support as $\beta^*$.
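The "select the correct model" criterion amounts to a support comparison along the solution path. A minimal sketch in Python/NumPy follows; the function name and the toy path are hypothetical, and in practice the path would hold the coefficient vectors returned by a Lasso solver over a grid of λ values.

```python
import numpy as np

def selects_correct_model(beta_path, beta_star):
    """True if some coefficient vector on the path has the support of beta_star."""
    true_support = beta_star != 0
    return any(
        np.array_equal(np.asarray(beta_hat) != 0, true_support)
        for beta_hat in beta_path
    )

beta_star = np.array([10.0, 10.0, 0.0, 0.0])

# Toy paths: each row plays the role of beta_hat(lambda) for one value of lambda.
path_hit = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [4.1, 0.0, 0.0, 0.0],
    [7.9, 6.5, 0.0, 0.0],   # support matches beta_star here
    [9.2, 8.8, 0.3, 0.0],
])
path_miss = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [4.1, 0.0, 0.2, 0.0],
    [7.9, 6.5, 0.4, 0.0],
])

print(selects_correct_model(path_hit, beta_star))
print(selects_correct_model(path_miss, beta_star))
```

Note that this criterion compares supports only; it is weaker than sign consistency, which also requires the signs of the nonzero estimates to match.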

Discussion
This paper shows that preconditioning has the potential to circumvent the irrepresentable condition in several sparse regression settings. This means that a preprocessing step can make the Lasso, and several other methods, sign consistent with fewer restrictions on the design matrix. Furthermore, this preprocessing step is easy to implement and it is motivated by a wide body of research in numerical linear algebra.

The preconditioning described in this paper left multiplies the design matrix X and the response Y by a matrix $F = U D^{-1} U^T$, where $U$ and $D$ are derived from the SVD $X = U D V^T$. This preprocessing step makes the columns of the design matrix less correlated; while the original design matrix X might fail the irrepresentable condition, the new design matrix $FX$ can satisfy it. In low dimensions, the Puffer transformation ensures that the design matrix always satisfies the irrepresentable condition. In high dimensions, the Puffer transformation projects the design matrix onto the Stiefel manifold, and Theorem 2 shows that, asymptotically in high dimensions, most matrices on the Stiefel manifold satisfy the irrepresentable condition. Section 3.3 introduces the generalized Puffer transformation. Theorem 3 proves that one type of generalized Puffer transformation makes the Lasso sign consistent under drastically reduced assumptions on the singular values of X.
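The low dimensional claim can be verified directly: when n ≥ p and X has full column rank, $FX = U V^T$ has orthonormal columns, so any irrepresentable-condition value computed from it is zero. The sketch below is a Python/NumPy illustration. The function `ic_value` uses the usual form $\|X_{S^c}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta^*_S)\|_\infty$, which is assumed to match Equation 7, and the correlated design is an illustrative choice.

```python
import numpy as np

def puffer(X):
    """Puffer transformation F = U D^{-1} U^T from the thin SVD X = U D V^T."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    return (U / d) @ U.T

def ic_value(X, support, signs):
    """Assumed standard form of the irrepresentable-condition value.
    `signs` must be ordered by increasing column index within `support`."""
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[support] = True
    X_S, X_N = X[:, mask], X[:, ~mask]
    w = np.linalg.solve(X_S.T @ X_S, signs)
    return float(np.max(np.abs(X_N.T @ (X_S @ w))))

rng = np.random.default_rng(5)
n, p, rho = 100, 20, 0.85
# Strongly correlated columns via a shared factor (illustrative design).
X = np.sqrt(1 - rho) * rng.standard_normal((n, p)) \
    + np.sqrt(rho) * rng.standard_normal((n, 1))

support = np.arange(5)
signs = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

FX = puffer(X) @ X
gram_FX = FX.T @ FX

ic_before = ic_value(X, support, signs)
ic_after = ic_value(FX, support, signs)
print(ic_before, ic_after)
```

The Gram matrix of $FX$ is the identity here, which is why the value after preconditioning is numerically zero regardless of the correlation in X. In high dimensions (n < p) the rows of $FX$ are orthonormal instead, and the condition holds only with high probability, as described by Theorem 2.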
In our simulation settings, the Puffer transformation drastically improves the Lasso's estimation performance, particularly in high dimensions. This opens the door to several other important questions (theoretical, methodological, and applied) on how preconditioning can aid sparse high dimensional inference. For example, can preconditioning be formulated in a way that it both whitens the design matrix similarly to the Puffer transformation and also allows for fast computation?
This is the first paper to demonstrate how preconditioning the standard linear regression equation can circumvent the irrepresentable condition. This represents a computationally straightforward fix for the Lasso inspired by an extensive numerical linear algebra literature. The algorithm easily extends to high dimensions and, in our simulations, demonstrates a selection advantage and improved ℓ 2 performance over previous techniques in very high dimensions.