A statistical mechanics approach to de-biasing and uncertainty estimation in LASSO for random measurements

In high-dimensional statistical inference in which the number of parameters to be estimated is larger than that of the holding data, regularized linear estimation techniques are widely used. These techniques have, however, some drawbacks. First, estimators are biased in the sense that their absolute values are shrunk toward zero because of the regularization effect. Second, their statistical properties are difficult to characterize as they are given as numerical solutions to certain optimization problems. In this manuscript, we tackle such problems concerning LASSO, which is a widely used method for sparse linear estimation, when the measurement matrix is regarded as a sample from a rotationally invariant ensemble. We develop a new computationally feasible scheme to construct a de-biased estimator with a confidence interval and conduct hypothesis testing for the null hypothesis that a certain parameter vanishes. It is numerically confirmed that the proposed method successfully de-biases the LASSO estimator and constructs confidence intervals and p-values by experiments for noisy linear measurements.


Introduction
Estimating high-dimensional unknown variables from a limited number of data precisely and reliably is an important task in statistics, machine learning, signal processing, and so on. For instance, such demands arise in compressed sensing [1,2] and genomics [3]. Since, in these problems, the number of parameters often far surpasses that of observed data, it is clear that some sparsity assumptions on the parameters are necessary to reasonably estimate them. Therefore, one needs to simultaneously solve two problems: variable selection, which seeks relevant (or non-zero) parameters for the data generation process, and parameter estimation. In the past few decades, a number of methods have been developed to tackle such problems. One of the most successful approaches is the least absolute shrinkage and selection operator (LASSO) [4] method for high-dimensional linear regression problems in which the estimator is obtained by minimizing the L 1 norm regularized likelihood function. As LASSO estimators can be easily obtained by versatile algorithms for convex optimization [2,5] and have appealing consistency properties [6,7,8], they have received considerable attention.
Specifically, let us consider the linear measurement model: where x 0 ∈ R N and a i ∈ R N are the parameter (signal) and measurement vectors, respectively, σ 2 ∈ R is a parameter that describes the strength of the measurement noise, and N (µ, σ 2 ) is the normal distribution with mean µ and variance σ 2 . Notation ⊤ means the operation of matrix/vector transpose. In matrix notation, this model is expressed as where a ⊤ i corresponds to the i'-th row of the matrix A ∈ R M ×N . A is called the observation or measurement matrix by cases. The objective of high-dimensional linear regression is to find the parameter vector x 0 , where the number of measurements M is smaller than that of the parameter N. Note that in this high-dimensional setting, one cannot obtain a true solution with simple linear algebra because A ⊤ A is not invertible; by contrast, in the classical setting where M > N, the unique unbiased estimator is easily obtained as x classical = (A ⊤ A) −1 A ⊤ y by using the least squares method. To achieve this aim, LASSO seeks an estimator by solving an optimization problem that imposes sparsity via an L 1 penalty: x LASSO (y, A; λ) ≡ arg min where λ is a hyperparameter that controls the strength of the regularization. This convex optimization problem can be solved efficiently by using various versatile algorithms.
Although LASSO might be seen as simple heuristics, it has an appealing consistency property: in a certain sparsity condition on the true parameter x 0 and an appropriate control of the regularization strength λ, the LASSO solution and x 0 are consistent in the sense that x LASSO − x 0 2 2 /N vanishes as the measurement ratio γ ≡ M/N tends to infinity. For a more comprehensive review of LASSO in the context of high-dimensional settings, see [9].
Unfortunately, LASSO also has some drawbacks. First, the LASSO solution is biased as long as λ > 0 is finite. The amplitude of the LASSO estimator x LASSO is shrunk toward zero by the regularization term and its absolute value is typically smaller than that of the true parameter x 0 even in an ideal sparsity assumption. Second, no explicit form of the distribution is available for the estimator, as it is just expressed as a numerical solution of (3). Consequently, one can neither construct confidence intervals nor perform hypothesis testing for the null hypothesis that a certain element of the parameter vanishes. These bottlenecks are considered to be problematic in real applications in which the statistical reliability of the estimation result should be assessed. This situation is different from the one of classical statistics in which one can analytically obtain an unbiased estimator and its distribution.
To resolve the problems stated above, in this study, we develop a new scheme for de-biasing and uncertainty estimation in the LASSO estimation in the case that the observation matrix A is generated from rotationally invariant random matrix ensembles, which are concretely defined in the next section. The uncertainty addressed in this study concerns the randomness that arises from the random observation matrix A and measurement noise ξ. Our approach is based on a careful observation of the replica analysis of LASSO and an advanced mean-field method known as expectation consistent approximation or the adaptive Thouless-Anderson-Palmer (TAP) approach [10,11,12] developed in machine learning [13] and statistical mechanics. We numerically show that the proposed algorithm effectively de-biases the LASSO estimator and estimates its uncertainty.
The rest of this manuscript is organized as follows. In section 2, we explain the problem setting. In section 3, we describe the result of the replica analysis of LASSO and its physical implications. Then, the design of our scheme is introduced. The derivation of the free energy density is in Appendix A. In section 4, the proposed scheme is numerically tested by experiments for noisy linear measurements using various matrix ensembles. The last section provides a summary.

Model specification
In this study, we focus on random design models of (2), in which A is a random matrix and the true parameter vector x 0 is sparse in the sense that the number of its non-zero components is limited to ̺N (0 ≤ ̺ < 1). More precisely for A, we assume that for eigenvalue decomposition A ⊤ A = ODO ⊤ , O can be regarded as a random sample from the uniform distribution of the N × N orthogonal matrices and the empirical eigenvalue converges to a certain distribution ρ(λ) in the limit N → ∞ with probability one.

De-biasing and uncertainty estimation in LASSO
Let x LASSO (y, A; λ) be the LASSO estimator for the given y, A, and λ. We are interested in the two problems associated with x LASSO (y, A; λ). The first problem is that the LASSO estimator is biased. In other words, E x LASSO i A,ξ − x 0,i , (i = 1, 2, ..., M) remains finite for λ > 0 because of the shrinkage effect caused by the regularization term λ x 1 . The second is that the LASSO estimator does not have an explicit form of the distribution. As a consequence, one can neither construct a confidence interval nor compute a p-value to conduct hypothesis testing for the null hypothesis that a certain parameter vanishes.
In response to the aforementioned problems, we construct the following quantities. The first quantity is the de-biased estimators { x debiased The term de-biased means that this estimator coincides with the true parameter on average: The second quantity is the p-values to test whether the LASSO estimator is zero or not. We are interested in hypothesis testing with the null hypothesis H 0,i : x 0,i = 0. The confidence intervals concerning the de-biased estimators and hypothesis testing via p-values assess the uncertainty in LASSO.
In the past few years, several researchers have been working on the issue closely related to that stated here [14,15,16]. These studies discuss de-biasing and hypothesis testing in high-dimensional statistics for a fixed observation matrix where the randomness comes from the measurement noise, under tight sparsity assumptions on a true sparse signal, which corresponds to the ̺ → 0 limit in the current setting. In contrast to these studies, we concentrate on the case that the randomness comes from both the random observation matrix and the measurement noise without an explicit sparsity assumption on the true parameter keeping ̺ ∼ O(1).

Replica analysis for general rotationally invariant random design matrices and its physical implications
To investigate how the LASSO solution depends on the true solution, observation matrix, and measurement noise, we first evaluate the free energy density corresponding to the LASSO Hamiltonian H(x) ≡ y − Ax 2 2 /2 + λ x 1 at a zero-temperature limit: where β is the inverse temperature and Z is the partition function: We take the limit N → ∞ with γ = M/N ∼ O(1) fixed. In the zero-temperature limit β → ∞, the Boltzmann distribution e −βH(x) /Z is dominated by the configurations of the LASSO solution (3). Hence, one can evaluate how the LASSO estimator depends on x 0 , A, ξ by analyzing the macroscopic behavior of the typical free energy density (4) using statistical mechanics. Since the Hamiltonian defined above has a mean-field nature in the sense that all the variables are weakly connected, the free energy density (4) can be evaluated by using the replica method: where extr χ, χ,Q, Q,m, m F (χ, χ, Q, Q, m, m) denotes the extremization of the function F with respect to its arguments and G ′ (x; J) is the derivative of G(x; J) with respect to x. We have defined (...)Dz, J, G(x) as follows: where ρ J (s) is the asymptotic eigenvalue distribution of J. The derivative of the function G(x; J) has the following form: where z(x) is implicitly determined by the extremal condition of (9): The transformation S J that appears in (11) is called the Stieltjes transformation of ρ J . The introduced function G is connected to the R-transform R J (·) of the asymptotic eigenvalue distribution of J in studies of free probability theory [17]: Appendix A provides a brief derivation of the free energy density (6).
The connection between the free energy density (6) and macroscopic observables is as follows. At the extremum, Q, m, and χ correspond to the macroscopic physical observables: A,ξ /N. Each of these corresponds to the self-overlap, the overlap between the LASSO solutions and true solutions, and the macroscopic susceptibility. The notation ... represents the Boltzmann average in the zero-temperature limit: ... ≡ lim β→∞ (...)e −βH(x) dx/Z. In addition, from direct calculation, one can show the following relationships between the free energy density, regularization term, and residual sum of squares: wherer and RSS represent the per-element average of the regularization term and residual sum of squares, respectively. By using the relationships (13) and (15) and the extremal condition concerning Q, m, χ, the conjugate fields Q, m, χ can be represented via the macroscopic physical variables: Here, χ, G ′ (−χ; J) and G ′′ (−χ; J) are given as follows: where z ′ (−χ) is obtained from the derivative of equation (11): The minimization problem in equation (6) corresponds to the effective single body problem, which determines the value of the local magnetization x i . Splitting into effective single body problems from the original multi-body estimation problem is called the decoupling principle in the literature on information theory [18,19]. A comparison with the TAP/cavity analysis indicates that h i ≡ mx 0,i + χz i and m correspond to the local field and Onsager reaction coefficient, respectively [21]. Here, z i ∼ N (0, 1) effectively represents the randomness that comes from the observation matrix and measurement noise. Figure 1 schematically shows the distribution of the local fields and how the local field determines the LASSO solution. Each local field is distributed according to the normal distribution N ( mx 0,i , χ) and the LASSO solution is obtained by acting the soft-thresholding operator ST λ, Q on it: where Θ(z) is Heaviside's step function.
The LASSO solution takes a non-zero value if the amplitude of the corresponding local field is larger than λ. Conversely, if and only if it is smaller than λ, the LASSO solution is exactly zero. Hereafter, we call the non-zero and zero components of the LASSO solution the active and inactive components, respectively.
The above observations indicate that once the local fields and m, χ are estimated from the LASSO solutions, one can construct an intended p-value P i as and an unbiased estimator as with a confidence interval of significance α sig . These are the key observations for the design of our scheme.

Derivation of the adaptive TAP equations:
To derive the relation between the LASSO solution x LASSO (= x ) and the local fields, let us consider Gibbs free energy: The average x is determined as the global minimizer of G(m): x = arg min m G(m).
Once the above Gibbs free energy is exactly calculated, the extremal conditions of h and m generally associate the average x and local field [20]. However, the evaluation of equation (27) is computationally difficult in general. To overcome this difficulty, we take the following expectation consistent or the adaptive TAP approach [10,11,12]. First, we define an alternative Gibbs free energy: which provides the constraints on the first and macroscopic second moments so that x , |x| 2 /N = arg min m,Q G(m, Q).
Unfortunately, equation (28) is also difficult to evaluate directly. The adaptive TAP approach resorts this calculation to the following approximation: where φ(m, Q; l = 0), φ G (m, Q; l = 1), and φ G (m, Q; l = 0) are the free energies for the modified distributions: the first term is a factorized distribution but contains the original non-Gaussian prior factor e −βλ x 1 , while the second and third terms are the global and factorized multivariate Gaussian distribution that replaces the prior factor e −βλ x 1 with a Gaussian factor e −βΛ G x 2 2 /2 . In contrast to the original form of Gibbs free energy (28), adaptive TAP free energy (29) can be easily calculated as it is composed of only integration over the multivariate Gaussian and factorized distributions. The evaluation of the integrals and extremal conditions in the second and third terms of equation (29) provides the following expression of φ ada : where χ ≡ β(Q − q), q ≡ i m 2 i /N. It has been shown [11,22] that the above free energy φ ada (m, Q) is asymptotically consistent with replica theory in the sense that lim β→∞,N →∞ E [extr m,Q φ ada (m, Q)] A,ξ /N = E [f ] A,ξ when A is a sample from the rotationally invariant ensemble. Thus, the extremal condition on h, Λ, m, Q and linear response argument give the intended TAP/cavity equations, which connect the local field and LASSO estimator for the current matrix ensembles for β → ∞, N → ∞:

3.2.2.
General construction procedure of the de-biased estimator, confidence interval, and p-value: In summary, once the LASSO estimator x LASSO (y, A; λ) is obtained for a set of (y, A; λ) by using versatile algorithms for the optimization problem (3) such as least-angle regression [25], coordinate descent [26], and approximate message passing [7,27], one can estimate the local fields h(y, A; λ), de-biased estimator x debiased (y, A; λ), confidence interval {I i (α sig )} i , and p-value P i as follows. We emphasize here that there is no need to use the derived TAP equation to obtain a LASSO estimator.
Note that a consistent estimator of the error variance σ 2 should be needed when it is unknown.

Settings
We perform numerical experiments to assess the usefulness of the proposed scheme. For this, we artificially generate the true sparse parameter x 0 , observation matrix A, and measurement noise ξ. The true sparse parameter x 0 is generated from the Bernoulli-Gauss distribution: Then, the form of G ′ (−χ; J), G ′′ (−χ; J), χ and Q are given as follows: By substituting the above expressions of G ′ , G ′′ into (16), one can show that χ does not depend on σ 2 . This is the characteristic property of this ensemble. Generally, χ depends on the measurement noise σ 2 .
(ii) The row-orthogonal ensemble [22,29] constructed by randomly selecting M rows from a randomly generated N × N orthogonal matrix. For this ensemble, the asymptotic eigenvalue distribution is given as ρ(s) = (1 − γ)δ(s) + γδ(s − 1). In this case, the form of G ′ (−χ; J), G ′′ (−χ; J), χ and Q are given as follows: where (iii) The random discrete cosine transform (DCT) ensemble in which A is constructed by randomly selecting M rows from N ×N DCT matrix. While this ensemble shares the same eigenvalue distribution as the row-orthogonal one, it is much more relevant for practical purposes, as the computational cost for observation and inference can be significantly reduced by using the fast Fourier transform technique. In addition, although the rotationally invariant assumption on O does not hold, this ensemble is also compatible with the current adaptive TAP scheme, as pointed by [24].
(iv) The geometric setup [29,30] in which A is constructed as A = UΣV ⊤ , where U ∈ R M ×M and V ∈ R N ×N are random samples from the uniform distribution of orthogonal matrices, and Σ ∈ R M ×N is a diagonal matrix whose (i, i)-th element is given by ν i ∝ τ i−1 for i = 1, 2, ..., M. The parameter τ ∈ (0, 1] is chosen so that the given value of the peak-to-average eigenvalue ratio is met and the singular values are scaled to satisfy the power constraint 1 = where η and B are related to the peak-to-average ratio κ: In this case, the explicit form of G ′ , G ′′ cannot be obtained. Thus, it should be evaluated numerically. To achieve this aim, we conduct the procedure explained in section 3.2.2, using the expression of the Stieltjes transform and z ′ (−χ): . (66) We mainly use the random i.i.d. Gaussian ensemble and random DCT ensemble for the numerical experiments. The geometric setup is only used in section 4.2.3. We do not use the original row-orthogonal setup.
Once a tuple of (x 0 , A, ξ) is generated, we calculate x LASSO , h, χ, χ and Q = Λ, x debiased by using the procedure explained in section 3.2.2. To estimate the error variance σ 2 needed in the random DCT case, we use the naive cross-validation-based estimator: whereλ is selected by K-fold cross-validation. In [23], it is empirically shown that this estimator robustly estimates the error variance, more so than its competitors. We use N s = 1000 different sets of pairs (A, ξ) for fixed x 0 to evaluate the statistical properties of the observables. We set ̺ = 0.1, γ = 0.5, σ 2 = 0.02, and K = 40, except for the geometric setup. In the geometric setup, we set ̺ = 0.1, γ = 0.8, σ 2 = 0.02, and κ = 8.

Distribution of the local fields and de-biased estimators:
First, we examine the statistical properties of the local fields and de-biased estimators. Figure 2 plots the sample quantiles of {(h i − Qx 0,i )/ χ} i versus the theoretical quantiles of the standard normal distribution for one configuration of (x 0 , A, ξ). It is clear that all the points are close to the line with unit slope and zero intercept. Further, Figure 3 plots the average values of the x LASSO and x debiased versus the true parameter x 0 . In contrast to the LASSO estimators, which are shrunk toward zero by the regularization term, x debiased efficiently reduces the LASSO estimator's bias. The average is taken over N s realizations of (A, ξ). These results validate our theoretical predictions on the local fields and de-biased estimators. Figure 4 plots the constructed de-biased estimators and their 95% confidence intervals. We show only the first 80 components for the sake of clarity. Although

Hypothesis testing:
An important advantage of the proposed scheme over LASSO is that it provides a hypothesis testing method with a null hypothesis that a certain parameter vanishes. Although LASSO provides a parameter selection rule that selects an active component set A(y, A; λ) as A(y, A; λ) = {i| x LASSO i (y, A; λ) = 0}, it cannot measure the statistical significance for finding an active component.
Specifically, we are interested in testing an individual null hypothesis H 0,i : x 0,i = 0  versus the alternative hypothesis H 1,i : x 0,i = 0, assigning a p-value of P i for these tests.
To this end, we evaluate the p-value of {P i } by using equation (23) for a two-tailed test. Then, the decision rule is to reject the null hypothesis H 0,i if the observed p-value P i is lower thanα sig and to accept the alternative hypothesis otherwise: whereα sig is the significance level. We use T as a rejection flag. This procedure ensures that the type I error probability or the FPR isα sig . Here, the FPR is the probability of falsely rejecting the null hypothesis H 0,i : Indeed, Figure 6 shows that the significance levelα sig and empirical true positive rate (TPR) are in excellent agreement. Further, we examine the relation between the FPR and TPR or the statistical power achieved by LASSO and our hypothesis testing procedure. Here, the TPR is the probability that the test correctly rejects the null hypothesis H 0,i : Note that although we can control the FPR by varying the significance levelα sig , the TPR cannot be controlled. Thus, a performance measure of the variable selection procedure by hypothesis testing can be given as the TPR for each value of the FPR. We evaluate the performance of hypothesis testing by using the ROC curve, which plots the TPR versus the FPR as an implicit function ofα sig . We examine the TPR and FPR by varying the significance levelα sig for each regularization parameter λ. For comparison purposes, we also plot the ROC curve for LASSO. For LASSO, the TPR and FPR are examined by changing the regularization parameter λ. Figure 7 summarizes the results averaged over N s configurations of (A, ξ). It is observed that for some values of λ around which the variance of the de-biased estimator is minimized, our testing procedure performs slightly better than LASSO in the sense that the TPR of the testing method is slightly larger than that of LASSO's one for some values of the FPR. In the case of LASSO, when the measurement ratio γ is sufficiently small, the TPR and FPR do not coincide with (1, 1) for finite λ > 0, as the consistency property does not hold in such a situation and the number of active components of the LASSO estimator is always smaller than min(N, M) [9]. On the contrary, as our hypothesis testing procedure always approaches the point (1, 1) from (0, 0), we can examine the TPR for all the values of the FPR ∈ [0, 1]. The superiority of the TPR comes from the fact that we are using the knowledge of the ensemble of the observation matrix. Further, as the hypothesis testing procedure controls the FPR and TPR by varying the significance α sig but not λ, one does not suffer from the shrinkage effect in the variable selection procedure. This is another advantage over variable selection by LASSO. These observations show the utility of our hypothesis testing procedure.

Hyperparameter selection via confidence interval minimization:
The issue of hyperparameter selection is noteworthy here. As LASSO has the hyperparameter λ that controls the strength of the regularization, one should choose a value of λ based on some criteria. As shown in Figure 5, the estimated variance of the de-biased estimator χ/ Q 2 has a minimum value at some λ > 0. At this point, one can estimate x 0 with the highest conviction in the sense that the confidence interval has the smallest width. It is therefore expected that the estimated variance of the de-biased estimator itself serves as a hyperparameter selection criterion. Indeed, in the i.i.d. Gaussian and row-orthogonal/random DCT cases, one can analytically show that minimizing the confidence interval is the equivalent to the minimization of the leave-one-out crossvalidation error C: the row-orthogonal or the random DCT cases.
(71) In other words, the leave-one-out cross-validation error and width of the confidence intervals are related with the linear transformation in these cases ( Figure 8).
Here, the leave-one-out cross-validation error is a widely used hyperparameter selection criterion that evaluates prediction performance, defined as follows: where the symbol \i denotes the absence of the i-th component (e.g., a \i = (a 1 , ..., a i−1 , a i+1 , ..., a N ) ⊤ ) and each term in the summation evaluates the fitness to the i-th data when the true signal is inferred from the other data. In the settings considered here, the above leave-one-out cross-validation error is expressed as follows [32,33]: By substituting the expression of the leave-one-out cross-validation error (73) into equation (16), the relations (71) are obtained.
To investigate the validity of the above observation that the confidence interval minimization and leave-one-out cross-validation error minimization provide the same λ, we test the geometric setup case in which χ/ Q 2 is not expressed as a linear function of C. Figure 9 compares the variance of { x debiased − x 0,i } i with the leave-one-out crossvalidation error (73). Surprisingly, the minimization of these two quantities seems to provide the same value of λ, although they do not have a functional relation as (71).
From the above observations, we speculate that the minimization of the confidence interval proposed here and the minimization of the leave-one-out cross-validation error yields the same value of λ for LASSO in general rotationally invariant observation matrices, but further investigation in this direction is still needed.

Summary
We developed a new computationally feasible scheme for de-biasing and uncertainty estimation in LASSO in the case of rotationally invariant observation matrix ensembles and validated the proposed scheme by using numerical experiments. We focused on the development of a de-biased estimator that has a confidence interval and hypothesis testing scheme for the null hypothesis that a certain parameter vanishes.
The numerical experiments showed that the proposed method efficiently constructed de-biased estimators with confidence intervals and p-values for the intended hypothesis testing. We revealed that the proposed hypothesis testing slightly improved the variable selection performance in the sense that the TPR of the testing method achieves a slightly larger value than that of the LASSO's one for some values of the FPR. Further, we examined the utility of the estimator of the confidence interval as a criterion for determining the hyperparameter. Surprisingly, minimizing the width of the confidence interval was equivalent to the minimization of the leave-one-out cross-validation error in our investigation.
Although we only focused on LASSO for linear models, future work could include an extension to other sparse regression methods such as the elastic net [34] as well as generalized linear models.

Appendix A. Derivation of the Free Energy Density
To take the average that appears in (4), we use the replica method [31] based on the identity for n ∈ R: In the replica method, we first take the average of the n-th power of the partition function over the randomness of A, ξ for the positive integer n ∈ N, and then analytically continue the obtained expression to real n ∈ R to take the limit n → 0, exchanging the order of the limits. For the general matrix ensembles considered here, it is convenient to first take the average over ξ. By taking this average, we obtain the following expression under the replica symmetric ansatz: where L, u a , and S are defined as follows: where u a and x a are the a-th replica's variable. In [35], it was shown that under the rotational invariance assumption on the random matrix J = A ⊤ A for eigenvalue decomposition J = ODO ⊤ considered in this study, the average over A that appears in equation where G(x; J) is the function defined in (9). Under the replica symmetric ansatz, L/N has three types of eigenvalues: s 1 = β∆Q − βn(q − 2m + ̺) + nβ 2 σ 2 ∆Q, s 2 = −β∆Q, and s 3 = 0. The number of degeneracy is 1, n − 1, and N − n, respectively. Thus, we obtain the following expression up to the leading order in n: On the contrary, by using the Fourier transform of the delta function and Hubbard-Stratonovich transform: e B 2 /2A = e −Ax 2 /2+Bx A 2π dx for A > 0, B ∈ R, the factor e S is given as follows: ln φ(x 0,i , z i , Q, q, m; β, λ)Dz i d Qd qd m, (A.8) φ(x 0,i , z i , Q, q, m; β, λ) = exp − Q + q 2 For β → ∞, the relevant variables scale as β(Q − q) = χ, Q + q = β Q, m = β m, and q = β 2 χ of order unity to ensure an appropriate limit f exists. Finally, by combining equations (A.7-A.9) and evaluating the integrals by adopting the saddle point method, we obtain equation (6) for β, N → ∞.