Cross validation in LASSO and its acceleration

We investigate leave-one-out cross validation (CV) as a determinator of the weight of the penalty term in the least absolute shrinkage and selection operator (LASSO). First, on the basis of the message passing algorithm and a perturbative discussion assuming that the number of observations is sufficiently large, we provide simple formulas for approximately assessing two types of CV errors, which enable us to significantly reduce the necessary cost of computation. These formulas also provide a simple connection of the CV errors to the residual sums of squares between the reconstructed and the given measurements. Second, on the basis of this finding, we analytically evaluate the CV errors when the design matrix is given as a simple random matrix in the large size limit by using the replica method. Finally, these results are compared with those of numerical simulations on finite-size systems and are confirmed to be correct. We also apply the simple formulas of the first type of CV error to an actual dataset of the supernovae.


Introduction
Extracting rules from data has been at the heart of modern sciences. Johannes Kepler discovered his laws of planetary motion by examining the data of planetary orbits by trial and error, which later led to classical mechanics. Max Planck proposed his law of the black body heat radiation for accurately describing experimental data, which played a key role in the discovery of quantum mechanics. As these examples imply, rule extraction has relied mainly on human thoughts.
The never-ending innovation of measurements and experimental techniques is now resulting in the ongoing creation of a large amount of high-dimensional observation data every day. This provides us with situations where rule extraction from data is desired considerably more frequently than ever. Although the entire set of DNA sequences of human beings was identified in 2003, considerable effort must still be made in the days ahead for finding out what rules are written in the dataset. Worldwide observation networks of global climate are being consolidated, but analyzing the observed data in detail is indispensable for understanding the mechanism of global warming. Unfortunately, mechanisms underlying the genome and the global climate are considerably more complicated than those of the planetary orbits and the heat radiation. This makes it difficult to discover rules only by human thoughts as has been done thus far.
Sparse modeling may be a promising framework for resolving such difficulty [1, 2,3,4]. This generally means methods of statistical modeling or machine learning that describe rules by using a large number of parameters and select a "sparse" model in which many of the parameters are set to zero by minimizing sparsity-inducing penalties in conjunction with imposing a good fit to the observed data. Modeling methods of this type are preferable in the sense that one can discover a simple and reasonable rule in a semi-automatic manner, with little resort to human thoughts, from a set of many rules that represent various possible relations. The least absolute shrinkage and selection operator (LASSO) is a representative method of the sparse modeling [5,6]. In this method, many coefficients of large-dimensional linear regression are pruned by the effect of the ℓ 1 penalty that is defined by the sum of the absolute values of the coefficients. This technique has applications in a wide variety of fields, such as image processing [7], ecology [8], genetics [9], and astronomy [10,11]. A similar method is known for the signal recovery problem of compressed sensing [12,13,14,15], which exploits the intrinsic sparsity of objective signals for enhancing the signal processing performance [16,17,18,19,20,21,22,23].
LASSO is, however, required to solve another problem of determining the strength λ of the penalty term. Cross validation (CV) is a practically useful strategy for handling this task; its basic concept is to evaluate the prediction error by examining the data under control. Smaller values of the CV error are expected to be better to express the generative model of the data. The minimum, if it exists, of the CV error when changing λ is thus considered to obtain an optimal value of λ. Unfortunately, this reasonable strategy is not well controlled because the behavior of the CV error itself is not fully understood. In particular, there are several variants in the definition of the CV error, each of which can exhibit a different behavior and choose a different optimal value of λ. Further, conducting CV in a naive manner incurs high computational costs, which makes it difficult to systematically study the behavior of these variants. Even worse, this computational difficulty sometimes forces certain compromises such as scaling down the system size, usage of uncontrolled approximations, or even modifications in research plans.
Given the situation, in this study, we treat leave-one-out (LOO) CV and investigate two types of CV errors, to clarify their properties. Efficient formulas to calculate these two errors are proposed by using belief propagation (BP) in computer science or the cavity method in statistical mechanics [24,25,26], in a perturbative manner. A similar formula has also been proposed for Bayesian learning of simple perceptrons in [27]. Our derivation is analogous to that of the approximate message passing (AMP) algorithm [17,20,21,28]. The resultant formulas have two advantages: The computational cost of the resultant algorithm is considerably reduced from that of the naive algorithm; this reveals a simple connection of the CV errors to the residual sums of squares (RSSs) between the reconstructed and the given measurements, in the large system limit.
In response to this second finding, we analytically assess the two CV errors and the corresponding RSSs to reveal their general properties, in the large system limit under the assumption that the measurement matrix is a random matrix, each component of which is independently identically distributed (i.i.d.) from the zero-mean normal distribution. It is commonly found that both the CV errors exhibit their unique minimums as λ changes, but the locations of the minimums are different, and hence, the chosen "optimal" values of λ are discriminably different between the two CV errors. We compare these two values of λ by using the so-called receiver operating characteristic (ROC) curve and compare them to the so-called Younden's index, to find that in the weak noise case, the second CV error chooses a more preferable value of λ than the first one. This can be attributed to the fact that the first error tends to overestimate the false positive ratio. Unfortunately, however, our analytical result also clarifies that the above simple formulas derived by the BP in a perturbative manner are not applicable to the second CV error. This is understood by an intricate discussion on the change of the chosen variables in the leave-one-out procedure. These findings are confirmed by numerical experiments on finite-size systems, and our formula is clarified to work well for moderate-size systems.
The rest of this paper is organized as follows: In sec. 2, we state LASSO in the context of compressed sensing and explain the LOO CV. In sec. 3, we explain the application of the cavity method to the evaluation of the CV errors in the LOO CV, clarifying the relation be-tween the CV errors and the RSSs. In sec. 4, we present the analytical result in the case of a random-observation matrix. In sec. 5, we show the result of numerical experiments to support our algorithm and analytical results. An application of the proposed method to the Type Ia supernova data is also presented in this section. The last section is devoted to the conclusion.

Problem setting
Here, we state our problem setting and summarize the quantities of interest. These quantities are analyzed in the subsequent sections to clarify the behavior of CV errors in the LOO CV procedure.

Compressed sensing based on LASSO
In this paper, we introduce LASSO in the context of compressed sensing. Let us suppose that a vector y ∈ R M of measurement is generated from an unknown signal vectorx ∈ R N , which is assumed to be sparse, through the following linear process: where A = {A µi } µ=1,··· ,M ; i=1,···N ∈ R M ×N represents a measurement (design) matrix and ξ ∈ R M denotes the measurement noise each component of which is drawn from the zero-mean normal distribution with variance σ 2 ξ , indicated by N (0, σ 2 ξ ). The number of measurements M is supposed to be smaller than the dimensions of the representation N . Currently, we do not specify the ensemble of A but only assume the scaling of the component as A µi = O(1/ √ N). On this condition, we infer the representationx from the given measurement y, by utilizing the sparseness ofx. The sparsity ofx is quantified as follows: where |x| 0 results in zero if x = 0 and unity otherwise. The symbol || · || 0 is called ℓ 0 -norm. We can also introduce ℓ k -norm as follows: Given an inferred signal x, we introduce the RSS, E, and the rate, ǫ, as follows: The inference of x based on LASSO is expressed as follows: Unfortunately, the result is biased; i.e., x (1) (λ) does not agree withx even as M increases because of the presence of the penalty term λ||x|| 1 in eq. (5). A conventional alternative to x (1) is obtained by minimizing the RSS on the choice of column vectors associated with x (1) . This can be formulated as follows: x (2) (λ) = arg min where • denotes the Hadamard product defined as (v • w) i = v i w i , and |·| 0 of a vector is defined as (|v| 0 ) i = |v i | 0 . Corresponding to the two inferred signals x (1) and x (2) , we define the two RSSs as follows: 2.2 Leave-one-out cross validation The idea of LOO CV is as follows: We select one measurement µ among M measurements and leave it out while inferring the signal, which is formally written as follows: x (1)\µ (λ) = arg min where the symbol \µ denotes the absence of the µth observation. Using this, we can evaluate a CV error on the µth measurement as Summing this up for all µ = 1, · · · , M gives the CV error of the first type: We can reduce the bias effect by the regularization as eq. (6), defining The second type of the CV error can thus be expressed as follows: We particularly call these errors the LOO errors (LOOEs); they are central quantities of the analysis described in the subsequent sections.

Quantities to examine the quality of inference
In addition to the LOOEs, we need quantities to examine the quality of inference and compare them with the LOOEs. For this, we introduce the mean squared error (MSE) between the true and the inferred signal. Corresponding to the two inferred signals x (1) and x (2) in eqs. (5,6), the two MSEs can be calculated as follows: Further, the ratios of the correctly or incorrectly chosen variables are important. The true positive ratio, T P , and the false positive ratio, F P , are as follows: The so-called ROC curve is a plot of T P against F P . This characterizes the quality of inference: If a ROC curve is farther from the straight line T P = F P , then the inference is better. Accordingly, we also refer to the so-called Youden's index and define an "optimal" value of λ chosen according to this index, λ YI . The definition is as follows: 3 Message passing for leave-one-out error Suppose that the system size and the number of observations, N and M , are sufficiently large, implying that an inferred signal does not change considerably by an addition or deletion of observations. This assumption enables us to perform a perturbative treatment, clarifying the relationship between the LOOEs and the RSSs.

Revisiting approximate message passing
Let us start by stating a derivation of the known AMP algorithm. This can be done by using the belief propagation (BP) or the cavity method [12,20,21,28]. For this, we present a probabilistic formulation for the present problem on the basis of the prescriptions of statistical mechanics. We introduce a Hamiltonian, partition function, and Boltzmann distribution, respectively, as follows: where β denotes the inverse temperature and Φ µ represents the so-called potential function Note that β is independent of the strength of the observation noise σ ξ , and the limit β → ∞ is supposed to be taken after all the calculations as we are interested in the minimum of the above Hamiltonian. We denote the average over the Boltzmann distribution by the angular brackets · · · . BP allows us to calculate the marginal distribution by using two types of messages (i, j, k = 1, · · · , N, µ, ν = 1, · · · , M ) as follows: A crucial observation to assess eqs. (20,21) is that the exponent of the potential function has a sum of an extensive number of random variables; the central limit theorem thus justifies treating it as a Gaussian variable with the appropriate mean and variance. Hence, according to eq. (20), where x i is special, we can divide the extensive sum of Φ µ as follows: where z denotes the zero-mean unit-variance Gaussian variable. The second term on the righthand side represents the mean of j( =i) A µj x j , and thus, where the angular brackets · · · \µ denote the average over the Boltzmann distribution without the µth potential function. Now, let us call x \µ j cavity magnetization. The last term is derived by calculating the variance of j( =i) A µj x j as follows: where χ \µ jk is called the susceptibility matrix (without the µth observation), which quantifies the correlations between the variables. The terms added at the beginning of the second line have a small contribution of the scaling O (1/N ); thus, they are negligible, and their addition is justified.
The application of eq. (22) in eq. (20) replaces the integration over x to that over z. Performing this integration yields the following: Combining this formula with eqs. (20,21), we can derive a recursion relation of x \µ j , leading to the conventional BP equation.
A more convenient recursion relation can be obtained in terms of the full magnetization {x j = x j } instead of the cavity magnetization x \µ j = x j \µ . The full marginal distribution of x i necessarily takes the following modified Gaussian form: where and we set Hereafter, we call a µ the cavity residual. We can interpret that the cavity residual a µ contributes the µth observation to the effective field h i . Thus, the full magnetizationx j is obtained from x \µ j by adding this contribution of a µ to the effective field in a perturbative manner. This consideration yields the following: Note that we consider only the variation of h k and do not take into account the change in Γ k when adding the µth observation. This is because the µth observation's contribution to Γ k is proportional to A 2 µk = O(1/N ) and is smaller than that to h k proportional to Basically, our perturbation is connected to the smallness of A µk = O 1/ √ N , and only the linear term with respect to A µk is important. By substituting eq. (30) into eq. (29) and solving it with respect to a µ , we obtain a simple expression of a µ in terms of the full magnetization Similarly, the substitution of eq. (30) into m \µ i in eq. (28) yields the following: The neglected terms are proportional to the third and higher orders of A µi . The last term in eq. (32) is the well-known Onsager reaction term. The full magnetizationx i is a function of only Γ i and h i ; thus, now, we can calculatex i by recursion if the values of {Γ i } are determined. The functional form ofx i becomes simple in the limit β → ∞ and is identified with x (1) in eq. (5). The fixed point of the AMP is described in this limit as follows: where Θ(x) denotes the step function giving 1 for x ≥ 0 and 0 otherwise, and the superscript (1) is attached according to eq. (5).
Coefficients {Γ i } can be determined using BP in a similar manner; however, this is not an easy task. Therefore, we omit the derivation and just refer to [20,21] for the case of weak correlations where only diagonal terms are important, βV µ ≈ N j=1 A 2 µj χ \µ jj . The BP algorithm can also be applied for calculating x (2) . The derivation is essentially the same as eq. (33), and the result is as follows: i . This implies that x (2) are evaluated by solving eq. (34) conditioned by the solution of eq. (33). As explained later, this difference leads to a difficulty in evaluating ǫ (2) LOO , in contrast to ǫ

Simple formulas of leave-one-out error
For deriving the AMP, we conducted a perturbation onx \µ j . In the zero-temperature limit,x \µ j is identified with x (1)\µ j in eq. (9). By inserting eq. (30) in eq. (9), we obtain the following: This has a considerable advantage compared to eq. (9): Eq. (9) requires us to solve the optimization problem (8) M times for evaluating ǫ (1) LOO , but in the case of eq. (35), we need to solve the optimization of (5) only once.
The susceptibility matrix χ \µ ij is the origin of difficulty in the computation of {Γ i }. Fortunately, once the solution of eq. (5), x (1) , is obtained, this can be easily calculated. LASSO separates the variables into two types: Some variables become zero as the solution of eq. (5) and are called inactive; the other variables take finite values and are active. Suppose that the active and inactive variables are known and thatÃ is the submatrix corresponding to the active set. The active parts of x and χ \µ are also introduced asx andχ \µ , respectively. To evaluatẽ χ \µ , we need to determinex in the case without the µth observation, and denote the solution asx (1)\µ in accordance with eq. (8). Correspondingly, we introduce a notation y \µ expressing y without the µth component andÃ \µ representingÃ without the µth row. We assume that the active and inactive sets are stable: A small perturbation δh does not change these sets 1 . By using these notations and assumptions, we can now easily obtainx as follows: Considering the variation with respect to δh, we obtain the following: All the components of the inactive part of χ \µ are zero, and thus, the susceptibility matrix is fully calculated. Eq. (37) is seemingly inefficient in that the evaluation ofχ \µ for all µ requires M inverse operations of a matrix, which is computationally expensive. Fortunately, this computational difficulty can be overcome by using the Sherman-Morrison formula. Denoting u T µ as the µth row vector ofÃ, we obtain the following: Hence,χ \µ is calculated from Ã TÃ −1 by using a small number of simple products of matrices.
A similar discussion seems to be applicable to the evaluation of eq. (10). The active and inactive sets are common with x (1) by definition, and the values of active variables can be calculated as follows: This provides the same susceptibility matrix as eq. (37). Hence, the second type of LOOE can also be approximated as follows: Unfortunately, this approximation is not correct while eq. (35) is validated. The detailed reasoning is given in sec. 5.1.1.

In the large-size limit
In the case of the limit N → ∞, we can obtain an analytic formula forχ ij under certain conditions and thus, considerably simplify the computations of eqs. (35,40). We will derive this analytic formula in the next section; here, we will just refer to the result: where ρ(λ) = (1/N )||x (1) (λ)|| 0 denotes the sparsity of the inferred signal. This result is derived under the assumption that the observation matrix is a random matrix each component of which is i.i.d. from the zero-mean normal distribution N (0, N −1 ). In such a case, the non-diagonal part of the susceptibility becomes irrelevant as reported in [20,21]. The resultant formula of the LOOEs is now very simple Using this formula, in the next section, we examine the behavior of the LOOEs in the limit N → ∞ when λ is changed.
Before moving to the next section, we have two comments to make on eq. (42). One is about the relationship to the Akaike information criterion (AIC). It is known that the LOOE asymptotically agrees with AIC in general. This can be directly seen by expanding eq. (42) with respect to ρ/α in the limit ρ/α ≪ 1: The second term is expected to converge to 2N ρσ 2 ξ in the limit ρ/α ≪ 1, yielding the expression of AIC. The other comment is about the robustness of eq. (42). We have numerically examined some different ensembles of the observation matrix A and found that the relation (42) between the LOOE and the RSS seems to be fairly robust, while eq. (41) is not. We have observed that the non-diagonal components of χ become important in certain ensembles in which components of A are correlated. These non-diagonal components are complicated, but presumably as a result of nontrivial cancellations, the simple relation 1 + ij A µi A µj χ \µ ij 2 → (α/(α − ρ)) 2 seems to hold widely. This finding has an important consequence: Even for realistic situations where the observation matrix is far from the random matrix, eq. (42) can be used for accurate approximation of the LOOE. We will revisit this point later when treating the real data of the Type Ia supernovae in sec. 5. Apart from obtaining such a realistic benefit, we should examine whether the relation (42) actually holds for a wider ensemble of A than the simple random matrix ensemble in a more systematic manner, which will be an important future work.

Analytic formulas on the random observation matrix
In this section, in the case of the large-system limit N → ∞, we derive an analytic formula of quantities of interest such as the RSSs E 1,2 and MSEs M 1,2 , under the assumption that the observation matrix is a random matrix, each component of which is i.i.d. from N (0, 1/N ). Considering this limit, we keep the ratio α = M/N (< 1) finite along with the sparsity of the true signalρ = ||x|| 0 /N .
As noted in sec. 2, the noise ξ is i.i.d. from the normal distribution N (0, σ 2 ξ ). Moreover, the ensemble of the true signalx is assumed to be the Bernoulli-Gaussian distribution: Following statistical mechanical jargon, we call the average over A, ξ, andx configurational average, which is represented by square brackets with appropriate subscripts. For example, the average over ξ andx is written as [· · ·] ξ,x . In this section, we only state the outline of our analysis, give the resultant formulas of the free energies and the related quantities, and show some plots of the quantities of interest. The detailed derivations are deferred to Appendix A.

Outline of analysis
Following the usual prescriptions of statistical mechanics, we define the Hamiltonian H and the partition function Z. According to eqs. (5,6), we define two Hamiltonians as follows: The corresponding partition functions Z 1 and Z 2 are defined as follows: where and the Boltzmann distributions can be expressed as follows: Note that x (1) conditioning H 2 , Z 2 and p 2 is drawn from p 1 . We assume that the average over these Boltzmann distributions p 1 and p 2 is denoted by angular brackets with an appropriate subscript. We also introduce double angular brackets denoting the average over both p 1 and p 2 , · · · ≡ · · · 2 1 . The averaged free energies f 1 and f 2 can thus be defined as follows: These are the central objects of our analysis. The rates of RSSs, ǫ 1 and ǫ 2 , are derived in the zero-temperature limit. Other quantities of interest can also be derived from the free energies.
To take the configurational average and the average over x (1) in eq. (52), we use the replica method. For the evaluation of f 2 , we need to introduce two different replica numbers: n for 1/Z 1 in p 1 and ν for log Z 2 . Correspondingly, we introduce the following replica-generating functions: We derive the free energies from Φ 1 and Φ 2 by using the following identities: In the actual procedure, we first assume that n and ν in eqs. (53,54) are positive integers, which enables us to calculate the average over the quenched variables as well as the integrations over x (1) . Then, we assume the replica symmetry (RS) in the order parameters explained next, which makes it possible to analytically continue the resultant formulas of Φ 2 with respect to n and ν. Finally, using the analytic continuation, we calculate the limits n → 0 and ν → 0, yielding the free energies.

Order parameters and their significance
Let us start by summarizing the order parameters characterizing the free energies as follows: Their meaning is simple: m 1,2 denote the overlaps with the true signal; Q 1,2 represent the lengths of the reconstructed signals; q 1,2 quantify the lengths of the averaged reconstructed signals; and q c and Q c indicate the overlaps between the two reconstructed signals, the former reflects the fluctuation, and the latter does not. The MSEs (12) are connected to the order parameters as follows: For simplicity of notation, we also introduce the following symbols: In the calculation of the zero-temperature limit β → ∞, the thermal fluctuation shrinks and the order parameters Q and q have a common value. The following order parameters become finite in the limit β → ∞: From the definition of the order parameters (57), we can understand that χ 1 and χ 2 are nothing but the average of the diagonal part of the susceptibility matrix and that an equality χ 1 = χ 2 holds as discussed above.

f 1 -related quantities
We are only interested in the zero-temperature limit β → ∞, and write the explicit formula of f 1 in this limit as follows: where Ω 1 = χ 1 , Q 1 , m 1 ,χ 1 ,Q 1 ,m 1 and Extr x represents taking an extremization condition with respect to x. The variableχ 1 is a conjugate order parameter of χ 1 , as are the other hatted variables except forρ. Further, Next, variational conditions with respect to Ω 1 yield the following equations of state (EOSs) of Ω 1 :χ Thus, the sparsity of the reconstructed signal ρ, the true positive ratio T P , and the false positive ratio F P can be expressed as follows: From the EOSs, we get the following simple relations: The free energy f 1 includes the contribution of the regularization term λ||x (1) || 1 . This contribution can be represented by using the relation (65) as follows: Subtracting this from f 1 and using eq. (65) again, we obtain the following simple formula of ǫ 1 :

f 2 -related quantities
Similarly, the formula of f 2 is as follows: where Ω 2 = χ 2 , Q 2 , m 2 ,χ 2 ,Q 2 ,m 2 , χ c , Q c ,χ c ,Q c and The EOSs with respect to Ω 2 are as follows: For the order parameters of Ω 1 appearing in eq. (70), we insert the solutions of the extremization condition considered in eq. (65). From the EOSs, we obtain the following simple relations: As expected, the two susceptibilities χ 1 and χ 2 coincide.

LOOEs, MSEs, and ROC curve
Since the non-diagonal part of χ \µ ij can be neglected and the diagonal part is . From eqs. (42,65) and (69), the first type of LOOE becomes Hence, the LOOE directly connects to the corresponding MSE and calculating its minimum is meaningful. This result is natural and can be derived from a simple consideration. Given x (1)\µ , let us consider the difference between y µ and the counter part of the reconstructed data y (1) In the present situation, none of the rows of A and none of the components of ξ are correlated, and hence, x (1)\µ is also uncorrelated with {A µi } i and ξ µ . This implies that the average of (y µ − y (1) µ ) 2 over ξ µ and {A µi } i is as follows: The last relation follows from the smallness of the difference between x (1)\µ and x (1) . This relation immediately leads to eq. (74). This correspondence between the MSE and CV error can be proved in a more rigorous way [29,30]. Clearly, this discussion is applied to ǫ LOO since x (2)\µ is again uncorrelated with {A µi } i and ξ µ , and ǫ However, our calculation based on eq. (42), combined with the replica result (72,73c), yields the following: Only the last term is desired, but the other two terms appear and persist. Eq. (42) thus provides an incorrect approximation of ǫ LOO by eq. (77) (green). The deviation between the blue and the green curves is not negligible and is qualitatively different. The incorrect one converges to zero in the limit ρ → α, but the LOO is not negligible. The minimum value of the correct ǫ (2) LOO is located at a smaller ρ than that of ǫ LOO . This implies that the approximation (77) is completely useless. Its reasoning will be given later in sec. 5.1.1.
We observe that ǫ  LOO , we plot the ROC curves as a plot of T P against F P in Fig. 2. The N → ∞ solution is denoted by blue points, and the scatter plots of the finite-size simulation with N = 3600 over 10 samples are indicated with magenta circles. This simulation is conducted using the built-in "lasso" function of MATLAB R . In the above figure, we mark the points obtained using the minimum of ǫ (1) LOO as red points, those obtained using Youden's index as green points, and those obtained using the minimum of ǫ (2) LOO as blue points. The best inference is achieved at the upper-most left point, (F P, T P ) = (0, 1), and better ROC curves are increasingly skewed to the upper-left direction. The black straight lines denote the F P = T P line and are given as a reference for observing the skewness. Fig.  2 demonstrates that the inference based on LASSO has a good performance. Yet, the points obtained using the minimum values of ǫ (1) LOO , red points, are located a little away from the "optimal" values obtained using Youden's index. Meanwhile, the minimums of ǫ (2) LOO are very close to Youden's optimal values, and in this sense, ǫ LOO is better than ǫ (1) LOO for determining λ. However, note that for obtaining the minimum of ǫ     of the parameters and confirmed that this always holds if the noise is sufficiently weak. An empirical prescription to overcome this is the so-called one-standard error rule, which chooses a larger value of λ (corresponding to a smaller F P ) than that by the minimum of ǫ (1) LOO , by using the error bar of the minimum data point. Hence, it is important to approximate not only the minimum value but also its error bar. Fortunately, our formulas eqs. (35,42) can also provide the error bar of ǫ (1) LOO , which will be demonstrated by a real-data application discussed in sec. 5.2.

Comparison with data on finite size systems
In this section, numerical simulations are presented to examine the validity of our analysis and to determine the finite-size effect. We also apply the proposed method to SuperNova DataBase provided by the Berkeley Supernova Ia program [31,32] and find that this method reproduces the obtained result [11] considerably faster than the conventional 10-fold CV.

Examination using artificial data
In this subsection, we numerically generate the observation matrix A, the true sparse signalx, and the noise ξ, matching the assumptions made in our analysis. In all the simulations here, we set α = 0.8, σ 2 x = 1, and σ 2 ξ = 0.001. Under this condition, once a set of A,x, and ξ is given, we solve eqs. (5,8) by using a versatile algorithm of convex optimization, yielding the RSSs and the LOOEs. We generate N s = 1000 different samples of the set of (A,x, ξ) and adopt the mean value in the samples as our estimate. Error bars are evaluated as standard deviations among the samples divided by √ N s − 1. In determining active sets after convex optimization, we need to introduce a certain threshold value for each signal component. Here, the threshold value is empirically chosen as 10 −6 .
The RSSs for N = 16, 32, · · · , 512 are given in Fig. 3 and compared with the analytical curve of N → ∞. We find that the finite-size effect is moderate for both ǫ 1 and ǫ 2 , but the behavior LOO . There are two approximation methods: One is based on eq. (35) in conjunction with eqs. (37,38), which is called Approximation 1 hereafter; the other follows eq. (42), and we call this method Approximation 2. They are identical in the limit N → ∞, but for finite N , there exists a deviation. Fig. 4 provides ǫ (1) LOO for N = 16, 32, · · · , 512 evaluated using both Approximations 1 and 2. The finite-size effect is strong and unstable for small N , particularly for Approximation 1. This is reasonable because Approximation 1 requires an inversion of the matrixÃ TÃ . If the number of active variables is close to the number of observations, which is the case for a small λ,Ã TÃ has a mode whose eigenvalue is very close to zero. This leads to the divergence of (Ã TÃ ) −1 , explaining the drastic change in the LOOEs at small values of λ for small sizes. This effect weakens with an increase in the system size, and for N ≥ 64, the numerical data agrees well with the analytical curve.
Thirdly, we conduct the LOO CV directly without using any approximation and compare the result with our analytical calculations. The LOO CV is computationally expensive, and we execute it for smaller sizes up to N = 256. The result is given in Fig. 5. We can see that our analytical curves (black solid curves) of ǫ  Overall, our analytical calculations agree well with the numerical simulations for moderate system sizes, and the approximation formulas (35,42) provide reliable estimates of ǫ      Here, we probe the cause of the failure in approximating ǫ (2) LOO based on eqs. (40,42), although the same approximation is applicable for ǫ (1) LOO . While deriving eqs. (35,40) and (42), we assumed that the set of active variables is stable against small perturbations and that the addition or deletion of a row of the observation matrix A just leads to such small perturbations. This assumption is examined here.
In the absence of the µth row of A, the corresponding fixed-point equations of the AMP become Let us call this system the µ-cavity system. The difference between eq. (78) and eq. (33) is only the term A µi a µ = O √ N −1 in eq. (78b). Hence, it is expected that the difference in variables between the full and the µ-cavity systems is small and can be scaled as O √ N −1 . Even if this assumption is correct, it is not trivial to compute the variables of the µ-cavity system. However, the discussion in sec. 3.2 implies that we estimate this difference as follows: Further, we assume that We have examined these two relations by numerically solving both eq. (33) and eq. (78) independently, and found that the first one is satisfied in a moderate region of λ at a certain accuracy, while the second one is incorrect in the entire range of interest of λ. In fact, it can be proved that the change of active sets in the LOO procedure is inevitable in any sparse algorithm [33]. This poses another question: Why is ǫ (1) LOO well approximated by eqs. (35,42)? The violation of eq. (80) implies that the active and inactive sets are different for the full and the µ-cavity systems. Some variables belong to the same sets on both the systems and are called "stable". The others change the belonging sets and are called "unstable". The behavior of the unstable variables is a crucial issue. The effective field h (1) of any unstable variable must satisfy the relation |h This implies that the coefficient x (1) of any unstable variable is zero or very small as Further, the number of unstable variables is estimated as N uns ≈ |∆h (1)\µ | × N P (h (1) ) = O √ N , where P (h (1) ) denotes the distribution of the effective field of the full system and is assumed to be O(1) around |h (1) | ≈ λ. Summarizing these scalings of N uns and the coefficient x (1) , we can estimate their contribution as follows: where UNS denotes the set of unstable variables. Note that (1)\µ i in the above equation. Hence, the contribution from the unstable variables vanishes in both the full and the µ-cavity systems in the calculation of the cavity residuals a (1) and a (1)\µ . As a result, our perturbative discussion assuming eq. (80) is validated to calculate macroscopic quantities such as ǫ LOO is not correctly evaluated by eqs. (40,42). Now, the coefficients of unstable variables, {x (2) i } i∈UNS , are not proportional to |h (1) | − λ and are of O(1), as seen in eq. (34). Thus, its contribution is and is not negligible, implying that the solution of the corresponding fixed-point equations is influenced by the unstable variables. Hence, our perturbative discussion does not work even in the calculation of macroscopic quantities.

Application to Type Ia Supernova data
Here, we apply the proposed method for evaluating ǫ LOO to the data from SuperNova DataBase provided by the Berkeley Supernova Ia program [31,32]. Recently, LASSO techniques have been used on a part of the data which is screened by a certain criterion, and a set of important variables known to be significant in explaining the Type Ia supernova data empirically has been reproduced [11]. In this study, the 10-fold CV, which is an alternative to the LOO CV when the number of variables is large and performing the LOO CV is computationally difficult, is used for determining the value of λ. We calculate ǫ (1) LOO by using the proposed method on these data and compare the result with that of the 10-fold CV. The system size parameters of the screened data are M = 78 and N = 276.
The left panel of Fig. 6 shows the plots of ǫ (1) LOO in Approximations 1 and 2, and the CV error by the 10-fold CV against log λ without the error bars. Clearly, the curves are very similar, and the minimum values of all the curves coincide. We also observe the quantitative similarity of not only the mean values but also the error bars in the right panel of the same figure. The error bars of the 10-fold CV are obtained using a Monte-Carlo resampling, and those of ǫ (1) LOO evaluated by the proposed method are given by the standard deviation among the M terms of eq. (35) or eq. (42), which is justified by a simple resampling argument using the multinomial distribution. Clearly, the largeness of the error bars is quantitatively comparable. Hence, the proposed method reproduces the 10-fold CV result at a very satisfactory level. By applying the one-standard error rule for all the three methods, we obtain df = 6 (df: Number of active variables), which agrees with an empirically validated model, as explained in [11]. The benefit of the proposed method is apparent in the computational time. The required time for computations in an experiment is 31.6 s for the 10-fold CV, 3.20 s for Approximation 1, and 2.85 s for Approximation 2, the last two of which include the computational time of one run of LASSO. Therefore, the advantage of the proposed method is clear, and the reducing factor is about 10. If we compare with the LOO CV instead of the 10-fold CV, the factor will be considerably larger. Hence, our approximation is applicable to realistic problems and is very efficient for data with a large dimensionality. Therefore, readers are strongly encouraged to use the presented method.
The observation matrix in the Type Ia supernova data is significantly different from a simple random matrix. Hence, the success of Approximation 2 in the case of these data conversely suggests that the relation (1 + i,j A µi A µj χ \µ ij ) 2 ≈ (α/(α − ρ)) 2 , which is used for deriving eq. (42), holds rather universally, as discussed at the end of sec. 3.2.1. As mentioned above, Approximation 2 has a relatively low computational time, and its result is stable compared to that of Approximation 1. These facts positively motivate us to find further theoretical evidence for the universality of the relation (1 + i,j A µi A µj χ \µ ij ) 2 ≈ (α/(α − ρ)) 2 . Now, finally, we report the behavior of ǫ (2) LOO on the Type Ia supernova data. The LOO CV according to its definition is conducted, and the result is shown in Fig. 7. The minimum value of ǫ (2) LOO is located at log λ ≈ −1, leading to df = 1. This serious reduction of df from the case of ǫ (1) LOO might have a certain meaning. In fact, in [11], an appropriate preprocessing of the same data shows the same serious reduction of df, which can be attributed to some inconvenient properties of the data such as collinearity and bad statistics. The possibility that ǫ (2) LOO can diminish the influence of these bad properties is suggested.
Unfortunately, as seen in the rugged uncontrolled behavior of ǫ LOO , the quality of the presented data is not sufficiently good to judge whether this possibility is plausible or not, although it is natural that df is reduced when using ǫ LOO requires considerably better statistics to exhibit a meaningful behavior than ǫ (1) LOO . The reason for this is as follows: When changing λ, each term in ǫ (2) LOO consists of several different piecewise constants, and they are not necessarily monotonic. This is because the inferred signal x (2)\µ (λ) shows a sudden change only at certain several discrete points of λ, where the set of active variables changes, and remains unchanged in-between the neighboring pairs of these discrete points. This is in contrast to x (1)\µ (λ), which changes continuously [6,34]. These discrete points change among different terms in ǫ (2) LOO . Hence, ǫ LOO consists of the sum of the piecewise constants with different jumping points and heights, leading to large error bars and uncontrolled behaviors of ǫ (2) LOO . Once this uncontrolled behavior of ǫ (2) LOO is tamed, we expect an optimal value of λ to be chosen by using ǫ (2) LOO without employing the ad hoc one-standard error rule [34,35,36]. New ideas are desired for taming. An idea based on bootstrapping can be a good candidate: Increasing the statistics of the present data can diminish the abovementioned discrete behavior. Problems of rather large sizes may not pose this peculiarity in the first place, since the statistics of the LOOE automatically improve with an increase in M . In less large but moderate-size problems, the k-fold CV for ǫ (2) LOO with a moderate value of k is worth trying. However, further consideration of ǫ (2) LOO is desired.

Conclusion
In this study, we examined the LOO CV as the determinator of the coefficient λ of the penalty term in LASSO. We investigated two types of CV errors by using the LOO CV, namely the LOOEs ǫ (1) LOO and ǫ (2) LOO corresponding to two different estimators, and for both the errors, we derived simple formulas that significantly reduce the computational cost in their evaluation. This result was derived by using the BP or cavity method and by a perturbative argument assuming that the number of observations was sufficiently large.
On the basis of this finding, we analytically evaluated the LOOEs by using the replica method when the observation matrix is a simple random matrix. This provided quantitative information about the LOOEs. Further, the locations of the minimums of the two LOOEs were found to be different, and thus, the chosen "optimal" values of λ are different in general. Both the optimal values were examined by using ROC curves, and the one obtained using the second LOOE ǫ (2) LOO was found to be preferable. However, a replica analysis clarified that our simple formulas are not useful for accurately approximating ǫ (2) LOO . We need further consideration to design an efficient algorithm for computing ǫ (2) LOO . The above analytical calculations were compared with numerical simulations on finite-size systems. For small system sizes, there exists a deviation, but for moderate and large system sizes, the agreement between the numerical data and the analytical result is fairly good, and thus, our formulas are validated. We also applied these formulas to real Type Ia supernova data, to find that the proposed method reproduces the known result at a very satisfactory level. The benefit of our formulas is their low computational cost, and the actual reducing factor in the computational time was about 10. Further, the computation of ǫ (2) LOO according to its definition was conducted on this data set, but it turned out to be difficult to obtain any meaningful result. Larger amounts of data are desired to treat ǫ (2) LOO . The proposed method requires the computation of a pre-factor ij A µi A µj χ \µ ij and it is in fact a non-trivial task. We had recourse to an analytical formula for this quantity, which can be directly validated in the case where the observation matrix is a simple random matrix but cannot be justified in general. Some of our numerical (unreported) observations support that the analytical formula holds for a wider ensemble of the observation matrix, but deeper theoretical evidence is strongly desired.
A Assessing the generating function Φ 2 For positive integers n and ν, the generating function Φ 2 (n, ν) can be expressed as follows: where Hereafter, we assume that the unspecified domain of integration denotes the integration over [−∞ : ∞] or [−i∞ : i∞]. We also assume that indices a, b run over 1, · · · , n, and α, β over 1, · · · , ν. The quantities d µa andd να consist of extensive sums of random variables {A µi } i , implying that the central limit theorem works and that d andd can be expressed by certain Gaussian variables. The mean is clearly zero, and the covariance becomes where To proceed with the calculations, we use a trick to perform a trace over r and x: rewriting the order parameters as integration variables and introducing delta functions that require order parameters to take the values defined in eqs. (88,90). This yields the following: where Ω = {Q 1 , q 1 , m 1 , Q 2 , q 2 , m 2 , Q c , q c , } and Tr We rewrite these delta functions by using the Fourier representations. In doing so, constant factors can be applied to the Fourier integration variables, and we choose convenient factors for later calculations. For example, the delta functions of Q 1 and q 1 are expressed as follows: where C 1 and c 1 denote the normalization factors but will be discarded hereafter because they do not contribute in the limit N → ∞. These operations provide the following: whereΩ = Q 1 ,q 1 ,m 1 ,Q 2 ,q 2 ,m 2 ,Q c ,q c , and where we set∆ c =Q c −q c . The average overx appears as a result of the law of large numbers. As noted in the main text, we consider the Bernoulli-Gaussian distribution with respect tox. Denoting its Gaussian part as P G (x) = exp(−x Cross-quadratic terms in f can be transformed into linear terms. First, they are transformed as follows: Note that |·| k 0 = |·| 0 (k > 0). Let us set X ≡ a r a and Y ≡ α |r 1 | 0 x α . The minimum number of auxiliary variables to break the quadratic terms is two, but here, we introduce three auxiliary variables to make the resultant formula interpretable. Accordingly, we have the following: DvDuDw e (v,u,w)·(aX,bY,cX+dY ) t = e 1 2 ((a 2 +c 2 )X 2 +(b 2 +d 2 )Y 2 +2cdXY ) = e 1 2 (q1( a ra) 2 +q 1 ( α |r 1 | 0 xα) 2 +2qc( n a=1 ra)( α |r 1 | 0 xα)) .

A.2 Derivation of f 2
A small calculation from φ 2 yields the following: We assume the following scalings: After lengthy but straightforward calculations, we obtain the following: (1 + χ 1 ) 2 M 1 − 2 and lim β→∞