Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the SCAD penalty. The contributions of this study are three-fold: We first present a theoretical analysis of a typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is given as an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator compared with $\ell_1$ in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in the spin-glass theory. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula efficiently computing the cross-validation error without actually conducting the cross-validation, which is also applicable to the non-i.i.d. design matrices. It is shown that this formula is only applicable to the unique solution region and tends to be unstable in the multiple solution region. We implement instability detection procedures, which allows the approximate formula to stand alone and resultantly enables us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations are conducted on simulated datasets to examine these results to verify the theoretical results consistency and the approximate formula efficiency. Another numerical experiment on a real-world dataset is conducted; its results are consistent with those of earlier studies using the $\ell_0$ formulation.

E-mail: 1 obuchi@c.titech.ac.jp, 2,3 ayaka@ism.ac.jp Abstract. We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold: We first present a theoretical analysis of a typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is given as an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator compared with ℓ 1 in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in the spin-glass theory, and associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula efficiently computing the cross-validation error without actually conducting the cross-validation, which is also applicable to the non-i.i.d. design matrices. It is shown that this formula is only applicable to the unique solution region and tends to be unstable in the multiple solution region. We implement instability detection procedures, which allows the approximate formula to stand alone and resultantly enables us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations are conducted on simulated datasets to examine these results to verify the consistency of the theoretical results and the efficiency of the approximate formula and nonconvexity annealing. The characteristic behaviour of the annealed solution in the multiple solution region is addressed. Another numerical experiment on a real-world dataset of Type Ia supernovae is conducted; its results are consistent with those of earlier studies using the ℓ 0 formulation. A MATLAB package of numerical codes implementing the estimation of the solution path using the annealing with respect to λ in conjunction with the approximate CV formula and the instability detection routine is distributed in [1].

Introduction
Variable selection problems ubiquitously appear in statistics and machine learning tasks. Although traditional statistical approaches to variable selection work well in principle [2], difficulties in computational efficiency and stability emerge owing to the largeness and high dimensionality of the datasets. To overcome this, the possibility of sparse estimation has been pursued for decades. Naive methods for sparse estimation yet require solving discrete optimisation problems, involving a serious computational difficulty, even in the simplest case of linear models [3]. Hence, certain relaxations or approximations are required for handling such large high-dimensional datasets.
A breakthrough was made with the least absolute shrinkage and selection operator (LASSO) [4]. Its basic idea is to relax the sparsity constraint by using ℓ 1 regularisation. The success of LASSO motivated the usage of ℓ 1 regularisation in many different contexts and models [5,6,7], leading to an ongoing innovation in signal and information processing [8,9,10].
Although LASSO has many attractive properties, the shrinkage introduced by the ℓ 1 regularisation results in a significant bias in regression coefficients. To solve this, some nonconvex penalties, such as the smoothly clipped absolute deviation (SCAD) penalty [11] and the minimax concave penalty (MCP) [12], have recently been proposed. Although the estimators under these regularisations have desirable properties such as unbiasedness and continuity [11], there exist some concerns about the stability and interpretability of the estimators because of the potential local minimums owing to the lack of convexity. Hence, investigations of typical performance of those estimators are desired.
Under this situation, one of the present authors recently analyzed the SCAD estimator performance in [13], using the replica method and the message passing technique [14,15,16]. It was shown that there exist two regions in the space of regularization parameters, and in one of them the estimator is uniquely and stably obtained. It was also shown that the estimator outperforms LASSO in the fit quality, and that the emergence of the two regions can be viewed as a phase transition involving replica symmetry breaking (RSB). The latter finding yields a nontrivial insight to the behaviour of local search algorithms, and it was demonstrated that the convergence limit of the coordinate descent (CD) algorithm is closely related to the RSB transition and that a sufficient condition of the convergence, derived in [17], is not tight.
The above-mentioned analysis was limited to the data compression context, in which only the fit quality to a given dataset was considered important. However, with regard to certain applications of sparse estimation, such as compressed sensing [18,19], the reconstruction performance of the true signal embedded in the data generation process is more important. The current study addresses this problem and conducts a quantitative analysis for the case where noise exists, while the noiseless limit is investigated in a separate study [20]. We provide phase diagrams with respect to regularization parameters derived using the replica method and discuss their implications to the reconstruction performance and the behaviour of local search algorithms. Moreover, we develop an approximate formula for efficiently computing the cross-validation (CV) error, which can be identified with the reconstruction error in our setting. The key results are summarised as follows: (i) In the replica symmetric (RS) phase, a unique solution is stably obtained also in the signal reconstruction context.
(ii) The global minimum of the CV error is (presumably always) obtained in the RS phase.
(iii) Our approximate formula efficiently estimates the CV error, without actually conducting CV.
These imply that we need not to care about the RSB phase as long as our purpose is to obtain the model best reconstructing the true signal, and in the RS phase we can benefit from the proposed approximate CV formula enabling an efficient estimation of the reconstruction error. Below, we show the theoretical results supporting these messages. The remaining of the paper is organised as follows: In the next section, our problem setting and an overview of the SCAD penalty are given; in sec. 3, the replica analysis result is shown without the derivation because the essential part is already given in [13,20], and phase diagrams and plots of relevant quantities are shown; in sec. 4, the approximate formula of the CV error is derived; in sec. 5, numerical experiments are carried out on both simulated and real-world datasets to check the accuracy of the replica result and the approximate formula. The last section concludes the paper.

Problem settings
Suppose a data vector y ∈ R M is generated by the following linear process with a design matrix A ∈ R M ×N and a signal vector x 0 ∈ R N : where ∆ is a noise vector, the component of which is assumed to be an independent and identically distributed (i.i.d.) variable from the normal distribution with zero mean and variance σ 2 ∆ , N (0, σ 2 ∆ ). We denote our dataset as D M = {y, A}. In the context of compressed sensing, the design matrix A represents the measurement process, and we try to infer x 0 given A and y. The inference is herein formulated as a regularised linear regression, and the concrete form of our estimator is given by: where J(x; η) = N i=1 J(x i ; η) is the regularisation inducing the estimator sparsity, and η is a set of regularisation parameters with a concrete form shown below. To quantify the fit quality of the estimatorx to the data y, we introduce: and call it the output mean squared error (MSE). We also introduce a MSE between the estimator and the true signal as: which is termed input MSE, and these characterise the goodness of fit of our estimator x.
The purpose of this study is to compute the typical behaviour of ǫ x and ǫ y to obtain insights into the estimator behaviour; meanwhile, some other relevant quantities are also evaluated. The analytical techniques for achieving this purpose are explained in sec. 3, with a more detailed description on x 0 and A.

SCAD regularisation
As a representative piecewise continuous nonconvex penalty, we investigate SCAD regularisation in this study. The parameter set consists of η = {λ, a} (a > 1), and the functional form is: An illustration of this form is given as the left panel of Fig. 1. In the limit a → ∞, Behaviour of the SCAD estimator (7) at a = 3, λ = 1, σ 2 w = 1; the diagonal dashed line represents the OLS estimator. (Right) Behaviour of the LASSO estimator at λ = 1 for comparison with the SCAD; a shrinkage bias is clearly seen for a large w.
the SCAD regularisation tends to be the ℓ 1 regularisation J(θ; {λ, a → ∞}) → λ|θ|, and correspondingly, the SCAD estimator converges to the LASSO one, allowing the comparison in a continuous manner. For later convenience, we term a the switching parameter, and λ the amplitude parameter.
To obtain an intuitive view for the SCAD estimator behaviour, we compute the one-dimensional case: The solution is given by: where The middle panel of Fig. 1 presents an illustration of the estimator at a = 3, λ = 1, σ 2 w = 1, which behaves as the LASSO estimator when λ(1 + σ −2 w ) ≥ |w| > λ, and as the ordinary least square (OLS) estimator when |w| > aλσ −2 w . In the region aλσ −2 w ≥ |w| > λ(1 + σ −2 w ), the estimator linearly transits between LASSO and OLS estimators.
The one-dimensional case estimator plays a key role in our analysis, because our original problem with high dimensionality is, eventually, reduced to an effective onedimensional problem in the limit N → ∞, termed decoupling principle in [21].

Macroscopic analysis
In this section, we provide the order parameters and their determining equations of state (EOS). Associated phase diagrams are shown, and their implications on the performance and the computational stability of the SCAD estimator are also discussed.
For proceeding with the analysis, the ensemble of A and x 0 is required to be fixed. We assume that A is a random matrix whose component is i.i.d. from N (0, M −1 ). The true signal x 0 is also assumed to be a random number drawn from the independent Bernoulli-Gaussian distribution: We note that the i.i.d. assumption on A is crucial for completing the computation, while the choice of the distribution of x 0 does not matter for the analytical tractability. We admit that this i.i.d. assumption on A is not necessarily realistic, but it provides a sufficiently nontrivial setup for our purpose. Although it is possible to extend the present analysis to certain other ensembles [22,23,24,25,26,27,28,29], we leave this as a future study. In the following discussion, we consider the so-called thermodynamic limit N → ∞, while keeping α ≡ M/N = O(1).

Outline of analysis
In order to avoid duplication with [13,20], an outline of the analysis is presented here, instead of the EOS derivation.
Our analysis starts from defining Hamiltonian H, partition function Z, and free energy density f as follows: As seen from (2), the minimiser or the ground state of H corresponds to our estimator and, hence, we are interested in the β → ∞ limit of the free energy density. The input and output MSEs can be computed from the free energy in this limit, by following a standard prescription. The free energy density becomes the primary object to be computed, and enjoys the self-averaging property, and the typical value thus converges to the averaged one , where E y,A [· · ·] denotes the average over y and A. Unfortunately, the average density E y,A [f (β|D M )] is not analytically tractable. To overcome this, we employ the following identity: However, the computation for general n ∈ R is still intractable; thus, we additionally assume that n is a positive integer, because E y,A [Z n (β|D M )] is analytically computable for n ∈ N. Then, using an analytically continuable expression of E y,A [Z n (β|D M )] from N to R, we evaluate lim n→0 E y,A [Z n (β|D M )] at the final step. These procedures are termed replica method.
The final expression of f (β) is given as an extremisation problem with respect to a number of parameters, called order parameters. The extremisation condition appears because of the limit N → ∞, and yields EOS determining the values of the order parameters. The explicit formulas are given below. It should be noted that the following analysis is conducted only under the RS assumption, although RSB occurs in some parameter regions. This is because the RS analysis is sufficient for the present purpose of obtaining insights on the stability of the SCAD estimator. Beyond this purpose, the RSB analysis will provide further quantitative information about the estimator when many local minimums exist, which will be an interesting future direction.

Order parameters, equations of state, and stability condition
Here, we summarise the order parameters and EOS. In the RS level, our system is characterised by the following three-order parameters: where the angular brackets, · · · , denote the average over the Boltzmann distribution P (x|β, D M ) = e −βH(x|D M ) /Z(β|D M ). m is the overlap with the true signal x 0 and is relevant to the reconstruction performance. Q and q both describe the powers (per element) of the estimator, but the latter takes into account the 'thermal' fluctuation that results from the introduction of β. These two quantities fall within the limit β → ∞, but their infinitesimal difference yields an important contribution: This is O(1), even in the limit β → ∞. Besides, we introduce the conjugate parameters of Q, χ, m asQ,χ,m, respectively, and denote their sets as Ω = {Q, χ, m} and Ω = {Q,χ,m}. The RS free-energy density in the limit β → ∞ takes the following extremisation problem, with respect to Ω andΩ: where: and · · · represents the average over σ, whose distribution is: The minimiser of (20) is the solution of the one-dimensional problem (7), with σ 2 w →Q −1 and w → h/Q, thus we can denote it as: The extremisation condition in (19) yields EOS as: where The SCAD regularisation divides the domain of definition into some analytic components, and {θ i } 3 i=1 are the corresponding boundary values for z for the integration Dz(· · ·). The parameterρ is the density of non-zero components in the estimate.
Using the solution of EOS, the input and output MSEs can be expressed as: , where S c denotes the complement set of S. These are expressed by using the solution of EOS as: where |x| 0 expresses ℓ 0 operator giving 0 if x = 0 and 1 otherwise. Following the standard analysis [13], we can derive the stability condition of the RS solution, called de Almeida-Thouless (AT) condition [30]. The derivation of our specific case is already given in [13] and we just quote the resultant expression: Apart from the AT condition, we also notice that the RS solution does not exist when the switching nonconvexity parameter a is small. This is because (20) tends to have no solution in the small a limit, leading to the following existence condition: These provide sufficient information for the following analyses.

Phase diagram
In this subsection, we show the phase diagrams in the λ-a plane for a wide range of parameters. We introduce three boundaries: The first one, derived from (33), is the AT line a AT (λ) below which the RS solution is unstable; the second, derived from (34), is the existence limit of the RS solution a RS (λ), below which the RS solution does not exist; and the third, a IMSE (λ) represents the minimum point of the input MSE ǫ x , when sweeping λ given a. For clarity, the variance of the non-zero components of x 0 is fixed as First, we compare the phase diagrams for different noise strengths σ 2 ∆ at α = 0.5 and ρ 0 = 0.2 in Fig. 2. We plot a AT , a RS , and a IMSE by blue, red, and green lines, respectively. The green diamond represents the location of the global minimum of ǫ x under the RS assumption. A useful finding concerning this is that the location is above the AT line, which is always the case as far as we have examined, and some additional evidences are later given in Figs. 3 and 4. In the right panel of Fig. 2, the green diamond is not shown, because the input MSE continuously decreases as a grows, implying that the global minimum of ǫ x is obtained at the LASSO limit a → ∞. These imply that the best reconstruction performance of the true signal is always obtained in the RS phase, Phase diagrams in the λ-a plane for different noise strengths (σ 2 ∆ = 10 −4 (left), 10 −2 (middle), 1 (right)) at α = 0.5 and ρ 0 = 0.2. The blue, green, and red lines denote a AT , a IMSE , and a RS , respectively. The green diamond represents the location of the global minimum of the input MSE ǫ x . For the right panel of the largest noise case σ 2 ∆ = 1, there seems to be no global minimum of ǫ x for finite a (located in the LASSO limit a → ∞).
which is one of main claims of this study. Admittedly, there is a possibility that the true global minimum exists in the RSB phase, and the green diamond just represents a local minimum. Our present analysis does not exclude this possibility. To clarify this point, further quantitative analysis in the RSB framework is required, but this is beyond the scope of this study.
Another interesting observation in Fig. 2 is the re-entrant phase transition concerning λ in relatively small a regions for the weak noise cases (left and middle panels). For example at a = 2.8 in the left panel, when decreasing λ from a large enough value, we first go across the rightmost branch of a AT around λ ≈ 1 and enter into the RSB phase from the RS phase; further decreasing λ we meet the middle branch of a AT around λ ≈ 0.1 and thus re-enter into the RS phase; still decreasing λ we hit the leftmost branch of a AT around λ ≈ 0.01 and we are eventually in the RSB phase. Although the physical reason of the emergence of the re-entrance is not clear, it seems to only exist in the weak noise region. We also note that the AT line, a AT , is always located above a RS . The solution vanishment in the low λ region is thus an artefact of the RS assumption, and the corresponding parameter regions should be described by the RSB solution.
Next, we check the ρ 0 dependence of the phase structures. Phase diagrams at α = 0.5 and a moderate noise level σ 2 ∆ = 0.1 are shown in Fig. 3. Although the basic structure does not change from Fig. 2, the re-entrant transitions in the weak noise cases disappear. As ρ 0 increases, the minimum location of ǫ x increases along the a-axis, and for the large ρ 0 (right panel) the green diamond tends to disappear at finite values of a, implying the LASSO limit yields the minimum of ǫ x as the strong noise case.
The last phase diagrams are given for checking the α dependence. Phase diagrams for α = 0.3, 0.8, 1.5 at ρ 0 = 0.2 and σ 2 ∆ = 0.1 are shown in Fig. 4. As seen in the left panel, if the value of α is close to that of ρ 0 , the larger a tends to give smaller values of the input MSE, as in the right panel of Fig. 3. In contrast to the other diagrams   of the underdetermined case (α < 1), the right panel of the α = 1.5 case shows a particular behaviour, as both the AT line and the RS existence limit tend to converge to certain finite values of a in the λ → 0 limit. Hence, the whole λ region becomes RS at sufficiently large but finite a values.
In all the phase diagrams shown above, the minimum value of a is fixed to be 2. This is because the RS solution cannot describe the region a < 2. There is a simple reason for this. According to the argument of the approximate message passing technique [13,20], the effective one-dimensional problem (20) corresponds to the following marginal distribution: where the minimiser of (20) corresponds to the location of x i , at which the measure concentrates in the limit β → ∞. The factor M µ=1 corresponds toQ, while χ µ is a non-negative quantity related to χ in the RS solution. This means that, under the assumption of A µi ∼ N (0, 1/M),Q is bounded as: Combining this with (34), we find that the condition a ≥ 2 is necessary for the existence of the RS solution. The merging behaviour of the three lines to the a = 2 line, as λ grows in the phase diagrams, well matches to this condition. Although a = 2 is a known critical value [31,32], the above argument provides another perspective from a different viewpoint. We also note that this non-existence of the RS solution does not mean the non-existence of the SCAD estimators. Actually, numerical experiments easily show that the estimators take non-trivial values in the region a < 2, and they tend to show strong multiplicity and dependency on the initial condition. To analyse the behaviour of those estimators, we need to consider the RSB solution, but it is beyond the present purpose as already declared.

Receiver operating characteristic curve
To characterise the reconstruction performance of the true signal's support, we employ the so-called receiver operating characteristic (ROC) curve. The ROC curve is a plot of T P (31) against F P (32). The best ROC curve goes through the point (T P, F P ) = (1, 0). Accordingly, to quantify 'optimality' of the points on a ROC curve, we use the following quantity: Thus, the smallest value of R defines the 'optimal' point of the ROC curve. This easyto-use quantity is commonly applied as a criterion, followed here.  not monotonic and tend to change in the small λ region sharply, but the locations of the minimums of R, depicted by filled magenta circles, tend to be in the monotonic region. To compare the values of the R minimums, we plot them against a in the right panel. The global minimum is located at a ≈ 10, which matches to the minimum location of ǫ x , depicted by the green diamond in the middle panel of Fig. 3. This suggests a possibility that minimising ǫ x also approximately minimises the error in the variable selection.
To scrutinise the possibility, we show ROC curves when adaptively changing the nonconvexity parameters along the a IMSE (λ) line in the λ-a phase diagrams: The upper panels of Fig. 6 are the ROC curves for (α, panels. These figures show that the minimums of ǫ x are actually close to that of R. As far as we have searched, similar tendency holds in other parameters. These fully support the above-mentioned possibility. Such a nice property is absent in LASSO [33], and the minimum point of ǫ x in LASSO tends to give a solution with rather large FP ‡. Hence in the reconstruction performance of the true model, the SCAD estimator is superior to the LASSO one.
Readers may doubt the effectiveness of this statement, because the input MSE ǫ x cannot be computed for realistic settings with unknown true signals. As explained ‡ In [33], essentially the same analysis is done for LASSO, but a wrong terminology is used. The quantity R is termed Youden's index in that study, but it is contradictory to the conventional terminology. Youden's index is another similar but different criterion for choosing an 'optimal' point on ROC curve. later, the input MSE has a simple linear relation to the generalisation error estimated by CV, when rows of the design matrix are uncorrelated with each other. Hence, we may minimise the CV error instead of the input MSE.
Figs. [2][3][4] show that in some parameter regions there seems to be no global minimum of the input MSE at finite a, as for the strong noise case of (α, ρ 0 , σ 2 ∆ ) = (0.5, 0.2, 1) in Fig. 2 and the dense signal case (α, ρ 0 , σ 2 ∆ ) = (0.5, 0.4, 0.1) in Fig. 3. To examine those cases, we plot relevant quantities when changing a and λ again along the a IMSE (λ) line in Fig. 7. The left panels show the plots of T P and F P against a, the middle panels display the plots of R and ǫ x against a, and the right panels give the associated ROC curves. All the quantities of T P, F P, R, and ǫ x show monotonic behaviours with respect The associated ROC curves. The upper row is for the strong noise case of (α, ρ 0 , σ 2 ∆ ) = (0.5, 0.2, 1) corresponding to the right panel of Fig. 2, while the lower row is for the dense signal case (α, ρ 0 , σ 2 ∆ ) = (0.5, 0.4, 0.1) corresponding to the right panel of Fig. 3. In both the cases, all the quantities of T P, F P, R, and ǫ x behave monotonically with respect to a, and seem to converge to finite values in the LASSO limit a → ∞.
to a, and seem to converge to finite values in the LASSO limit a → ∞. The minimums of R and ǫ x would be thus obtained by LASSO, with the optimised λ. These observations imply that LASSO is sufficient for difficult cases with strong noises or dense signals. This also implies that it is difficult to determine a good value of a to find the least ǫ x solution given a dataset prior to actual analyses, because it strongly depends on the noise strength or the signal density.

Approximate formula for cross-validation
In this section, we derive an approximate formula for the leave-one-out (LOO) CV error. If the dataset size M is large enough, the difference between the estimators of the full and LOO datasets is considered to be small, and it is expected that those two estimators can be connected in a perturbative manner. We concretise this idea below.
The estimator without the µth data in (2) is, hereafter, termed µth LOO estimator, and the explicit formula is given by: The LOO CV error (LOOE) is accordingly defined as: where a ⊤ µ = (A µ1 , · · · , A µN ) is the µth row vector of A. The LOOE is an estimator for the generalisation error or extra-sample error, defined as: where {y new , a new } represents a new data sample, and P (y new , a new ) denotes its distribution. In our setting, the distribution corresponds to the i.i.d. process described around (1), and it is analytically shown that ǫ g has a direct connection to ǫ x as: Hence, we can estimate the input MSE from the LOOE. Note that the sufficient condition for (41) is that both the noise components and the rows of the design matrix are zero-mean and uncorrelated; correlations in the signal vector x 0 may exist because they do not affect (41).
Owing to sparse priors, the variables in the estimator are separated in two types. Some variables are set to zero and the others take non-zero values. We call the former inactive variables and the latter active variables. The index set of the inactive variables, or inactive set, is denoted by S I = {i|x i = 0}, while one of the active variables, or active set, is S A = {i|x i = 0}. The active (inactive) components of a vector x are formally expressed as x S A (x S I ). For any matrix X, we use double subscripts in the same manner and introduce the symbol * meaning all the components in the respective dimension. For example, for an N × N matrix X, X S A S I and X * S I denote X's sub-matrices having row components of S A and all, respectively, while their column components are commonly of S I .

Derivation
The basic assumption to derive the approximate formula is that the active set is 'common' between the full and LOO estimators. Although this assumption is literally not true, we numerically confirmed that this approximately holds in the RS region. In other words, the change of the active set is small enough compared to the size of the active set itself, when considering the LOO operation under the situation with large N and M. Moreover, in the LASSO case, it has been shown that the contribution of the active set change vanishes in a limit N, M → ∞, keeping α = M/N = O(1) [33]. It is expected that the same holds in the present problem. Hence, we adopt this assumption in the following derivation. We also note that this assumption is not applicable to the RSB region.
Once assuming the active set is known and common between the full and LOO systems, we can easily get the determining equations for the coefficients in the active set, by differentiating the cost function with respect to x S A , yielding: for the full system and: for the LOO system. Let us denote d =x −x \µ and expand (43) with respect to d up to the first order. Erasing some terms using (42), and solving the remaining expression with respect to d, we obtain an equation of d S A as: Using (44), we can connect the residuals of the LOO and full systems as: Using the relation A \µ ⊤ A \µ = A ⊤ A − a µ a ⊤ µ and the Woodbury matrix inversion formula, we finally get where The righthand side of (46) can be computed only from the full solution, enabling an approximate evaluation of the LOOE, without literally conducting CV. The error bar can be put as the standard deviation among all the terms in (46) divided by √ M §. This is convincing because each term of (46) gives an independent estimator to the generalisation error (40) and hence its error bar can be given as the standard error. Numerical experiments below show that this definition gives a reasonable error bar.
In the case of LASSO, the Hessian of the penalty term is identically zero, ∂ 2 J LASSO = 0, meaning that (46) comes back to the 'approximation 1' in [33]. For SCAD, the Hessian takes the following form: where I(statement) denotes the indicator function, giving 1 if the statement is true and 0 otherwise.

Numerical experiments and numerical codes
Here we present numerical experiments. To obtain the SCAD estimator, we use the CD algorithm because it is common and stable. We implement this using C language, while the approximate CV formula is implemented as a raw code in MATLAB R . Hence, it is not necessarily fair to compare the computational time in the literal and approximate CVs, which are computed by (39) and (46) respectively, conducted below as a part of experiments. However, even in this comparison there is a meaningful difference in the computational time. When showing the computational time, we fix our experimental environment which uses a single CPU of 3.3 GHz Intel Core i7. In a single step of the CD algorithm, we update all the components of x in a random order. To judge the convergence of the CD algorithm, we monitor the difference between the estimatex (t) at the step t and the previous onex (t−1) . If all the component- i −x (t−1) i |} i are smaller than a threshold value δ, then the algorithm stops; otherwise it continues. We set the threshold value as δ = 10 −10 in all the experiments below.

Simulated dataset
In this subsection, we conduct experiments using simulated datasets. The main purpose is to confirm analytical predictions and to examine the accuracy of the approximate formula. Our simulated datasets are generated by the process described around (1) and (10). The signal power is set to be σ 2 x = 1/ρ 0 , matching with sec. 3.3.  bars are given as the standard error among the samples. The vertical dashed line represents the AT instability point below which the RS solution is unstable. Focusing only on the RS region, we can find that the finite-size effect is quite weak, and the numerical results show an excellent agreement with the analytical ones, justifying our analytical solutions. In this experiment, even below the AT point, the numerical results show a strong regularity. This is because the solutions are obtained by gradually changing λ from large to small values, and hence these solutions below the AT point are, in some sense, continuously connected to the ones above the AT point. We term this scheme λ annealing, which can be considered as a part of the nonconvexity control proposed in [20]. We warn that the solution path obtained by the λ annealing is very atypical below the AT point, as implied in sec. 5.1.2.

Consistency check of the replica solution
As another check of the RS solution's consistency, we also draw ROC curves by numerical experiments. In Fig. 9, we give the ROC curves along the a IMSE line for numerical result is displayed as the scatter plots (orange cross points) of T P and F P , for the experiments of 10 different samples at N = 1000. The numerical plots show a fairly good agreement with the analytical curve, which again justifies our analytical solutions.

Accuracy of the approximate CV formula
To check the accuracy of the approximate CV formula, in Fig. 10 we compare the CV errors between the literal (by (39)) and approximate (by (46)) CVs for two specific samples of the system size N = 100. The other parameters are (α, ρ 0 , σ 2 ∆ , a) = (0.5, 0.2, 0.1, 3) and (α, ρ 0 , σ 2 ∆ , a) = (1.5, 0.2, 0.1, 4), which correspond to Fig. 8. Here, all the results are obtained using the λ annealing and they show regular behaviours, even below the AT point. In all the cases, the approximate result reproduces well the literal one up to the AT point, even for the error bars given by the way explained at sec. 4.1. The uncontrolled behaviour of the approximate formula below the AT point is owing to the singular behaviour in in (44). This is natural because this factor is nothing but the susceptibility, which is known to involve diverging modes when the AT instability occurs [14]. These considerations mean that our approximate formula is only applicable above the AT point or in the RS phase. Can we detect the instability point only from the approximate CV result without referring to the replica computation? Fig. 10 speaks for this, because the approximate CV error tends to show uncontrolled behaviours at and below the AT point. As a trial, assuming a combination use with the λ annealing, we examine the following procedures to detect the uncontrolled behaviours: (i) Detect 'irregular' datapoints by locally comparing each datapoint with neighbouring points along the λ path (here datapoints mean the approximate CV result, red circles in Fig. 10).
(ii) Find the maximum value of λ whose corresponding datapoint is irregular. Regard all the λ region below it as 'instability region'.
To obtain a concrete result, we need to implement the first step (i) as an algorithm. The actual implementation is: (i)-1 If the CV error difference between the irregular point candidate and the compared datapoint is larger enough in reference to the error bar of the compared point, then the candidate is regarded as "irregular".
By these procedures, we can separate all the parameter regions into two parts: Stable and unstable regions corresponding to the RS and RSB phases, respectively. By employing this, in Fig. 11 we draw 'phase diagrams' of two specific samples, used also in Fig. 10. This figure shows that the boundary between the white and black  Figure 11. 'Phase diagrams' drawn for two specific samples used in Fig. 10 using the instability detection procedures described in the main text. The white and black regions represent the stable and unstable regions which correspond to the RS and RSB phases. The blue thick curve denotes the AT line a AT computed by the replica method, while the green curve a CVE shows the λ location of the approximate CV error minimum given a in the stable region which is supposedly related to a IMSE .
regions behaves similarly to the AT line a AT (λ), although there is a gap supposedly owing to the sample fluctuation and the finite-size effect. As a reference, we also compute the minimum location of the approximate CV error with respect to λ given a, defining a CVE (λ), which is denoted by the green curve in Fig. 11. According to (41), this corresponds to a IMSE appearing in the phase diagrams in sec. 3.3. Note that these procedures are somewhat overcautious, and can miss some stable regions possibly existing in the small λ region. As typically seen in the left panel of Fig. 2, the reentrant transition can emerge in the weak noise case but the present procedures cannot detect this re-entrancy, because these procedures detect the first RS-RSB transition corresponding to the rightmost branch of a AT and all the region below this first transition point is regarded as 'instability region'. However, for practical use, it is more important to avoid giving wrong estimates by our approximate formula. Hence, we do not aim to improve the above instability detection procedures in this study. Apart from the re-entrancy, it is worthwhile to investigate the cause of the gap between the AT line and the instability points detected by our procedures. To this end, we compute the boundary value between the black and white regions in Fig. 11 given a, λ c (a), for many samples and different system sizes. The results at a = 5 and a = 10 are shown in Fig. 12. The left panels provide the histograms of λ c from N samp = 100 samples for different sizes N = 100, 200, 400, 800, 1600 discriminated by different colors. As the system size grows, the width of the histogram shrinks and the mode value tends to approach the λ value at the AT point. Here the number of bins N bin for the histogram is determined by the so-called Sturges rule [34] as N bin = ⌈1 + log 2 N samp ⌉, enabling us to define the mode value without ambiguity. To quantify the convergence behavior of To unambiguously define the mode value, we set the number of bins N bin by Sturges rule as N bin = ⌈1 + log 2 N samp ⌉.
As the system size grows, the width of the histogram shrinks and the mode value approaches the AT point.
the mode value, we plot the mode against the inverse system size 1/N in the right panels. Here the mode value shows a clear tendency of approaching to the AT value as N grows. This indicates that the gap is actually due to the sample fluctuation and the finite-size effect, and also implies that our instability detection procedures are reasonably connected to the AT instability.
The AT instability is known to be connected to the emergence of many local minimums [14]. To directly check this, we conduct the literal CV without the λ annealing. For each point of λ, the estimator is computed from ten different randomly initialized x, each component of which is i.i.d. from N (0, 1), by the CD algorithm. In Fig. 13, the resultant CV errors are given as scatter plots in combination with the CV error using the annealing. The experimental setup of each panel is again identical to the corresponding one in Fig. 10. This figure gives a clear evidence of the multiple solutions  Figure 13. Comparison of the literal CV errors with and without the λ annealing in the same experimental condition as the corresponding panel of Fig. 10. The result without the annealing is shown as scatter plots (magenta circles) for ten different random initial conditions and it exhibits visible differences from the annealing result (blue asterisks, identical result to Fig. 10) below the AT point, while no difference exists sufficiently above the AT point. For the lower panels, the region around the AT point is magnified because the difference is small, although it exists.
below the AT point. Fig. 13 also implies that the solution obtained by the λ annealing is rather atypical: Solutions obtained from random initial conditions tend to give rather different values of CV error from the annealed solution. To give a better theoretical background to this statement, we have to construct the full-step RSB solution and to figure out the characterisation of the annealed solution in the ensemble of all the local minimums. This is beyond the present purpose but will be an interesting future work.
Our present attitude to the multiplicity of the solutions is to avoid it. This is reasonable because the global minimum of the generalisation error is in the RS region without the multiplicity, as clarified by our analytical computation. Once accepting this attitude, we can use the proposed approximate formula efficiently estimating the generalisation error and, fortunately, the formula also enables us to avoid the multiple solution region by using the above instability detection procedures. This is the main outcome and contribution of this study.
Before closing this subsection, we check the computational time and the approximate accuracy of the proposed formula more quantitatively. Here we quantify the error difference between the literal and approximate CVs by a normalised MSE defined as: where ǫ CV,approx. and ǫ CV,literal are the CV errors evaluated by the approximate and literal CV procedures, respectively. According to the derivation of the formula in sec. 4, the accuracy is considered to be better as N and M increase. Thus, we plot the normalised MSE against N as the left panel of Fig. 14 normalised MSE quite small for large sizes of N ≥ 200, although at the smallest size N = 50 there is a non-negligible difference. This difference is dominated by a few percent of samples giving accidentally large values of ǫ CV,approx. . The probability of the accidents seems to become smaller rapidly as the system size grows. In the middle panel, the same plot in the double log scale for small sizes N ≤ 1600 is given, showing a clear decay of the normalized MSE in the scale N −2 as the system size grows. This is naturally understood from the scaling argument presented in sec. 5.1.1 in [33]. The corresponding computational time of the CD algorithm convergence and the approximate formula are given in the right panel. The approximate formula requires to take the inverse of the Hessian, leading to the computational cost of O(|S A | 3 ), which is scaled as the third order polynomial of N if |S A | = O(N). This computational cost can be more expensive than the optimisation cost by the CD algorithm in the large N limit, because the total computational time of the CD algorithm is considered to be scaled as O(N 2 ), under the assumption such that the convergence of the CD algorithm takes place in constant computational steps independent of the system size N, although the O(N 2 ) behaviour is hard to see in Fig. 14. Despite this inconvenience in the limiting case, Fig. 14 shows that the computational time of the approximate formula is much smaller than the CD algorithm convergence, in all the investigated range of the system size. We note that the computational time of the CD algorithm shown in Fig. 14 is just for one-time optimisation, and hence, for conducting the literal k-fold CV, the required computational time becomes approximately k times larger than that. Overall, although there is no superiority in the large N limit, our approximate formula practically works very efficiently in a wide range of system sizes.

A real-world dataset: Type Ia supernovae
Here we apply the proposed approximate method to a dataset of Type Ia supernovae. Our dataset is a part of the data from [35,36], which is screened by a certain criterion [37]. This dataset was treated by a number of sparse estimation techniques recently, and a set of important variables, which is known to be empirically significant, has been reproduced [33,37,38,39]. In those studies, the LASSO and ℓ 0 cases are treated, and the CV is employed for hyperparameter estimation. We reanalyse this dataset by using the SCAD penalty and compute the CV error by using the approximate formula. The parameters of the screened data are M = 78 and N = 276, and an appropriate standardisation is employed as pre-processing.
Again, we use the λ annealing to obtain the SCAD estimators for this dataset, and the CV error is computed by our approximate formula. The instability region is detected by the procedures explained in sec. 5.1.2. As Fig. 11, the instability detection gives a phase diagram which is in Fig. 15. The overall shape of this phase diagram is similar to the ones in sec. 3.3 or Fig. 11, supporting the practical relevance of our results so far. To directly check the approximation accuracy, we also conducted the literal CV at a number of values of a. The results for a = 4 and 50 are given in Fig. 16. The approximate error well matches to the literal one up to the instability point, determined by the procedures explained in sec. 5.1.2, which justifies our instability detection procedures. For the left . λ-a phase diagram of the Type Ia supernovae dataset from [37]. The right panel is the magnified view of the left one in the small a region. The black region represents the instability region for which the approximate formula cannot be applied, while the white one is the stable region in which the approximate formula gives a reliable estimate. The minimum point of the CV error in the stable region is given by a CVE (λ), depicted by the green line. panel of a = 4, however, even below the instability point, there exists a region in which two CV errors agree well. This implies the presence of re-entrant transition, and it is probably related to a protruding black region around a ∈ (3, 5) in the right panel of Fig.  15. As declared in sec. 5.1.2, we do not try to detect the re-entrancy in the present study, but there must be some ways. For example, the annealing with respect to a instead of λ would be able to identify the re-entrancy with respect to λ. We found that this strategy can actually detect the re-entrancy, but the strategy itself is far from perfect. There are some reasons for this. One reason is that the switching parameter a has no upper bound in contrast to λ (λ has an effective upper bound as explained in sec. 5.3) and hence the initialization becomes nontrivial. Another reason is that some instability "islands" seem to exist at unexpected regions on the parameter space for this specific dataset: Some compact parameter regions exhibiting the instability seem to be able to exist, in contrast to the theoretically derived phase diagrams in sec. 3.3, and hence isolating the instability regions becomes nontrivial even if the annealing with respect to a is correctly performed. Due to these difficulties, we leave further exploration of better ways of nonconvexity control as a future work.
To extract relevant values of the parameters, we plot the approximate CV error and the number of non-zero components K = ||x|| 0 along the a CVE line in Fig. 17. Here, some outliers exhibiting extraordinary small CV errors are omitted. At the CV error minimum, the solution with K = 10 is obtained, which is comparable with K = 9 of the LASSO solution at the minimum CV error [33,37]. In the case of LASSO, it is common to select a sparser solution than the one at the CV error minimum according to the one-standard error rule [10,40]. Although it is unclear if the application of this rule to the SCAD estimators is appropriate or not, we here try to apply to our case. As a result, we have the solution with K = 3 obtained at a = 2.2 or 2.3 as seen in Fig.  17. We globally examined all the datapoints within the one-standard error in the stable phase, and confirmed that the K = 3 solution is the sparsest. This sparsest solution consists of variables whose IDs are 1, 2 and 233. This is accurately matching to the result of [38,39,41], in which the ℓ 0 formalism is treated, while the LASSO estimator tends to give a denser solution with K = 6 even under the one-standard error rule [33,37]. These demonstrate the effectiveness of the SCAD estimator, and the presented analysis and approximate formula resolve its disadvantages of the multiplicity of solutions and the computational cost in hyperparameter estimation. The effect of the one-standard error rule on the SCAD estimator seems to be also good, though further exemplifications would be needed.

Numerical codes
In [1], a MATLAB package of numerical codes implementing the estimation of the solution path using the λ annealing in conjunction with the approximate CV formula is distributed; the optimization is performed by the CD algorithm as the experiments so far. In the package three regularizations, LASSO, SCAD, and MCP, are treated in a unified manner. All the parameters are tunable in the codes, but the minimally required quantities to run the codes are the data vector y, the design matrix A , and the switching parameter a. In the default setting, the L = 100 values of λ are chosen as to be a descending order λ 1 > λ 2 > · · · > λ L , and the largest is set to be λ 1 = ⌈max 1≤j≤N |a ⊤ j y| ⌉ where a j is the jth column vector of A, because only the trivial solutionx = 0 exists for λ > max 1≤j≤N |a ⊤ j y| ; the smallest is set to be λ L = ǫλ 1 with ǫ = 0.01 and the intermediate values are given to interpolate these two values by the geometric progression with a constant rate. This way follows that of a commonly-used package glmnet [10,42]. The λ annealing is basically the same as warm starts explained in [10], but it has a stronger meaning in the nonconvex penalties because it inevitably picks out a certain solution path as exemplified in the numerical experiments so far. On each point of λ k , the CD algorithm finds the solutionx(λ k ) from the initial condition x(λ k−1 ) (for k = 1 the initial condition is the zero vector), and hencex(λ k ) andx(λ k−1 ) are expected to be close each other. In the default setting, after obtaining the whole solution path over {λ k } L k=1 , the approximate CV formula is subsequently applied and it is followed by the instability detection routine, yielding the approximate values of CV error and its reliable region. In the package, demonstration codes are also included and some experiments in sec. 5.1 can be easily re-obtained; readers who are interested in the experiments are thus encouraged to try to use them. The details of usage are more explained in [1].

Conclusion
In this study, using the replica method, we analysed the macroscopic properties of the SCAD estimator in the context of the signal reconstruction in the presence of noise, under the assumption that the design matrix is the i.i.d. random matrix. We derived the phase diagrams involving the RSB phase, and showed the superiority of the SCAD estimator to the LASSO one based on ROC curves. We also provided an analytical evidence that the global minimum of the input MSE or the generalisation error is located in the RS phase. Furthermore, we derived an approximate formula for the CV error, although it is applicable only for the RS phase. We implemented procedures detecting In the package, the design matrix is denoted as X and the regression coefficients are given as β, following the statistics convention.
the AT instability or the approximation instability, enabling to clarify the applicable limit of the approximate formula and making the formula stand-alone.
To examine the analytical results, numerical experiments on simulated datasets and a real-world dataset of Type Ia supernovae were conducted. On the simulated datasets, the replica prediction was well reproduced. The accuracy and the computational time of the approximate CV formula were examined, and its effectiveness was demonstrated in a wide range of the system size. For the real-world dataset, the application of the SCAD penalty reproduced the variables known to be empirically important. By using the approximate formula, we could globally search the parameters efficiently, and find that the SCAD estimator can provide a very sparse solution giving a reasonable value of the CV error. This solution is matching to the one of the earlier studies using the ℓ 0 formulation [38,39,41], and cannot be found by LASSO. These experiments demonstrate the effectiveness of the SCAD estimator, and the presented analysis and approximate formula resolve its disadvantages of the multiplicity of solutions and the computational cost in hyperparameter estimation.
As an efficient strategy to obtain a solution path, we proposed nonconvexity annealing as a part of nonconvexity control proposed in [20], and especially focused on the usage of the annealing with respect to λ, termed λ annealing in this paper. It was shown that this strategy works well also in combination with our approximate CV formula, but it further raised up a question related to RSB. In the RSB phase exhibiting the multiplicity of solutions, what solution is obtained by the annealing? Our numerical experiments showed that the annealed solution tends to give a smaller CV error compared to the solutions computed from random initial conditions, and in this sense the annealing is a nice strategy even in the multiple solution region. A similar observation was obtained in an inference in Gaussian mixture model [43,44]. To make a more accurate and quantitative analysis about these findings, it is needed to construct the full-step RSB solution and to figure out the characterisation of the annealed solution in the ensemble of all the solutions. This will be an interesting future work.
The present instability detection procedures for the approximate CV formula are rather ad hoc and have some ambiguity, especially in specifying irregular datapoints along the solution path with respect to λ. This ambiguity is related to which points of λ should be sampled when computing the solution path. In the case of LASSO, the change points of active set, usually called knots, can be efficiently computed [45], which provides a clear criterion to the above ambiguity problem. It is expected that a similar technique computing knots for SCAD will be useful for improving the instability detection procedures.
As a final remark, we mention about the MCP penalty defined by: where η = {λ, a}. If we use this instead of the SCAD penalty, the effective one-dimensional estimator, (26) in the SCAD case, is replaced as: x * (h;Q −1 ) = V MCP (h;Q −1 , η)S MCP (h;Q −1 , η), where S MCP (x; σ 2 , η) =      x − sgn(x)λ for aλσ −2 ≥ |x| > λ x for |x| > aλσ −2 0 otherwise , Replacing x * in eqs. (27a)-(27c) by (51), we can get EOS for the MCP penalty, and the AT condition (33) can be replaced by the same way. Corresponding to (34), the RS existence limit of the MCP case is also given as Using these replacements, it is easy to obtain the result for the MCP case. As far as we searched, the MCP result is qualitatively similar to the SCAD one. For illustration of this, we give some phase diagrams, ǫ x plots, and plots of literal CV errors with and without λ annealing in Fig. 18. We see qualitatively similar results to the SCAD case: The re-entrancy for the weak noise region; the no global minimum of the input MSE in the RS phase at finite a for the strong noise or dense signal cases; the accurate accordance between the RS and numerical results above the AT point; the solution multiplicity below the AT point. Although there can be a difference between the SCAD and MCP penalties in a quantitative level as reported in [17], such a comparative study requires more detailed quantitative analyses and we also leave it as a future work. Note that the lower existence limit of the RS phase of the MCP case is given as a = 1, which is derived from (36) and (54), and hence the RS stable region tends to be wider than the SCAD case. However, the direct comparison of two parameter spaces is not necessarily meaningful, and another systematic way of comparison is desired.