Evaluation of Generalized Degrees of Freedom for Sparse Estimation by Replica Method

We develop a method to evaluate the generalized degrees of freedom (GDF), which is a key quantity of a model selection criterion, for linear regression with sparse regularization. Using the replica method, GDF is expressed in terms of the variables that characterize the saddle point of the free energy, without depending on the form of the regularization. Within the framework of replica symmetric (RS) analysis, GDF acquires a physical meaning as the effective density of non-zero components. The validity of our method in the RS phase is supported by the consistency of our results with previous mathematical results. The analytical results in the RS phase are numerically reproduced by the belief propagation algorithm.


Introduction
Statistical modelling plays a key role in extracting the structures of a system that may be hidden behind observed data and using them for prediction or control. A statistical model approximates the true generative process of the data, which is generally expressed by a probability distribution. Although it is necessary to adopt an appropriate statistical model, this will depend on the purpose of the modelling, and the definition of appropriateness is not unique. Akaike proposed an information criterion for model selection, where the appropriate model is defined using Kullback-Leibler divergence [1]. This criterion validates the relative effectiveness of the model under consideration, and mathematically expresses the contribution of the model to the prediction performance.
Since the systemization of the least absolute shrinkage and selection operator (LASSO) [2], which simultaneously achieves variable selection and estimation, sparse estimation has been attracting considerable attention in fields such as signal processing [3,4] and machine learning [5,6]. In general, sparse estimation is formulated as the problem of minimizing an estimating function penalized by sparse regularization. The estimated variables then contain zero components, a property known as sparsity. By seeking a sparse representation of the system from among various candidates, one attempts to uncover a hidden rule that controls the system. As in LASSO, ℓ 1 regularization is widely used because of its convexity, which yields mathematical and algorithmic tractability [3]. In addition, non-convex regularization, such as the ℓ p (p < 1)-norm [7,8], has been studied to obtain a sparser representation than that given by ℓ 1 -norm regularization [9]. Furthermore, the smoothly clipped absolute deviation (SCAD) and adaptive LASSO penalties have been investigated [10,11,12] to acquire the oracle property, which the LASSO estimator does not possess.
The emergence of the estimation paradigm associated with sparsity requires the development of appropriate model selection criteria. In sparse estimation, the determination of the regularization parameter can be regarded as the selection of a model from a family of models that have different sparsities controlled by the regularization parameter. In addition to the cross-validation (CV) method [13], which is a simple numerical approach for sparse estimation [14,15], analytical model selection methods with lower computational costs have been developed. One such method involves estimating the generalized degrees of freedom (GDF) [16]. The GDF is a key quantity for Mallows' C p , a model selection criterion based on the prediction error [17]. In particular, the derivation of GDF has been studied in linear regression with a known variance [18]. The analytical forms of the GDF for LASSO [19] and elastic net regularization [20] are well known, but general expressions for other regularizations have not yet been derived.
In this paper, we propose an analytical method based on statistical physics for the derivation of GDF in sparse estimation. Certain aspects of statistical physics developed for random systems have already been applied to sparse estimation problems [21,22,23,24]. The analysis of typical properties provides physical interpretations of the problems based on phase transition pictures, and this contributes to the development of algorithms [25,26,27,28]. The statistical physical method can be applied to the estimation of GDF for sparse regularization. We show that GDF is expressed as the effective fraction of non-zero components for any sparse regularization. This expression is a mathematical realization of the meaning of GDF in terms of "model complexity" [19,29].
The remainder of this paper is organized as follows. Section 2 summarizes the model selection criterion discussed in this paper and highlights some previous related studies on sparse estimation. Section 3 explains our problem setting for the estimation of GDF. Sections 4 and 5 describe our analytical method based on the replica method for sparse estimation. Section 6 presents the behaviour of GDF for ℓ 1 , elastic net, ℓ 0 , and SCAD regularization. Section 7 proposes the numerical calculation of GDF using the belief propagation algorithm, and discusses the generality of the results. In Section 8, the approximation performance of our method for the calculation of GDF is examined in the case of ℓ 0 regularization. Finally, Section 9 concludes the paper.

Overview of model selection
In this section, we explain the criteria for model selection discussed in this paper. In addition, we summarize previous studies and identify our contributions. We focus on the parametric model, where the true generative model of z, denoted by q(z), is approximated by p(z|θ) with a parameter θ ∈ Θ ⊂ R N , where Θ is a parameter space. The parameter is estimated under the given model to effectively describe the true distribution using training data w = {w µ } (µ = 1, · · · , M, w µ ∼ q(w µ )). Let us prepare a set of candidate models M = {p 1 (z|θ̂ 1 ), · · · , p m (z|θ̂ m )} for the approximation of the true distribution, where θ̂ k is the estimated parameter under the k-th model. Model selection is then the problem of adopting a model based on a certain criterion.

Information criterion
The information criterion evaluates the quality of the statistical model based on Kullback-Leibler (KL) divergence. KL divergence describes the closeness between the true distribution q(z) and the assumed distribution p(z|θ̂ ML (w)) as D KL (q||p) = E z∼q(z) [log q(z)] − E z∼q(z) [log p(z|θ̂ ML (w))] (1), where θ̂ ML (w) is the maximum likelihood estimator from the training sample w. The dependency on the model appears only in the second term of (1), called the predicting log-likelihood, i.e. l(w) ≡ E z∼q(z) [log p(z|θ̂ ML (w))]. Therefore, the maximization of the predicting log-likelihood is the basis of the information criterion. Unfortunately, it is generally impossible to evaluate the predicting log-likelihood, because we cannot determine the true distribution. We therefore define an estimator of the predicting log-likelihood using the empirical distribution, which corresponds to the maximum log-likelihood. The expected value of the difference between the predicting log-likelihood and the maximum log-likelihood, termed the bias, is evaluated with respect to q(w) = ∏ µ q(w µ ). The information criterion is defined as an unbiased estimator of the negative predicting log-likelihood, IC(w) = −2 × (maximum log-likelihood) + 2 b̂(w), where b̂(w) is an unbiased estimator of the bias, and the coefficient 2 is a conventional value. The optimal model is defined as that which minimizes IC(w) among the models in M. Intuitively, the first and second terms represent the training error and the complexity of the model, respectively. As the complexity of the model increases, the model can express various distributions. However, overfitting is likely to occur, which hampers the prediction of unknown data. The information criterion selects the model that achieves the best trade-off between the training error and the level of model complexity.
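The trade-off between training error and complexity can be made concrete with a small numerical sketch. The helper below is a hypothetical illustration for a Gaussian model with known variance (the normalization by the sample size M, and all parameter values, are assumptions of this sketch, not the paper's exact expressions):

```python
import numpy as np

def aic_gaussian(y, y_fit, k, sigma2=1.0):
    # illustrative normalized criterion: training error plus a
    # complexity penalty 2k/M for a model with k parameters
    M = len(y)
    return np.mean((y - y_fit) ** 2) / sigma2 + 2.0 * k / M

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 0.1 * rng.standard_normal(30)   # data from a linear model

# candidate models: degree-1 polynomial (k = 2) vs degree-9 (k = 10)
fit1 = np.polyval(np.polyfit(x, y, 1), x)
fit9 = np.polyval(np.polyfit(x, y, 9), x)

aic1 = aic_gaussian(y, fit1, k=2, sigma2=0.01)
aic9 = aic_gaussian(y, fit9, k=10, sigma2=0.01)
# the richer model always fits the training data at least as well ...
assert np.mean((y - fit9) ** 2) <= np.mean((y - fit1) ** 2) + 1e-12
# ... so only the complexity penalty distinguishes the two criteria values
```

The first term always favours the larger model; the penalty 2k/M is what allows the criterion to reject parameters that only fit noise.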
The values of b can be calculated asymptotically. In particular, when the statistical model contains the true model, namely a parameter θ * exists such that q(z) = p(z|θ * ), the information criterion is known as Akaike's information criterion (AIC), where the bias term b is reduced to the dimension of the parameter θ [1].
The criterion explained thus far is for models constructed by maximum likelihood estimation. To determine the parameter with other learning strategies, we focus on maximum likelihood estimation under regularization, where the GDF facilitates the extension of the information criterion [29]. A general expression of GDF is naturally derived from another model selection criterion, namely, Mallows' C p [17].

Mallows' C p and generalized degrees of freedom
The prediction of unknown data provides another criterion for the evaluation of a model. We define the squared prediction error per component as E z [||z − ŵ||²₂]/M, where ŵ is the estimate of w and z ∈ R M is independent of w ∈ R M , but each component of z is generated according to the same distribution as the corresponding component of w. When the training sample is generated as w ∼ N (µ, σ² I M ), where I M is the M-dimensional identity matrix, Mallows' C p , calculated as c p (w) = err train (w) + (2σ²/M) d̂f(w) (6), is an unbiased estimator of the prediction error. Here, err train (w) = ||w − ŵ(w)||²₂/M is the training error and d̂f(w) is an unbiased estimator of GDF, defined by df = (1/σ²) Σ µ cov(w µ , ŵ µ (w)), which quantifies the complexity of the model [19,29]. In the framework of C p , the optimal model is defined as that which minimizes c p (w) among the models in M.
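The covariance definition of GDF can be checked numerically. The sketch below estimates df by Monte Carlo for a hypothetical linear shrinkage rule ŵ = c·w, for which the covariance formula gives df = cM exactly (the rule and all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma2, trials, c = 50, 1.0, 20000, 0.7
mu = np.zeros(M)

def w_hat(w):
    # hypothetical shrinkage rule: cov(w_mu, c*w_mu) = c*sigma2 per
    # component, so df = sum_mu cov(w_mu, what_mu)/sigma2 = c*M
    return c * w

W = mu + np.sqrt(sigma2) * rng.standard_normal((trials, M))
P = w_hat(W)
cov = ((W - W.mean(axis=0)) * (P - P.mean(axis=0))).mean(axis=0)
df_mc = cov.sum() / sigma2
assert abs(df_mc - c * M) < 1.0   # Monte Carlo estimate close to c*M = 35
```

For a nonlinear estimator the same Monte Carlo recipe applies, but the result no longer reduces to a simple count of parameters, which is exactly why GDF is needed.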
Another expression of GDF is given by [16] df = E w [Σ µ ∂ŵ µ (w)/∂w µ ] (9), which corresponds to the expectation of Stein's unbiased risk estimate (SURE) for the prediction error [30]. GDF was originally introduced as an extension of the degrees of freedom in the linear estimation rule to a general modelling procedure of the form (9) [16]. When the assumed model obeys a Gaussian distribution p(w|θ) ∝ exp(− 1/(2σ²) ||w − µ(θ)||²₂) with a known variance, and taking µ(θ̂ ML (w)) = ŵ(w), AIC (normalized by the number of training samples) is given by [19] AIC(w) = err train (w)/σ² + (2/M) d̂f(w) (10). Equations (6) and (10) indicate that model selection based on AIC and that based on C p give the same result; they are proportional to each other, c p (w) = σ² AIC(w).
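The divergence form (9) can likewise be verified numerically. For a soft-thresholding rule, the divergence simply counts the components that survive the threshold (a sketch; the thresholding rule and the value of the threshold λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, lam, eps = 100, 0.8, 1e-6
w = rng.standard_normal(M)

def soft(v):
    # componentwise soft thresholding at level lam
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# finite-difference divergence: sum_mu d w_hat_mu / d w_mu
base = soft(w)
div = 0.0
for i in range(M):
    w_p = w.copy()
    w_p[i] += eps
    div += (soft(w_p)[i] - base[i]) / eps

# for soft thresholding, each surviving component contributes derivative 1
assert abs(div - np.count_nonzero(base)) < 0.5
```

This is the mechanism behind the LASSO result discussed below: the divergence of the estimate, and hence the GDF, reduces to the number of non-zero components.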

Model selection for sparse regularization and our contributions
The regression problem with sparse regularization is formulated as the minimization min_x {e(x; w, A) + r(x; η)}, where e(x; w, A) measures the difference between the training data w and its fit using regression coefficients x under the predictor matrix A, and r(x; η) is the regularization term with the regularization parameter η that enhances zero components in x. The regularization parameter determines the number of predictors used in the expression of the data distribution, and the model distribution under the determined number of predictors can be regarded as a model; the set of such models, indexed by η ∈ H, constitutes the candidate family, where H is the support of the regularization parameter. Therefore, tuning the regularization parameter η corresponds to model selection. However, in general, the derivation of AIC based on the asymptotic expansion is not straightforwardly applicable to sparse regularization. In such cases, C p is useful for deriving the model selection criterion when the squared error is considered. In LASSO, it is mathematically proven that, when the number of training samples is greater than the number of predictors, the ratio of the number of non-zero regression coefficients to the number of training samples is an unbiased estimator of the degrees of freedom in a finite sample [19]. However, the derivation of GDF is analytically difficult for general sparse regularizations. To overcome this difficulty, GDF computation techniques have been developed using the parametric bootstrap method [18] and SURE [30,31].
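For LASSO, the cited result is easy to use in practice: fit the model, count the non-zero coefficients, and (in the paper's normalization) divide by the number of training samples. A minimal numpy-only sketch, using proximal gradient descent (ISTA) as an illustrative solver (the solver choice and all parameter values are assumptions of this sketch):

```python
import numpy as np

def lasso_ista(A, y, eta, iters=3000):
    # minimize (1/2)||y - A x||_2^2 + eta * ||x||_1 by proximal gradient
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x - A.T @ (A @ x - y) / L      # gradient step on the smooth part
        x = np.sign(v) * np.maximum(np.abs(v) - eta / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(2)
M, N = 80, 40
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = rng.standard_normal(N) * (rng.random(N) < 0.2)
y = A @ x_true + 0.1 * rng.standard_normal(M)

x_hat = lasso_ista(A, y, eta=0.1)
df_hat = np.count_nonzero(x_hat)           # degrees-of-freedom estimate for LASSO
obj = lambda x: 0.5 * np.sum((y - A @ x) ** 2) + 0.1 * np.sum(np.abs(x))
assert obj(x_hat) <= obj(np.zeros(N)) + 1e-9   # ISTA decreases the objective
assert 0 <= df_hat <= min(M, N)
```

Here `df_hat` is the unnormalized count; dividing by M gives the ratio used in the paper. No such closed-form count is available for general regularizations, which motivates the replica approach.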
In the present paper, we propose an estimation technique for GDF using the replica method under a replica symmetric (RS) assumption for linear regression with Gaussian i.i.d. predictors. Replica symmetric analyses of estimation problems under sparse regularization are presented in [21,22,23,24]; in these papers, the replica method is employed to study phase transitions or the properties of estimators. We extend this analytical method to the calculation of GDF, which is not taken into account in the current formalism of the replica analysis. The technique we propose is applicable to general sparse regularization. Using our method, the correspondence between GDF and the effective fraction of non-zero components in the large-system-size limit is shown to be independent of the form of the regularization. Our approach differs from previous methods, in which GDF has been derived for specific types of regularization. We apply our method to ℓ 1 , elastic net, ℓ 0 , and SCAD regularization to obtain the GDF. The results shown here for ℓ 1 and elastic net regularization are weaker than those in previous studies, where the unbiased estimator of GDF, d̂f, is derived for one instance of the predictor. However, our results are consistent with those previous results, which supports the validity of our approach. Furthermore, our method can be applied to non-convex sparse regularizations such as ℓ 0 and SCAD, and extends the discussion of GDF to general sparse regularization. For the ℓ 0 case, the solution under the RS assumption is always unstable against perturbations that break the replica symmetry, but we show that GDF under the RS assumption approximates the true value of GDF. In the case of SCAD regularization, our method can identify the most appropriate model based on the prediction error within the range of the RS assumption when the mean of the data is sufficiently small.
The generality of the result in terms of the correspondence between GDF and the effective fraction of nonzero components is discussed using a belief propagation algorithm for other predictor matrices.

Problem setting and formulation
We apply a linear regression model with sparse regularization r(x; η) = Σ i r(x i ; η), where η is a regularization parameter, to a set of training data y ∈ R M : x̂ = argmin_x { (1/2)||y − Ax||²₂ + r(x; η) } (12), where the column vectors of A = {A 1 , · · · , A N } ∈ R M ×N and the components of x ∈ R N correspond to predictors and regression coefficients, respectively. Here, the coefficient of the squared error, 1/2, is introduced for mathematical convenience. The variable x to be estimated here corresponds to the parameter θ in the previous section, and the number of non-zero components in x corresponds to the number of parameters used in the model. We introduce the posterior distribution of x, p β (x|y, A) = Z β (y, A)⁻¹ exp[−β{ (1/2)||y − Ax||²₂ + r(x; η) }] (13), where Z β (y, A) is the normalization constant. The distribution as β → ∞ is the uniform distribution over the minimizers of (12). The estimate of the solution of (12) under a fixed set of {y, A}, denoted by x̂(y, A), is given by x̂(y, A) = lim β→∞ ⟨x⟩ β , where ⟨·⟩ β denotes the expectation according to (13) at β. Using this estimate x̂(y, A) of x, the training sample y is estimated as ŷ = A x̂(y, A). To understand the typical performance of (12), we calculate the expectation of the training error with respect to y and A. At a sufficiently large system size N → ∞, we set the scaling relationship α = M/N ∼ O(1) and ρ̂ = K/N ∼ O(1), where K is the number of non-zero components of x̂. The training error relates to the free energy density f defined as [34] f ≡ − lim N→∞ lim β→∞ φ β /(Nβ), where φ β = E y,A [ln Z β (y, A)]. The expectation of the regularization term r is derived separately from f , as shown in the following section; hence, the training error follows from these two quantities. For the calculation of GDF, we introduce external fields κ and ν, and define the extended posterior distribution with normalization constant Z β,κ,ν (y, A). We define the extended free energy density through φ β,κ,ν = E y,A [ln Z β,κ,ν (y, A)], with f = f κ=0,ν=0 . From the extended free energy density we derive the quantities (22)-(24) as derivatives with respect to κ and ν. Using these, the GDF for a Gaussian training sample is derived, where m y ∈ R and σ² y ∈ R are the mean and variance of the training sample, respectively.
Further, C p given by (6) is an unbiased estimator of the prediction error. Hence, the expectation of the prediction error with respect to y and A is obtained from the expectation of the training error and the GDF.

Analysis
For the derivation of GDF, we resort to the replica method [32,33]. The RS calculations for ℓ 0 and ℓ 1 minimization are shown in the typical performance analyses of compressed sensing [21] and dictionary learning [24]. We summarize the analytical method and explain how it can be extended to the evaluation of GDF. Hereafter, we consider Gaussian i.i.d. predictors.

Replica method and replica symmetry
We calculate the generating function φ β using the replica identity E y,A [ln Z β (y, A)] = lim n→0 (∂/∂n) ln E y,A [Z β ⁿ(y, A)]. Assuming that n is a positive integer, we can express the expectation of Z β ⁿ(y, A) by introducing n replicated systems. We characterize the microscopic states of {x (a) } with the macroscopic overlap quantities q (ab) . Introducing the corresponding identity for all combinations of a, b (a ≤ b), the integration with respect to A leads to an expression in which each component of u (a) ≡ A x (a) , denoted by u µ (a) , has the probability distribution given in [21], and the function S(Q) is expressed using q̂ (ab) , the conjugate variable for the integral representation of the delta function in (30); Q̂ denotes the matrix representation of {q̂ (ab) }.
To obtain an analytic expression with respect to n ∈ R and take the limit as n → 0, we restrict the candidates for the dominant saddle point to those of RS form. For β → ∞, the RS order parameters are rescaled so that χ = β(Q − q), Q̂ = β⁻¹(Q̃ + q̃), and χ̂ = β⁻²q̃ remain of order unity, where Q̃ and q̃ denote the conjugate variables before rescaling. Under the RS assumption, the free energy density is given by an extremization, where extr_{Q,χ,Q̂,χ̂} denotes extremization with respect to the variables {Q, χ, Q̂, χ̂}. The function π r , where the subscript r denotes the dependency on the regularization, involves h RS (z; χ̂) = √χ̂ z, the random field that effectively represents the randomness of the problem introduced by y and A, and the Gaussian measure Dz = dz exp(−z²/2)/√(2π). The solution of x for the effective single-body problem (37), denoted by x* r (z; Q̂, χ̂), is statistically equivalent to the solution of the original problem (12). Therefore, the expectation of the regularization term is derived from x* r . The variables Q, χ, Q̂, χ̂ are determined by saddle point equations that satisfy the extremum conditions of the free energy density. Note that the functional form of the parameters χ̂ and Q̂ does not depend on the regularization, but the values of χ and Q are regularization-dependent. At the extremum, the parameters Q and χ are related to the physical quantities and can be expressed using x* r . The extended free energy density with the external fields κ and ν is defined analogously. To evaluate f κ,ν for non-zero κ and ν, one has to solve the saddle point equation at non-zero κ and ν to determine the saddle point value of χ. However, since one only needs the derivatives of f κ,ν at κ = ν = 0 to obtain GDF, the saddle point value of χ used in these evaluations remains the same as that obtained in the calculation of f . From (22)-(24), GDF is obtained in terms of χ and Q̂, which satisfy the saddle point equations (39) and (42), respectively. This expression is also independent of the form of the regularization.
The effective single-body problem (37) can be interpreted as a scalar estimation problem in which x is estimated on the basis of the prior (regularization) exp(−r(x; η)) and the random observation h/Q̂, which is assumed to be generated as h/Q̂ = x + n, where n ∼ N(0, Q̂⁻¹) is Gaussian observation noise. If one uses the observation itself in the single-body problem as an estimate of x, then it is an unbiased estimator of x and its variance is Q̂⁻¹. However, the actual variance of the estimates can change according to the regularization. The variable χ is the rescaled variance of the system, as expressed in (44). Therefore, GDF (48) corresponds to the effective fraction of the non-zero components of x (parameters), which is estimated by dividing the variance of the total system by that of one component when the observation is used as the estimate. The effective fraction of the non-zero components is measured under the assumption that the regularization does not change the variance of one component from Q̂⁻¹. If this assumption is correct and the fluctuation of the non-zero components is the unique source of the system's fluctuation, GDF is considered to be equal to the ratio of the number of non-zero components to the number of training samples. The RS solution discussed thus far loses local stability under perturbations that break the symmetry between replicas in a certain parameter region. Known as the de Almeida-Thouless (AT) instability [35], this phenomenon appears when the corresponding stability condition is satisfied. In general, when AT instability appears, we have to construct the full-step replica symmetry breaking (RSB) solution for an exact evaluation. However, the RS solution remains meaningful as an approximation [32,33].

Applications to several sparse regularizations
As shown in the previous section, the regularization dependence appears in the effective single-body problem (37). We now apply the analytical method to ℓ 1 , elastic net, ℓ 0 , and SCAD regularization. The ratio of the number of non-zero components to the number of training samples is denoted by δ = ρ̂/α, and we focus on the physical region δ ≤ 1, where the number of non-zero variables does not exceed the number of training samples.

ℓ 1 regularization
In ℓ 1 regularization, r(x; η) = η||x|| 1 = η Σ i |x i |, the maximizer of the single-body problem (37) is given by the soft-thresholding rule x* ℓ1 (z; Q̂, χ̂) = sgn(h) max(|h| − η, 0)/Q̂, where h = h RS (z; χ̂) and sgn(z) denotes the sign of z (with sgn(0) = 0). Figure 1 (a) shows the behaviour of x* ℓ1 at α = 1 and η = 1. Setting θ 1 = η/√(2χ̂), the fraction of non-zero components is given by the probability that the solution of the RS single-body problem (50) is non-zero, ρ̂ = erfc(θ 1 ). The regularization-dependent saddle point equations are (53) and (54); from (41) and (42), their solutions can be derived. Substituting the saddle point equations, the free energy density and the expectation of the regularization term are obtained, and hence the training error. The AT instability condition is satisfied only outside the region of interest for physical parameters. Equation (56) leads to the following expression for the GDF: df = ρ̂/α = δ. This expression is consistent with the result in [19], which verifies the validity of this RS analysis for the derivation of GDF.
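The relation ρ̂ = erfc(θ 1 ) can be checked directly: the soft-threshold solution is non-zero exactly when |h| = √χ̂ |z| exceeds η. A Monte Carlo sketch with illustrative (non-saddle-point) values of χ̂ and η:

```python
import math
import numpy as np

chi_hat, eta = 2.0, 1.0          # illustrative values, not saddle-point solutions
theta1 = eta / math.sqrt(2.0 * chi_hat)

rng = np.random.default_rng(3)
z = rng.standard_normal(10 ** 6)
h = math.sqrt(chi_hat) * z       # effective random field h_RS(z) = sqrt(chi_hat) * z
frac = np.mean(np.abs(h) > eta)  # soft-threshold estimate is non-zero iff |h| > eta

# P(|z| > eta/sqrt(chi_hat)) = erfc(theta1) for standard Gaussian z
assert abs(frac - math.erfc(theta1)) < 2e-3
```

This is a purely geometric statement about the threshold and the Gaussian field; the physics enters only through the saddle-point values of χ̂.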

Elastic net regularization
Elastic net regularization, given by r(x; η 1 , η 2 ) = η 1 ||x|| 1 + (η 2 /2)||x||²₂, was developed to encourage the grouping effect, which is not exhibited by ℓ 1 regularization, and to stabilize the ℓ 1 regularization path [36]. Here, the coefficient 1/2 is introduced for mathematical convenience, and η 2 = 0 and η 1 = 0 correspond to ℓ 1 regularization and ℓ 2 regularization, respectively. The solution of the effective single-body problem for elastic net regularization is given by x* en (z; Q̂, χ̂) = sgn(h) max(|h| − η 1 , 0)/(Q̂ + η 2 ). The behaviour of this solution is shown in Fig. 1 (b) for α = 1, η 1 = 1, and η 2 = 0.5, where θ en = η 1 /√(2χ̂). The fraction of non-zero components is given by ρ̂ = erfc(θ en ), and the regularization-dependent saddle point equations determine χ and Q. At the saddle point, the free energy density and the expectation of the regularization term can be simplified, and hence the training error follows. The AT instability condition (71) is never satisfied in the physical parameter region α > ρ̂, because its right-hand side is always greater than 1 for η 2 ≥ 0 and Q̂ > 0; therefore, the RS solution is always stable under symmetry breaking perturbations there. From (67), the GDF for elastic net regularization is given by (72), which reduces to the GDF for ℓ 1 regularization at η 2 = 0. An unbiased estimator of the GDF for one instance of A is derived in [20] as d̂f(A) = tr[A_A (A_Aᵀ A_A + η 2 I)⁻¹ A_Aᵀ], where A is the set of indices of non-zero components, and the columns {A i |i ∈ A} constitute the submatrix A_A . The number of components of A is denoted by |A|.
Our expression (72) for df corresponds to the typical value (or the expectation) of d̂f(A) for a Gaussian random matrix A. The physical implications suggested by the cavity method [32,38], which is complementary to the replica method, support the correspondence between Q̂ in the replica method and the Gram matrix of A. This correspondence indicates that our RS analysis is valid for the derivation of GDF under elastic net regularization. As shown in (72), the GDF for elastic net regularization deviates from δ = ρ̂/α. The ℓ 2 regularization term in elastic net regularization changes the variance of the non-zero components from Q̂⁻¹ to (Q̂ + η 2 )⁻¹. Hence, the effective fraction of the non-zero components measured by χ/Q̂⁻¹ does not coincide with δ. By defining the rescaled estimates of the single-body problem as x* res,en = (1 + η 2 /Q̂) x* en , the corresponding variance is reduced to χ res = ρ̂/(αQ̂) from (45), and this gives df = δ. This rescaling corresponds to that shown in [36], which was introduced to cancel out the shrinkage caused by ℓ 2 regularization and improve the prediction performance.
Taking the limit as η 1 → 0, the GDF for ℓ 2 regularization can be obtained, where the estimate is no longer sparse. The solution of the effective single-body problem is given by x* ℓ2 (z; Q̂, χ̂) = h/(Q̂ + η 2 ), and the function π follows accordingly. This leads to the GDF expression (76), which corresponds to the limit as δ → 1 of the elastic net result. An unbiased estimator of the GDF for one instance of A is proposed in [29] as d̂f(A) = tr[A (AᵀA + η 2 I N )⁻¹ Aᵀ]. Equation (76) corresponds to the expectation of d̂f(A) for a Gaussian random matrix A.

ℓ 0 regularization
The ℓ 0 regularization is expressed by r(x; η) = η||x|| 0 = η Σ i |x i |⁰, which counts the number of non-zero components in x. The solution to the single-body problem for ℓ 0 regularization is the hard-thresholding rule that sets x to h/Q̂ when |h| exceeds the threshold √(2ηQ̂) and to zero otherwise, where θ 0 = √(ηQ̂/χ̂). Setting the fraction of non-zero components to ρ̂ = erfc(θ 0 ), the saddle point equations have two solutions: finite χ and Q, and infinite χ and Q. We denote the finite and infinite solutions as S 1 = {χ 1 , Q 1 } and S 2 = {χ 2 = ∞, Q 2 = ∞}, respectively. Using (41) and (42), the finite solution can be simplified to (82)-(83), where ω is given by (84). By definition, χ 1 and Q 1 should be positive, and so (82)-(83) are only valid when α > ρ̂ + ω. According to a local stability analysis of (80) around 1/χ = 0, solution S 2 is a locally stable solution of the RS saddle point equation when α < ρ̂ + ω, whereas it is unstable when α > ρ̂ + ω. Therefore, the stable solution of the RS saddle point equation changes from S 1 to S 2 at α = ρ̂ + ω. Note that the stability discussed here refers to the RS solution, and does not relate to AT instability. The free energy density is simplified by substituting the saddle point equations as in (85). The second term of (85) corresponds to the expectation of the regularization term, and so the training error can be derived as well. The term ω, given by (84), in the GDF originates from the discontinuity of the single-body problem at the threshold √2 θ 0 , as shown in figure 1 (c). In addition to the fluctuation generated by the non-zero components, this discontinuity induces fluctuations in the system and increases the GDF from δ.
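As in the ℓ 1 case, the fraction ρ̂ = erfc(θ 0 ) follows from the hard-thresholding structure: x is kept only when the quadratic gain h²/(2Q̂) exceeds the penalty η. A Monte Carlo sketch with illustrative (non-saddle-point) parameter values:

```python
import math
import numpy as np

eta, Q_hat, chi_hat = 0.5, 1.0, 2.0   # illustrative, not saddle-point values
theta0 = math.sqrt(eta * Q_hat / chi_hat)

rng = np.random.default_rng(5)
z = rng.standard_normal(10 ** 6)
h = math.sqrt(chi_hat) * z
# hard threshold: non-zero iff h^2/(2 Q_hat) > eta, i.e. |z| > sqrt(2)*theta0
frac = np.mean(h ** 2 / (2.0 * Q_hat) > eta)
assert abs(frac - math.erfc(theta0)) < 2e-3
```

Unlike the soft threshold, the estimate jumps discontinuously from 0 to h/Q̂ at the threshold, which is the source of the extra term ω in the GDF.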
Under ℓ 0 regularization, AT instability always appears, but the GDF estimated under the RS assumption can be regarded as an approximation of the true value of the GDF, as shown in Sec. 8. Our calculations based on the one-step RSB assumption indicate that the form of the GDF, as the fraction of non-zero components plus the discontinuity term, is unchanged, although the values of these two terms do change (not shown).

SCAD regularization
SCAD regularization is a non-convex sparse regularization in which the estimator has the desirable properties of being unbiased, sparse, and continuous [10]. Mathematically, the SCAD estimator is asymptotically equivalent to the oracle estimator [10,11]. SCAD regularization is defined piecewise, with parameters λ and a that control the form of the regularization, and the maximizer of the single-body problem for SCAD regularization is given by equation (89).
Figure 1 (d) shows an example of the behaviour of the maximizer x* S at a = 5, λ = 0.1, and η = 1, where the three thresholds are given by θ S1 = λη/√(2χ̂), θ S2 = λ(Q̂ + η)/√(2χ̂), and θ S3 = aλQ̂/√(2χ̂). The threshold θ S1 gives the fraction of non-zero components as ρ̂ = erfc(θ S1 ). Between the thresholds θ S1 and θ S2 , and beyond the third threshold θ S3 , the estimate x* S behaves like the ℓ 1 and ℓ 0 estimates, respectively. Between θ S2 and θ S3 , the estimate interpolates linearly between the ℓ 1 and ℓ 0 estimates. The function π for SCAD regularization, the regularization-dependent saddle point equations, and the expectation of the regularization term follow in the same manner as before. Substituting these into the free energy density, we get err train = χ̂.
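The three regimes of the SCAD estimator mirror the standard Fan-Li thresholding rule for a unit quadratic term, sketched below (this is the textbook operator, not the paper's exact expression, which additionally involves Q̂ and η):

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    # minimizer of 0.5*(x - z)^2 + SCAD_{lam,a}(x), following Fan & Li
    az = abs(z)
    if az <= 2.0 * lam:                       # l1-like: soft thresholding
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:                         # transient: linear interpolation
        return ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return z                                  # l0-like: no shrinkage (unbiased)

lam, a = 1.0, 3.7
# the operator is continuous at both regime boundaries
assert abs(scad_threshold(2.0 * lam, lam, a) - lam) < 1e-12
assert abs(scad_threshold(a * lam, lam, a) - a * lam) < 1e-12
# large inputs are left untouched, the source of the unbiasedness property
assert scad_threshold(10.0, lam, a) == 10.0
```

The continuity at the boundaries is what distinguishes SCAD from hard thresholding, while the identity map beyond aλ removes the bias of soft thresholding.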
The AT instability condition reduces to that for ℓ 1 regularization as a → ∞.
There are three solutions for {Q, χ}, denoted by S 1 , S 2 , and S 3 . For sufficiently large a, the finite solution S 1 is a locally stable solution of the RS saddle point equation when condition (100) holds. Beyond the range of (100), the stable RS solution is replaced by S 2 . For sufficiently small η, the stable RS solution can switch from S 2 to S 3 depending on the SCAD parameters. As a → ∞, we get π 4 → 0 and solution S 1 is always a stable RS solution, satisfying (100); hence, the GDF reduces to that for ℓ 1 regularization. The second term of the GDF for solution S 1 arises from the weight between the thresholds θ S2 and θ S3 . The manner of assigning the non-zero components to this transient region between the ℓ 1 and ℓ 0 estimates increases the fluctuation in the system, and the GDF does not coincide with δ. We note the pathology of solution S 3 under the RS assumption. As shown in the solution to the single-body problem (89) (Fig. 1 (d)), the magnitude relation θ S1 ≤ θ S2 ≤ θ S3 should hold. However, the Q, χ → ∞ solution leads to θ S3 → 0 with finite θ S1 = θ S2 . Solution S 3 appears in the region where AT instability appears, and so this non-physical phenomenon is considered to be caused by an inappropriate RS assumption. Hence, we must construct the RSB solution to correctly describe the GDF corresponding to solution S 3 . Figure 2 illustrates the δ-dependence of the GDF for ℓ 1 , elastic net with η 2 = 0.1, ℓ 0 , and SCAD regularization with a = 8 and λ = 1 at α = 0.5, m y = 0, and σ² y = 1. At each point of δ, the regularization parameters η for ℓ 1 , ℓ 0 , and SCAD regularization and η 2 for elastic net regularization are controlled such that δ = ρ̂/α. Under ℓ 1 regularization, the GDF is always equal to δ, as shown in (62). In elastic net regularization, the GDF falls further below δ as the ℓ 2 parameter η 2 increases. For ℓ 0 regularization, the RS solution S 1 is unstable at δ > 0.248 in this parameter region, and is replaced by solution S 2 , which gives df = 1.
In SCAD regularization, the solution S 1 loses local stability within the RS assumption at δ > 0.924, and AT instability appears before the RS solution S 1 becomes unstable, at δ > 0.866 (denoted by the dashed vertical line in figure 2). Figure 3 shows the prediction error (26) for the same parameter region as figure 2. At σ² y = 1, the prediction error is equivalent to the expectation of AIC. In the entire range of δ shown in figure 3 (a), the RS solutions for ℓ 1 , elastic net, and SCAD regularization are stable under symmetry breaking perturbations. Thus, we can identify the value of δ that minimizes the prediction error for each regularization. In this case, the models with δ = 0.085, δ = 0.170, and δ = 0.072 (marked in figure 3 (a)) are selected for ℓ 1 , elastic net, and SCAD regularization, respectively. In the current problem setting, sparse estimation with SCAD regularization minimizes the prediction error within the RS region when the mean of the data is sufficiently small. To identify the appropriate model using RS analysis, it is useful to standardize the data. As shown in figure 3 (b), the magnitude of the prediction errors at δ < δ 1 = 0.028, δ 1 < δ < δ 2 = 0.045, and δ > δ 2 runs in descending order as elastic net > SCAD > ℓ 1 , SCAD > elastic net > ℓ 1 , and SCAD > ℓ 1 > elastic net, respectively. The estimates x̂ have different supports depending on the regularization, even when the regularization parameters are controlled to give a certain value of δ. A comparison of the prediction errors within the framework of RS analysis guides the choice of regularization for each value of δ.

Parameter dependence of GDF and prediction error
The prediction error for ℓ 0 regularization under the RS assumption is shown in figure 3 (c) alongside those for the other regularization types. The RS prediction error is minimized at δ = 0, which indicates that the appropriate model under RS analysis has only O(1) non-zero components. Our analysis assumes that the number of non-zero components is O(N); hence, the derived model selection criterion cannot identify the appropriate model in the current problem setting for ℓ 0 regularization.

Belief propagation algorithm for sparse regularization
The correspondence between replica analysis and the belief propagation (BP) algorithm suggests that the typical properties of BP fixed points in the large-system-size limit can be described by the RS saddle point [32,38]. Thus, we may expect the GDF obtained numerically by the BP algorithm at finite system sizes to be consistent with the RS analysis. For ordinary least squares with a regularization that can be written as (12), a tentative estimate of the i-th component at step t, denoted by x̂_i^(t), is updated according to the BP rules of [25,26,27], and the data estimates are set as ŷ^(t) = Ax̂^(t). The update involves a variable χ_i^(t), whose determination rule depends on the regularization; for ℓ 1 and elastic net regularization, χ_i^(t) is given by the corresponding explicit update rules. For these regularizations, the GDF at the BP fixed point converges to that given by RS analysis as the system size increases. However, for these regularizations the GDF can also be calculated using least angle regression (LARS) [37], which has a lower computational cost than the BP algorithm; hence, there is no need to introduce the BP algorithm.

In the case of ℓ 0 regularization, χ_i^(t) is determined analogously; unfortunately, AT instability appears across the whole parameter region for ℓ 0 regularization, and the BP algorithm does not converge. For SCAD regularization, no numerical method for the precise evaluation of the GDF has previously been proposed. As shown in the previous section, SCAD regularization admits a parameter region where the RS solution is stable. Therefore, the BP algorithm, with χ_i^(t) determined by the SCAD-specific rule, is useful for numerically calculating the GDF for SCAD regularization. After updating the estimates x̂^(t), we can numerically evaluate the GDF using (8) with the data estimates ŷ^(t). To ensure convergence, appropriate damping is required at each update.
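For concreteness, the SCAD single-component step can be illustrated with the standard SCAD thresholding (proximal) rule of Fan and Li under unit effective curvature, together with the damping step mentioned above; the function names and the unit-curvature assumption are ours, not the paper's exact BP variables.

```python
import numpy as np

def scad_prox(z, lam, a=3.7):
    """SCAD thresholding rule (unit curvature, a > 2): soft thresholding for
    |z| <= 2*lam, linear interpolation in the transient region
    2*lam < |z| <= a*lam, and the identity (l0-type) beyond a*lam."""
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    mid = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(np.abs(z) <= 2.0 * lam, soft,
                    np.where(np.abs(z) <= a * lam, mid, z))

def damped_update(x_old, x_prop, gamma=0.5):
    """Damped update used to stabilize the iteration (0 < gamma <= 1)."""
    return (1.0 - gamma) * x_old + gamma * x_prop
```

The rule is continuous at |z| = 2λ and |z| = aλ, interpolating between the ℓ 1-type and ℓ 0-type regimes; it is this transient region that produces the extra GDF contributions discussed above.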
As for the replica analysis, we apply the BP algorithm to the case of Gaussian random data y and predictors A. Figure 4 shows the numerically calculated GDF given by the BP algorithm at N = 200, m_y = 0, and σ_y^2 = 1. The BP algorithm is updated until |x̂_i^(t) − x̂_i^(t−1)| < 10^{-10} for each component, and the result is averaged over 100 realizations of {y, A}. The solid and dashed lines represent the analytical results given by the replica method for the RS and RSB regions, respectively. In the RS regime, the numerically calculated GDF from the BP algorithm coincides with that evaluated by the replica method.

Perspective for other predictors
The RS analysis discussed so far applies to Gaussian i.i.d. random predictors, and its extension to other predictors is not straightforward. To check whether the GDF is given by the effective fraction of non-zero components (χ/Q̂^{-1}) at the RS saddle point for other predictor matrices, we resort to the BP algorithm. The typical properties of the variables at the BP fixed point, denoted by χ_i^* and Q̂_i^*, are described in the replica method by χ and Q̂ of the RS saddle point in the large-system-size limit. Therefore, it is reasonable to define the effective fraction of non-zero components at the BP fixed point, δ_eff^BP, in terms of χ_i^* and Q̂_i^*, where the overline in its definition represents the average over y and A. If δ_eff^BP and the GDF from (8) coincide at the BP fixed point, the correspondence between the GDF and the effective fraction of non-zero components can be considered to hold at the RS saddle point. For ℓ 1, elastic net, and SCAD regularization, we examine the behaviour of the GDF and δ_eff^BP under two types of predictors [36] in a parameter region where the BP algorithm converges.
Example 1: Gaussian predictors with pairwise correlation. The correlation between predictors A_i and A_j is set to c^{|i−j|}, and the predictors are normalized such that ||A_i||_2^2 = 1.

Example 2: Predictors generated by the construction used in [19] and [20], in which the underlying random variables are i.i.d. Gaussian with mean zero and variance 1, and T is a parameter that takes an integer value smaller than (N − 1)/3. The predictors are normalized such that ||A_i||_2^2 = 1.

Figures 5 and 6 show the δ-dependence of the GDF and δ_eff^BP at the BP fixed point for ℓ 1, elastic net, and SCAD regularization at N = 1000 and α = 0.5 (M = 500). Each point is averaged over 1000 samples of {y, A}. Under ℓ 1 and elastic net regularization, the GDF value calculated as the expectation of the unbiased estimator derived in [19,20] is shown as a dashed line. In both examples, the correspondence between the GDF and δ_eff^BP holds for each regularization, although a small discrepancy due to finite-size effects appears at large δ. Furthermore, the values of δ_eff^BP and the GDF at the BP fixed point are consistent with those of previous studies for ℓ 1 and elastic net regularization. The parameters c and T in these examples do not influence the results, although they do affect the convergence of the BP algorithm. These results imply that the correspondence between the GDF and the effective fraction of non-zero components holds beyond Gaussian i.i.d. predictors. For both examples, the convergence of the BP algorithm is worse than with Gaussian i.i.d. predictors, particularly at large δ. Thus, the algorithm must be improved to enable a discussion of the large-δ region and application to other predictor matrices.
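The predictor ensemble of Example 1 can be generated as follows; this sketch assumes that the stated correlation c^{|i−j|} is the population correlation between columns of A, imposed through a Cholesky factor and followed by column normalization.

```python
import numpy as np

def correlated_predictors(M, N, c, rng):
    """Gaussian predictors whose columns have pairwise correlation c^{|i-j|},
    normalized so that each column has unit l2 norm."""
    idx = np.arange(N)
    cov = c ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    L = np.linalg.cholesky(cov)
    A = rng.standard_normal((M, N)) @ L.T            # rows drawn from N(0, cov)
    return A / np.linalg.norm(A, axis=0)             # enforce ||A_i||_2 = 1

rng = np.random.default_rng(0)
A = correlated_predictors(20000, 5, 0.5, rng)
```

With many rows, the empirical column correlations approach c^{|i−j|}, e.g. about 0.5 for adjacent columns and 0.25 at lag 2 when c = 0.5.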
In the case of ℓ 0 regularization, the BP algorithm does not converge. Thus, we cannot confirm the generality of the result using the properties of BP fixed points. The replica analysis for non-Gaussian i.i.d. predictor matrices is a necessary step towards verifying the generality of the result for ℓ 0 regularization. Although the range of applicable predictor matrices for replica analysis is narrower than that for the BP algorithm, the analysis of rotationally invariant predictor matrices offers a promising means towards demonstrating the generality [21].

RS solution approximates the GDF for ℓ 0 regularization
For ℓ 0 regularization, the RS solution is unstable in the whole parameter region, but it is known that such a solution often approximates the true solution well. One can numerically obtain the exact solution of (12) for ℓ 0 regularization by an exhaustive search, and thus calculate the exact value of the GDF at small system sizes. Comparing the GDF under RS analysis with its exact value, we can evaluate the approximation performance of the RS solution. Figure 7 compares the GDF approximated by the RS solution with its exact value for N = 20, 30, and 50, calculated from 1000 samples of {y, A}. As N increases, the exact GDF approaches the RS solution, although strong finite-size effects are observed in the small-δ region. For a comparison at larger system sizes, we must develop a computationally feasible algorithm for obtaining precise solutions of (12) for ℓ 0 regularization, but this is beyond the scope of the present paper.
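The exhaustive search can be sketched as follows; the penalized form (1/2)||y − Ax||² + η||x||_0 is our assumed stand-in for (12), and the small random instance is illustrative.

```python
import numpy as np
from itertools import combinations

def l0_exhaustive(A, y, eta):
    """Exact minimizer of (1/2)||y - Ax||^2 + eta * ||x||_0 by enumerating
    every support set; feasible only at small N (2^N least-squares solves)."""
    M, N = A.shape
    best_val, best_x = 0.5 * float(y @ y), np.zeros(N)   # empty support
    for k in range(1, N + 1):
        for S in combinations(range(N), k):
            AS = A[:, list(S)]
            xs, *_ = np.linalg.lstsq(AS, y, rcond=None)
            r = y - AS @ xs
            val = 0.5 * float(r @ r) + eta * k
            if val < best_val:
                best_val = val
                best_x = np.zeros(N)
                best_x[list(S)] = xs
    return best_x, best_val

rng = np.random.default_rng(2)
M, N, eta = 6, 8, 0.5
A = rng.standard_normal((M, N)) / np.sqrt(M)
y = rng.standard_normal(M)
x_hat, best_val = l0_exhaustive(A, y, eta)
```

The exact GDF at small sizes can then be estimated from such solutions, e.g. by finite-difference sensitivities of ŷ = Ax̂, as done for figure 7.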

Summary and Conclusion
We have derived the GDF using a method based on statistical physics. Within the range of the RS assumption, the GDF is represented as χ/Q̂^{-1}, where χ and Q̂^{-1} correspond to the rescaled variance around the estimates and the variance of the estimates when the regularization term is omitted, respectively. This expression does not depend on the type of regularization, and indicates that the GDF can be regarded as the effective fraction of non-zero components.
We applied our method for the derivation of GDF to ℓ 1 , elastic net, ℓ 0 , and SCAD regularization. Our RS analysis was stable for ℓ 1 and elastic net regularization in the entire physical parameter region, and the GDFs for these regularizations were consistent with previous results. This correspondence supports the validity of our RS analysis. The model selection criterion of prediction error was derived by combining the GDF with the training error. Theoretical predictions in the RS phase were then algorithmically achieved using the belief propagation method.
It has been suggested that the equivalence between the GDF and δ, the ratio of the number of non-zero components to the number of samples, only holds for ℓ 1 regularization [19]. Our representation of the GDF as the effective fraction of non-zero components clarifies the origin of the contributions to the GDF in excess of the fraction of non-zero components.
• In ℓ 1 regularization, the GDF is given by δ because there is no factor that induces fluctuations other than the non-zero components.
• Elastic net regularization changes the variance of the components, and so the GDF does not coincide with δ. However, as with ℓ 1 regularization, the nonzero components are the unique source of fluctuations, and so the correspondence between GDF and δ can be recovered by appropriately rescaling the estimates.
• In ℓ 0 regularization, the discontinuity of the estimates leads to additional fluctuations besides those caused by the non-zero components. Hence, the GDF is greater than δ.
• In SCAD regularization, the assignment of non-zero components to the transient region between ℓ 1 -type estimates and ℓ 0 -type estimates induces additional components in the GDF.
For regularizations with AT instabilities in certain parameter regions (e.g. ℓ 0 , SCAD, and other non-convex regularizations), it is generally necessary to construct the full-step RSB solution. In the case of SCAD regularization, model selection based on the prediction error under RS analysis can be achieved in the current problem setting. Even when the RS solution is unstable, the prediction error gives a meaningful approximation of the true value.
Further development of our method for the general function of prediction error [39] and real data will be useful for practical applications. The BP algorithm discussed here can numerically calculate the GDF and model selection criterion for practical settings at reasonable computational cost.