Statistical mechanical analysis of sparse linear regression as a variable selection problem

An algorithmic limit of compressed sensing or related variable-selection problems is analytically evaluated when a design matrix is given by an overcomplete random matrix. The replica method from statistical mechanics is employed to derive the result. The analysis is conducted through evaluation of the entropy, an exponential rate of the number of combinations of variables giving a specific value of fit error to given data which is assumed to be generated from a linear process using the design matrix. This yields the typical achievable limit of the fit error when solving a representative $\ell_0$ problem and includes the presence of unfavourable phase transitions preventing local search algorithms from reaching the minimum-error configuration. The associated phase diagrams are presented. A noteworthy outcome of the phase diagrams is that there exists a wide parameter region where any phase transition is absent from the high temperature to the lowest temperature at which the minimum-error configuration or the ground state is reached. This implies that certain local search algorithms can find the ground state with moderate computational costs in that region. Another noteworthy result is the presence of the random first-order transition in the strong noise case. The theoretical evaluation of the entropy is confirmed by extensive numerical methods using the exchange Monte Carlo and the multi-histogram methods. Another numerical test based on a metaheuristic optimisation algorithm called simulated annealing is conducted, which well supports the theoretical predictions on the local search algorithms. In the successful region with no phase transition, the computational cost of the simulated annealing to reach the ground state is estimated as the third order polynomial of the model dimensionality.


Introduction
Compressed sensing is a technique used to recover a high-dimensional signal from a limited number of measurements by utilising the fact that the signal of interest has redundancy and thus can be "sparse"; many of the coefficients are set to zero when described with an appropriate basis. This technique has a long history [1,2,3], but it has recently attracted increased attention as its high performance has been demonstrated in recent influential papers [4,5,6,7,8]. There has been a surge of research of compressed sensing, which is based on a general idea that the signal has a sparse representation on an appropriate basis, because the idea can be shared in many other contexts such as data compression, multivariate regression, and variable selection.
This trend has triggered a major evolution in techniques of signal and information processing, which is gradually forming a new framework called "sparse modelling" [9, 10,11,12].
For clarity, we provide a concise mathematical form to the problem treated here. Suppose a data vector y ∈ R M is generated by the following linear process with a design matrix A ∈ R M ×N and a signal vector x 0 ∈ R N such that: where ξ is a noise vector, the component of which is assumed to be an independent and identically distributed (i.i.d.) variable from the normal distribution with zero mean and variance σ 2 ξ , N (0, σ 2 ξ ). In the context of compressed sensing, the design matrix A represents the measurement process, and given A and y, we try to infer x 0 for the situation M < N . This is an underdetermined problem and the sparsity assumption that the number of nonzero components of x 0 is smaller than M is needed for solving it. With this assumption, the perfect reconstruction of x 0 is possible if the noise is absent. The most naive algorithm to achieve this is the exhaustive search, which selects the sparsest set of variables among error-free ones. This is clearly infeasible if the model dimensionality N is large, and more efficient algorithms or approximations should be tailored. A common approximation is to relax the sparsity constraint. Many studies have been conducted along this direction, and some theoretical studies demonstrated that the perfect reconstruction of x 0 is possible under reasonable conditions even under such a relaxation [13,14,15]. Associated efficient algorithms achieving perfect reconstruction in the noiseless case have been developed [16,17,18,19].
Some degree of compromise such as the relaxation above appears to be unavoidable, because variable selection in the present problem is NP-hard in the worst case [20]. However, more recent works suggest that, even without such relaxation, variable selection can be achieved at reasonable expense for "typical" cases [21,22] where the design matrix is assumed to be i.i.d. from the normal distribution. Their formulation is based on the Bayesian framework assuming the signal's generative process is known. The algorithmic limit was computed by using nonrigorous statistical mechanical techniques, and an associated message-passing algorithm was developed and shown to achieve the limit, and those results have been supported from a firmer mathematical basis [23].
However, the Bayesian framework is not always preferred in the context of signal processing. This is because the signal's generative process is not necessarily evident and is difficult to model in many practical situations. In such cases, it may be better to focus less on the signal sources and rely more on the less informative prior. According to this idea, in the context of data compression, the present authors recently proposed a variable selection criterion based on the following widely-used optimisation formulation [24]: where ||x|| k = ( i |x i | k ) 1/k denotes the k norm and the 0 norm ||x|| 0 is assumed to give the number of nonzero components of x. Our basic idea in [24] is to compare all variable sets of size K and to organise them as an ensemble in the statistical mechanical sense by regarding the fit error as energy. Our statistical mechanical analysis, again non-rigorous and performed under the same assumption on the design matrix as [21,22], showed that the configuration space of the variable set is rather "smooth", implying that certain local search algorithms can efficiently find the minimum-error variable set. Based on this finding, we developed an algorithm based on the so-called simulated annealing (SA) algorithm [25], which is a Monte Carlo (MC)-based optimisation solver, and demonstrated that it can efficiently find the minimum-error set for a wide range of parameters [26,27]. These results again suggest that variable selection in the present setting can be efficiently achieved, even without relaxation or resorting to the Bayesian framework.
The success of the MC based method further motivates us to analyse the property of the ensemble of the variable set in detail. The previous analysis [24] was limited to the data compression context, and another analysis directly relevant to compressed sensing or multivariate regression is desirable. The present paper addresses this point. The main difference from [24] is the presence of the true signal x 0 in eq. (1). This introduces other criteria on the reconstructed signal such as the prediction ability and the error to the true signal. We provide a quantitative analysis for these issues.
The remainder of the paper is organised as follows. In sec. 2, we state the problem setting and the formulation which we employ in this paper. The meaning of the formulation in relation to the Bayesian framework is also explained. In sec. 3, we provide the analytical solution of the fit error and related quantities derived by the statistical mechanical formulation. In sec. 4, we present the results of numerical experiments using a careful MC method to support our analytical computations. The performance of the SA algorithm for finding the minimum-error set is also revisited. The final section concludes the study.

Problem setting and notation
As noted in the sec. 1, the data is supposed to be generated by the linear process (1). We assume the true signal x 0 is K 0 -sparse and K 0 is less than M : ||x 0 || 0 = K 0 < M . For notational convenience, we introduce the 0 operator of a vector x as |x| 0 , which results in a binary vector whose component is (|x| 0 ) i = 1 if x i = 0, or (|x| 0 ) i = 0 otherwise. The support of the true signal is represented by a support vector c 0 ≡ |x 0 | 0 .
We are interested in the fit quality to the data y for a given set of variables described by a support vector c ∈ {0, 1} N , on a linear model basis. The fit quality is thus quantified by a mean squared error (MSE) for y and the coefficients of the chosen variables are assumed to be optimised to describe y. The optimised coefficients are written in the following form: x(c|y, A) = arg min where (c • x) i = c i x i represents the Hadamard product. To eliminate an ambiguity in eq. (3), the coefficients of variables out of the support are set to zero, c i = 0 ⇒x i (c) = 0, to provide consistency with the support operator: c = |x(c)| 0 . We denote the corresponding MSE with y by This is called the output MSE throughout this paper. In addition to the output MSE, we are interested in the MSE with the true signal defined by Hereafter this is termed the input MSE, and the hat symbol is assumed to represent an estimator of the corresponding quantity.
The primary object of our investigation is the histogram of y (c|y, A) when changing the set of variables c. There are two reasons for evaluating this quantity. One is related to the reconstruction of x 0 . In this context, lower values of the output MSE are not always preferable and we need more global information regarding the set of variables to obtain a good solution. The other involves possible algorithmic implications. To find a small output-MSE configuration, we usually conduct a local recursive search from certain initial conditions. The performance of such local search algorithms is strongly affected by the structure of the configuration space of c. The histogram of y (c|y, A) provides the necessary information about the structure.
More specifically, we evaluate the exponential rate of the histogram in the large size limit N → ∞ while keeping α = M/N and ρ = K/N = i c i /N finite. This quantity is simply the statistical mechanical entropy defined by Entropies or similar thermodynamic functions associated with certain optimisation problems have provided algorithmic implications and benefits in several contexts such as information theory, computer science, and neural networks [28,29,30,31,32,33,34,35,36,37,38,39]. We see this strategy actually works well in the present problem below.

Outline of analysis
We outline the analysis of the entropy. Evaluation of the entropy is replaced with an assessment of a generating function that is a Legendre transform of the entropy. Assumptions to enable this assessment are stated as well.

Generating function: Legendre transform of entropy
Direct computation of the entropy is not easy, and a more systematic method of evaluation is available using a Legendre transform of the entropy, which we call free entropy throughout this study [40]. We introduce a partition function G(µ|ρ, y, A) as where δ(·) is the delta function. The free entropy is represented by the exponential rate of G. Considering the definition of the entropy and assuming that the saddle-point method is applicable in the large N limit, we can easily see that the following relation holds where supp(f (x)) denotes the support of the function f (x). Hence g and s are the Legendre transforms of each other. Assuming s( y ) is concave with respect to y , then g and s have a one-to-one correspondence and are connected by the control parameter µ(≥ 0). This control parameter plays the role of "inverse temperature" in physics. The maximiser in eq. (8) and the corresponding entropy, y (µ|ρ, y, A) and s(µ|ρ, y, A) = s( y (µ|ρ, y, A)|ρ, y, A), are thus parameterised as s(µ|ρ, y, A) = g(µ|ρ, y, A) − µ ∂ ∂µ g(µ|ρ, y, A).
Employing these relations, we can easily handle the dominant output MSE and determine the corresponding entropy from g. However, there are two difficulties in the evaluation of g. One is the dependence on y and A, and the other is the presence of the least squares problem in the definition of y (c).
The first problem is overcome as follows. The free entropy g is a self-averaging quantity, as is the entropy. Typical values of self-averaging quantities are in accordance with their averaged values in the large N limit. This means that we can calculate the averaged value of g over A and y = Ax 0 + ξ instead of directly considering the entropy's dependence on those quantities. We denote the average over x 0 , ξ and A by square brackets with appropriate subscripts as [· · · ] x 0 ,ξ,A . Unfortunately, this average is not easy to calculate. The so-called replica method is a great aid in such a situation, and is symbolised by the following identity In addition to this identity, we assume that n is a positive integer. This assumption enables us to compute the average [· · · ] x 0 ,ξ,A . After computing this average, we take the limit n → 0 by employing the analytical continuation from n ∈ N to n ∈ R under the so-called replica symmetric (RS) or the replica symmetry breaking (RSB) ansatz, which will be explained later.
The second problem is solved by introducing a variable β and taking a limit as follows where where the factor i {(1 − c i )δ(x i ) + c i } is introduced to make the integral well-defined and can be regarded as a prior for x. In the limit β → ∞, only the contribution corresponding to the solution of the least squares problem in eq. (3) survives the integration, and H(c|β, y, A) → M y (c|y, A). We further assume that ν = µ/β is a positive integer as well as n in eq. (10). This enables us to treat in parallel the summation over c and the integration over x, as well as the average [· · · ] x 0 ,ξ,A . The limit β → ∞ ⇔ ν → 0 is taken after those operations through the analytic continuation. These operations can be summarised in a line g(µ|ρ) = lim with abbreviations Tr c = N i=1 Overall, to calculate the entropy, we assess the free entropy g. The averages over x 0 , ξ, A are taken through the replica method with a replica number n, and the internal variables x are integrated with an additional replica number ν.

Assumptions for theoretical computation
The description above is generic, but for technical reasons we need additional assumptions to complete the computation. The most crucial assumption is applied to the distribution of A: Each component of A is assumed to be i.i.d. from N (0, 1/N ). Relaxing this assumption, i.e. introducing correlations between components, makes the analysis much more complicated. Admittedly, this assumption is not necessarily realistic. However, the purpose of this paper is to provide an analytical basis to understand the variable-selection performance, and we consider that the random-matrix assumption can provide sufficiently nontrivial implications for this purpose.
Thus assuming the absence of correlations among components of A, we note that only the average behaviour of the signal components is relevant. According to this observation, without loss of generality, we may assume a factorised prior of the signal vector x 0 as where ρ 0 = K 0 /N is the density of nonzero components and P 0 (x) is a prior distribution for the nonzero component. Our theoretical computation can be performed for any prior P 0 (x) having no probability mass at x = 0, and we keep it unspecified for a while. Usually, the entropy function s( y ) enjoys some useful properties such as non-negativity, boundedness, bounded support, concavity, and analyticity. We assume these properties, but as shown below, the concavity and analyticity are partially broken in the present problem. This causes problems in evaluating the parametric form (9), but they can be bypassed by some additional considerations when conducting the saddle-point method. The analyticity breaking of the entropy is actually related to algorithmic performances and is one of the central issues discussed in this paper.

Probabilistic meaning and intrinsic hierarchy of the problem
Our formulation has a probabilistic meaning which involves two different intrinsic hierarchies in the present problem. We can define the distribution for c as Let us denote the average over P (1) by angular brackets as · · · c . This distribution is clearly conditioned by y and A. We also define another distribution for x given c as where This distribution is conditioned by c in addition to y and A. Hence, we denote the average over P (2) by · · · x|c . The simultaneous average over both P (1) and P (2) is denoted by double angular brackets · · · . Recalling that y and A are also random variables, we note that there are three different hierarchies of random variables: x is conditioned by c which is conditioned by y and A. This discrimination is a natural consequence of the structure of the present problem 1 .

Relationship to Bayesian inference using sparsity-inducing priors
Our formulation is related to a Bayesian framework. To demonstrate the relationship, we introduce the following prior distribution of x given c: where φ is the prior distribution of the nonzero components. As our purpose is to achieve a variable selection, i.e. choosing the best support c, the variable x can be treated as a hidden variable. Hence in the Bayesian framework, the most rational approach is to sample c from the following posterior distribution: where P (y|x, c) is our model distribution of data, which is derived through the noise distribution with variance σ 2 : and the prior distribution of c, P (c), is set to be a uniform distribution at a fixed K = N ρ This method is optimal when our parameters and model match the true generative process. This matching condition is called the Nishimori condition in physics. Performance at the optimality can be an issue to be studied further; however, in the present paper, we do not pursue this direction. Instead, we perform a maximum a posteriori (MAP) estimation for x by assuming a (un-normalised) flat prior φ(x) = 1, (∀x ∈ R). This yields thex(c|y) expression defined in eq.
Overall, the posterior distribution of c defined in eq. (15) can be regarded as a MAP estimation of eq. (19) in the Bayesian framework 2 . There are two reasons for treating the MAP estimator. The first reason is the computational cost of eq. (19). The computation of eq. (19) requires integration with respect to x, which is computationally expensive in general even if we employ versatile approximations such as the MC method. In contrast, the MAP estimator allows us to skip this integration and provides a reasonable estimator (3), which is relatively easy to compute. The second reason is the plausibility in matching our inference model with the true generative process, as also discussed in sec. 1. In many practical tasks for which our regression model is employed, it is not realistic to assume that we have precise knowledge regarding the generative process. Hence, certain mismatch between the inference and generative models is inevitable. This is a common criticism against applying the Bayesian approach to signal processing tasks. On the contrary, the MAP estimator with the uninformative flat prior φ(x) = 1 allows us to bypass this problem and can yield better performance than the Bayes estimator (19) in certain mismatching cases. These reasons naturally motivate us to investigate eq. (15) instead of eq. (19). 2 Insightful readers may doubt the probabilistic interpretation of P (x|c) because it is not possible to normalise P (x|c) due to the presence of flat prior φ. This inconvenience can be solved by replacing the flat prior with (1/ √ 2πL)e −x 2 i /(2L) and taking the limit L → ∞, although this causes another problem in model selection according to the marginal likelihood in the usual Bayesian framework. However in the following discussions, any model selection using this criterion is not performed and the entire treatment in the main text does not inherit any of these problems.

Related work
Here we give a brief summary of several preceding work treating related problems and make it clear how the present paper is similar to or different from those work.
In [41], Guo and Verdú studied a random linear estimation problem in the context of codedivision multiple access by using the replica method, as in [32]. Its MAP estimator was considered in [42] again by using the replica method. Reeves and Gastpar treated similar linear models in the variable selection context as in this paper: They computed the trade-off relation between the measurement rate (α = M/N in our notation) and a distortion quantifying an error rate in the reconstruction of the true support, and derived rigorous lower and upper bounds of the measurement rate to achieve given value of distortion and compared it with the replica result [43,44,45]. Another investigation by Reeves and Pfister succeeded to prove that the replica prediction is exact, by tightening the bounds under the assumption that the matrix A is i.i.d. from the normal distribution under the Bayes optimal setting [46]. In [47], the same problem was investigated by using the replica method except that the matrix A is drawn from rotationally invariant ensembles. These preceding results employing the replica method were conducted under the RS ansatz, and Bereyhi et al. examined the RSB ansatz and showed that the k-step RSB with small k can much improve the RS solution's inconsistency appearing when the MAP estimator with 0 -norm regularization is considered [48,49]. In [50], a similar problem with a restriction such that the signal components take only binary values was studied by using the improved second moment method. From a wider viewpoint, generic glassy natures of MAP estimators in inference problems were examined in [51] by considering a rank-one matrix estimation as an example; the so-called survey propagation, which is a variant of message-passing algorithms taking into account glassy natures, was reported to fail in improving the inference accuracy than the standard message-passing algorithm [16,17,18,19]. In [52], a similar model was considered in the context of reconstructing encrypted signals and investigated by using the replica method; the full-step RSB ansatz was applied and thus the result is expected to be exact and tight.
These study have a connection to this paper in the basic problem setting, but we stress that our present formulation is very different from all of them. We again note that in this paper all possibilities of the support are examined by computing the entropy curve, while the usual replica analysis treats certain specific supports or estimators only. All the preceding work referred in the previous paragraph fall into this case. For example, the estimator given in eq. (2), which is a subject of study in [14,42,43,44,45,47,48,49], corresponds to just one point in the entropy curve: The minimum y point given K. Meanwhile, the present formulation enables us to simultaneously compute all the other supports, or estimators optimizing the coefficients x on the chosen support. This leads to the distribution of the MSE which is nothing but the entropy curve. Hence, our formulation provides more global information about the problem, yielding deeper insights both in the information theoretic and algorithmic perspectives, as shown in [24] and below. A drawback of this is the analytical procedure much more complicated than the usual replica analysis, as explained in sec. 2.2.1.

Summary of order parameters
As shown below, the free entropy is characterised by a number of macroscopic order parameters and we summarise them here. The order parameters are defined as m is the overlap with the true signal x 0 and is relevant to the reconstruction performance of x 0 . R and Q describe the powers (per element) of the reconstructed signal, but the latter takes into account the "thermal" fluctuation that results from the introduction of β. These two quantities fall within the limit β → ∞, but their infinitesimal difference yields an important contribution This is O(1) even in the limit β → ∞. The last order parameter q directly reflects the fluctuation of the support vector c and exhibits the RSB in some parameter regions. Using these order parameters, the average value of the input MSE is where σ 2 x represents the typical power (per non-zero element) of the signal defined by The output MSE is computed from Eq. (9) once the free entropy is obtained in terms of the order parameters. Hence, we can naturally compute both the input and output MSEs in the present formalism.

RS solution
Postponing the details of analysis to sec. A, we present the resultant formulas of the free entropy and related quantities. The expression for g in the RS level is given by where for simplicity of notation we let Ω RS = χ, Q, q, m,ρ,χ,Q,q,m , Y RS 0 is obtained by substitutingm = 0 into Y RS m , and h RS 0 is defined similarly. The symbol Extr Ω denotes the extremisation condition with respect to Ω coming from the saddle-point method. This extremisation condition yields the following equations of state (EOS): Note that all tilde variables are conjugates of their respective variables and are introduced to expand delta functions with the Fourier transform as shown in sec. A. From eq. (32), we obtain some simple relations

RSB solution and the instability of the RS solution
The RS solution can be inaccurate when the configuration space of c exhibits spontaneous breaking into many locally separated components. In such a situation, the RSB ansatz should be adopted. The RSB solution is categorised by the level of the emerging hierarchical structure of the separation. In the simplest case called the 1st step RSB (1RSB) solution, only one level of hierarchy is taken and we examine this in the present paper. The 1RSB solution is actually sufficient to expose the instability of the RS solution and hence is sufficient to achieve the present purpose of obtaining implications to local search algorithms. We postpone the detailed derivation of the 1RSB solution to sec. A. The explicit 1RSB formula involving g is given as where τ is Parisi's breaking parameter, Ω 1RSB = χ, Q, q 1 , q 0 , m,ρ,χ,Q,q 1 ,q 0 ,m, τ , and By examining the 1RSB solution, we can determine the instability points of the RS solution. Empirically, two types of instabilities are known to appear in a wide range of systems: Global instability The RS solution is locally stable but there emerges another solution involving exponentially many metastable states, which induces the so-called random first-order transition (RFOT). We thus call the associated instability RFOT instability in this paper.
Local instability The local instability of the RS solution, which can be signaled by expanding the 1RSB solution with respect to ∆ 0 and observing its coefficient. This is also known as the de Almeida-Thouless (AT) instability.
According to these empirical facts, we derive two instability conditions below. The RFOT instability is known to emerge at τ = 1 and can be detected in an easy manner as follows. For τ = 1, we can identify q 0 andq 0 with q andq in the RS solution, respectively, because their EOS formally accord with each other. As well, the 1RSB EOS of all other order parameters except for q 1 andq 1 become identical to their corresponding RS EOS. Hence, we should compute q 1 andq 1 on top of the RS solution and examine whether the nontrivial solution q 1 = q 0 = q exists or not. The equations to be solved are written in terms of ∆ 0 and∆ 0 ≡q 1 −q 0 =q 1 −q as follows In these equations, we should readq 1 =q +∆ 0 in h 1RSB and Y 1RSB . The trivial RS solution ∆ 0 = 0 always exists and the question is whether a nontrivial solution ∆ 0 = 0 exists or not. Such a nontrivial solution is absent for the low µ region but is present at sufficiently large values of µ. The lowest value of µ for which the nontrivial solution exists defines the RFOT point µ RFOT . The AT instability is observed by examining the presence of a nontrivial solution around ∆ 0 = 0. This can be accomplished by expanding the right hand side of eq. (41) with respect to ∆ 0 up to the first order after inserting eq. (40) into∆ 0 . If the coefficient of the first-order term is greater than unity, a nontrivial solution emerges. This condition is written using the RS solution only, because the small ∆ 0 limit implies that the order parameters q 1 and q 0 of the 1RSB solution can be identified as q in the RS solution, and the corresponding tilde variables can also be identified. The explicit stability condition of the RS solution against the AT instability is given by This condition always holds for sufficiently low values of µ. The lowest µ value violating eq. (42) indicates the AT transition point µ AT . These two instabilities are known to affect the performance of local search algorithms. The origin of this affliction by the RFOT transition is clear-there emerge exponentially many local minima and thus the search/dynamics is easily trapped in one of those states; the typical trapping state will be the most numerous one and will be far from the true global minimum. Each trapping state in this case is separated by high energy barriers and hence escaping will take an exponentially long time [40]. Meanwhile, the influence of the AT instability is less trivial than the RFOT. According to the standard physical picture, when the AT instability occurs, the structure of the configuration space of c has many saddle-point-like structures, which leads to a complicated critical slowing down of the dynamics and thus the performance of local search algorithms will be strongly degraded. However, this degradation will be less serious compared to that caused by the RFOT transition, because in the AT case it is considered that there exist certain directions along which the system can escape from a local saddle point. This may take a long time, as the energy landscape will be very flat along the escape directions owing to the saddle-point nature, but will be shorter than the RFOT case, where an exponentially long time is required. These descriptions have actually been supported by numerical simulations of some metaheuristic algorithms in several optimisation problems [53,54,55,56,57,58]. We, however, stress that there are still many unclear points about the dynamics around the AT instability and further investigations using concrete algorithms, such as Monte-Carlo methods or message-passing algorithms, are desired.
As will be seen in the next section, we have several characteristic regions depending on the parameters ρ and σ 2 ξ . In some regions, the RSB transitions induced by AT and RFOT instabilities occur, which makes it difficult for the system's dynamics to converge to the equilibrium distribution. In other regions, the RS solution is always stable when µ is changed. However, the RS-stable regions are separated into two small regions, one has no phase transitions and thus the metaheuristic algorithm can work well, and the other has another 1st order transition which prevents the algorithm from approaching the global minimum. These descriptions will be actually confirmed by numerical experiments of a metaheuristic algorithm in sec. 4.2.

Some simple limits
To check our replica results, we summarise some simple solutions obtained at particular limits below.
High temperature solution A trivial solution in the high temperature limit, T = µ −1 → ∞, is derived from eq. (32). From simple algebra based on eqs. (32,33), we obtain Accordingly, Using the relationρ = − log(ρ/(1 − ρ)), we have where H 2 (ρ) is the binary entropy, giving a reasonable result. The local stability of this solution can be checked by substituting eqs. (43,44) into eq. (42), which yields This always holds in the meaningful setup of ρ ≤ 1 and ρ ≤ α, and hence the high temperature solution is stable.
Perfect reconstruction in the noiseless limit Of particular interest for the noiseless limit σ ξ = 0 is whether we can achieve the perfect reconstruction of x 0 . We call the corresponding solution the perfect reconstruction (PR) solution, which is defined by Note that the support components outside the true support may take unity: c i can be 1 even if |x 0i | 0 = 0. This is because the coefficientsx(c) outside the true support become automatically zero as a result of the optimization if all the components of the true support is covered as eq. (47), leading to the vanishing MSEs. In other words, the PR solution always exists if the estimated nonzero density is greater than or equal to the true one, ρ ≥ ρ 0 . This seemingly indicates that it is safer to overestimate the nonzero density ρ. This is, however, not necessarily true because larger estimates of ρ tend to involve unfavourable phase transitions, as shown below.
The output MSE y of the PR solution is zero and the entropy becomes We can actually find this PR solution in the limit µ → ∞ of our RS formula (32). To derive this, we have to carefully treat the scaling of V + q and ∆ RS within that limit. We first assume the following scaling: The consistency of this assumption is confirmed after the computation. Using this and eq. (33), we can easily determine the following limits These relations mean that Y RS m diverges or vanishes depending on the values of x 0 and z while Y RS 0 converges to e −ρ . The vanishing region of Y RS m is expressed in terms of z as As we have assumed (V + q) → 0 (µ → ∞), this vanishing region rapidly shrinks and does not provide any meaningful contribution. Hence, we may treat Y RS m as a diverging factor in all contributing regions. This consideration yields, from eq. (32e): Additionally, some algebraic operations from eqs. (32,33) lead to where which yields a finite contribution in the limit µ → ∞. The divergence of Y RS m implies that The precise scaling of the right hand side strongly depends on the choice of the prior distribution ; if the signal strength is a constant at P 0 (x 0 ) = δ(x 0 − C) with C = 0, then (V + q) exponentially decays as µ increases. Similar calculations apply to ∆ RS , which involves the same scaling of ∆ RS as (V + q). These scalings are consistent with the assumed ones (49).
Summarising these calculations in conjunction with eqs. (32,33), we determine that This implies that the free entropy g coincides with the entropy in the limit µ → ∞. Substituting the scalings obtained so far into eq. (27), we find that The limiting behaviours of ∆ RS and V + q cause the input MSE to vanish: Hence, the PR solution is successfully derived. The local stability (42) of this PR solution should be checked. Inserting eqs. (50,53) into eq. (42) results in the stability condition Hence, the PR solution is stable as long as ρ 0 ≤ ρ ≤ α where the PR solution exists. Furthermore, this stability is considered to be rather "robust". It is physically reasonable that the RS solution is stable even for ρ < ρ 0 if ρ is sufficiently close to ρ 0 because the inequality (60) is safely satisfied even at ρ = ρ 0 . This will be demonstrated in the phase diagram shown below. Note that eq. (60) is just a necessary condition and not a sufficient one. The stability of the RS solution at and around the PR solution has an algorithmic implication to the 0 -minimisation approach [59,60,61]. Namely, local search algorithms such as the message-passing algorithm will not be degraded by the rugged energy landscape if the initial condition is sufficiently close to the PR solution and hence the predictions based on the RS computation will be precise.

Entropy curve and phase diagram
To obtain a concrete result, in the following we set the prior distribution of the nonzero component as a Gaussian: Owing to this assumption, the double integrations with respect to x 0 and z in eq. (32) can be merged into a Gaussian integration by a variable transform (mx 0 + √q z) → q +m 2 σ 2 x z. The EOS (32) should be appropriately transformed. Except for eq. (32i), this can be accomplished by neglecting the integration dx 0 P 0 (x 0 ) and reading in the EOS and Y RS m . Eq. (32i) involves the cross term of x 0 and h RS m so some special care is needed. A slight calculation yields Furthermore, the signal variance σ 2 x is set to be σ 2 x = 1/ρ 0 to fix the per-element signal power to unity, ρ 0 σ 2 x = 1.

Noiseless case
We start from the noiseless limit σ ξ = 0. Fig. 1 shows the plot of entropy s( y ) versus y for several different values of ρ. Other parameters are common and are set as α = 0.5 and ρ 0 = 0.2. The drawn curves are based on the RS solution, and the RSB solution is only used to indicate the respective instability point. As seen from Fig. 1, we observe four different characteristic Figure 1: Entropy plotted against the output MSE y , which is represented by the solid black line, for α = 0.5 and ρ 0 = 0.2 in the noiseless limit σ ξ = 0. The nonzero density is different among the four panels: ρ = 0.04 < ρ AT (upper left), ρ AT < ρ = 0.18 < ρ 0 (upper right), ρ 0 < ρ = 0.3 < ρ SP (lower left), and ρ SP < ρ = 0.45 (lower right). In the last case, there are two entropy curves denoted by the solid and dashed dotted lines, which correspond to two different phases, and the inset is a close-up around ( y , s( y )) = (0, s PR ). Broken and dotted lines denote the high temperature solution (45) and the PR solution (48), respectively.
behaviours of s( y ). For small ρ, the AT instability occurs at small y , where the full-step RSB will be needed to correctly describe that region. This RS unstable region vanishes for larger ρ, defining a critical value ρ AT (< ρ 0 ). In the region ρ AT < ρ < ρ 0 , the RS solution is stable for the entire y region having the nonnegative entropy s( y ) ≥ 0. A somewhat surprising fact in this region is that the entropy crisis (EC), s( y ) = 0, occurs at a finite critical value of µ EC (< ∞).
As it approaches ρ 0 , this critical value µ EC (ρ) diverges and the entropy curve is continuously connected to the PR solution in the region ρ 0 ≤ ρ. For a wide range of ρ(≥ ρ 0 ), the RS solution is again stable for the entire region of y . However, at larger values of ρ, there emerges a new phase transition, defining another critical value, ρ SP . For ρ SP ≤ ρ, the PR solution is detached from the high temperature limit, and there are two different branches for the small y or large µ region. Two critical temperature points are accordingly defined -the spinodal point T SP at which the low-temperature branch connected to the PR solution vanishes, and the first-order transition point T F at which g values of the two branches coincide. To locate the corresponding critical temperatures at a given ρ(> ρ SP ), we plot g against T = 1/µ in Fig. 2. As a reference, the output MSE y is also plotted against T around the critical temperatures in the right panel.
The presence of the first-order phase transition implies that the simple SA algorithm with a rapid annealing schedule will fail to find the PR solution in ρ SP ≤ ρ. The system's dynamics goes along the branch connected to the high temperature limit and cannot move to the PR solution. Actually, this strongly affects the SA performance, as the values of input MSEs x are quite different between the two branches, despite the output MSEs y both being small, as shown in the right panel of Fig. 2. This will be demonstrated in sec. 4.2.
Summarising the above findings, we draw a phase diagram in the ρ-T plane for α = 0.5 and ρ 0 = 0.2 in Fig. 3. Note that the entropy-crisis line T EC (ρ) below the AT line T AT (ρ) has no direct physical consequence because the RS solution is unreliable in that region. The exact entropy-crisis line would be derived by the full step RSB solution, but this is beyond the scope of the present paper. Fig. 3 implies that finding the ground state is easy in the region ρ AT < ρ < ρ SP , while it is difficult in other regions: ρ ≤ ρ AT and ρ SP ≤ ρ. We focus on how the easy region behaves when changing the external parameters α and ρ 0 . We examine the parameters and find that the easy region shrinks as ρ 0 increases against a fixed α and finally vanishes at a certain critical value of ρ 0 . As an example, the ρ-T phase diagram for α = 0.5 and ρ 0 = 0.3 is shown in Fig. 4. The spinodal and first-order transition lines cover the entire ρ 0 < ρ region and hence the easy region disappears, implying that local search algorithms cannot find the PR solution for any ρ in this case. This vanishing of the easy region thus defines the algorithmic limit. By searching all parameter regions of ρ 0 and α, we can draw a phase diagram of the algorithmic limit, which is shown in Fig. 5. The performance of our formulation is clearly better than the 1 relaxation [14] and is competitive with the Bayesian result [21]. This implies that the present formulation, which can be regarded as a MAP estimation in the Bayesian framework, does not significantly lose its reconstruction performance despite discarding the signal source information. This is encouraging the use of the present formulation in the context of signal recovery and is one of the main results of this paper.

Noisy case
A new behaviour specific to the noisy case is the presence of the RFOT transition for strong noises at middle values of ρ. As an example, the entropy curves for σ 2 ξ = 10 are given in Fig. 6. Two critical ρ values, ρ AT and ρ RFOT , accordingly emerge. For ρ ≤ ρ AT , the AT instability first occurs as the temperature decreases. For ρ AT < ρ ≤ ρ RFOT , the RFOT begins to emerge above the AT instability temperature. For larger ρ, the RS solution is accurate for all temperature regions. The RS EC occurs at finite T in that region, which is somewhat similar to the Ising perceptron problem [37,38]. Note that ρ RFOT does not exist if the noise is sufficiently weak.
Summarising the above findings, we show phase diagrams for three noise strengths, σ 2 ξ = 0.001, 0.1 and 10 in Fig. 7. The first-order transition in the noiseless case is quite fragile and disappears for very weak noise as seen in the left panel of Fig. 7. We attempted to capture the critical values of σ ξ for the disappearance, but it proved numerically difficult and we did not pursue this point. As the noise increases, the phase boundary's shape expand more and more from the noiseless limit and the RSB region, ρ < ρ RFOT , seems to grow. We consider whether  The red solid line is the algorithmic limit derived here. Two other boundaries, the blue solid and red broken ones, indicate the 1 relaxation derived in [14] and the Bayesian inference shown in [21], respectively. Our result, which is regarded as an MAP approximation of the Bayesian inference, is competitive with the Bayesian result.  this RSB region will cover the low-temperature region completely as the noise is very large. To answer this, we tested the no signal case ρ 0 = 0 and σ 2 ξ = 1 and observed that ρ RFOT takes a value similar to the one in the right panel of Fig. 7. Hence, an RS region exists even in the strong noise case, which is consistent with our previous analysis in the data compression context [24].

Numerical simulations 4.1 Monte Carlo evaluation of entropy curves
Here we examine the analytical results by comparing with the numerical simulations. Our simulations calculate the free entropy by the exchange Monte Carlo (MC) sampling [62]. Estimation of the free entropy g is accomplished by using the multi-histogram method [63].
Our MC sampling is based on the Metropolis criterion, where an MC move c → c is accepted according to the probability During the update, we would like to keep the non-zero components density ρ = i c i /N constant. return c 13: end procedure each system at every temperature point. The exchange of every pair of neighboring temperature points is conducted after every 1/N MCS, which is a rather frequent exchange than conventions.
In all simulations, we set α = 0.5, ρ 0 = 0.2 and ρ 0 σ 2 x = 1. The configurational average is calculated by taking the median over 1000 different samples of (x 0 , ξ, A). The error bars are estimated by the Bootstrap method. The examined system sizes are N = 30, 40, · · · , 100. The equilibration is checked by monitoring the convergence of all measured quantities (g, s, y ) to stable values by changing the total MCSs; for reference, we note that 256×10 2 MCSs are needed for equilibration when N = 100, ρ = 0.3, σ 2 ξ = 10. For burn-in, the first half of the total MCSs is discarded.

Simulation in noiseless case
The free-entropy values evaluated by numerical simulations and the extrapolation to the infinite size limit of the noiseless case σ 2 ξ = 0 are presented in Fig. 8. The extrapolation lines result from linear regression using an asymptotic form g ≈ a + bN −1 + cN −1 log N −1 . The regression is conducted by applying the least squares method as follows: This asymptotic form is based on Stirling's formula and is exact at µ = 0, which motivates us to use the form even when µ = 0. The same asymptotic form is used for obtaining the extrapolated values of the output MSE y and the entropy s. Using these values for the limit N → ∞, we present the curves g(T ) and s( y ) in Fig. 9. The lines represent the RS analytical results. The circles represent the extrapolated values obtained from the numerical results. The extrapolated values show fairly good agreement with the RS analytical ones, justifying our analytical results. For the case of ρ = 0.1, the AT instability occurs at T ≈ 0.04, but even below this temperature, the agreement between the RS analytical result and the numerical one is fairly good, suggesting a weak RSB effect on s and y . This is, however, not the case for the input MSE x , as demonstrated in sec. 4.2 below.

Simulation in noisy case
A similar analysis for the case of strong noise (σ 2 ξ = 10) is performed and the results are shown in Fig. 10 and Fig. 11.
These figures again demonstrate good agreement between the RS and numerical results as long as the RSB does not occur. Below the RSB transition point, we observe a deviation between them, as shown in the left upper panel of Fig. 11. In such a situation, generally speaking, the RS entropy curve can be regarded as an upper bound of the entropy values [64], although a meaningful difference is not observed in the right upper panel of Fig. 11. Again, the RSB effect on s and y appears to be weak.
We have a noteworthy remark to make on the EC phenomenon for ρ = 0.3. We stress that this EC phenomenon can be described in the RS level and occurs at a finite temperature.   This may be somewhat surprising for readers familiar with other models exhibiting similar EC phenomena, because in most of such systems the 1RSB treatment is needed to describe the EC phenomenon. We note that the energy levels of the present system around the ground state can be very dense and the energy gap between the ground and excited states can be extremely small, which can be argued by the fact that the PR solution in the noiseless case has numerous degeneracies for ρ > ρ 0 . This gap is supposed to vanish in the N → ∞ limit, presumably enabling the RS EC phenomenon to appear at a finite µ. The agreement between the RS and numerical results strongly argues in favour of this description. Note that the EC phenomenon at small values of ρ is in a different situation and its RS description is not accurate. This is because the AT instability occurs at higher temperatures in that region and hence the full step RSB treatment is needed. This is in contrast to the large-ρ region in which no instability occurs at higher temperatures than T EC .

Monte Carlo-based optimisation and its performance
The SA is a metaheuristic solver of generic optimisation problems based on the MC method. A variant of the SA for the present problem was proposed in [26] and its performance was examined in a limited parameter region of the present synthetic model and in a real astronomical dataset [26,27]. We re-examine this over a wider range of parameters to provide more quantitative information.
Our SA algorithm is summarised in Alg. 2. The lines marked with # are not necessarily return c 13: end procedure needed for SA, but have been inserted for later convenience. In Alg. 2, we have a set of inverse temperature points {µ a } Lµ a=1 arranged in ascending order (0 =)µ 1 < µ 2 < · · · < µ Lµ ( 1) and the waiting times {τ a } a at those points. Hence, as the algorithm proceeds, the temperature of the system T = 1/µ decreases step by step. It is theoretically guaranteed that if the schedule of the decreasing temperature is slow enough, then the SA can find the optimal solution [65]. However, the guaranteed schedule is usually overcautious and in many practical situations we may choose a faster one. The actual schedule examined below consists of L µ = 200 temperature points chosen as µ a = 0.02 · a (a = 1, · · · , 50) 10 0.04·(a−50) (a = 51, · · · , 200 = L µ ) , The first linear region of the schedule is simply inserted for visibility in the plots shown below and the important point is that the schedule is exponentially increasing as a grows. The final temperature is very low, T Lµ = µ −1 Lµ = 10 −6 . The waiting time at each temperature point is kept constant, at τ a = τ (∀a), for simplicity. We show below that this rapid schedule works very efficiently for a wide range of parameters and discuss that the performance is closely related to the system's property at equilibrium, which was already calculated in sec. 3.
A noteworthy remark applies to the computational cost of this SA algorithm. This cost can be formally written as O(L µ τ N C MC ), where the last factor is the computational cost of each MC update. The most expensive operation is the matrix inversion required to calculate the energy of the output MSEs. If we use simple multiplication and Gauss elimination in the inversion process for each step, then C MC = O(M (N ρ) 2 + (N ρ) 3 ). However, we employ pair flipping in each update and the change in the relevant matrices in each update is small and successive. Using this fact and the matrix inversion formula, the total cost of each MC update can be reduced to (N 2 αρ), as explained in [26]. Hence, if L µ and τ do not scale with N and can be kept constant, the total computational cost is O(L µ τ N C MC ) = O(L µ τ αρN 3 ) and is scaled as the third order polynomial of the system size N . This is comparable with the versatile algorithms solving the 1 relaxation and thus the present algorithm solves the 0 problem with fairly reasonable computational cost. The assumption of constant L µ and τ is not trivial, but it appears to be correct, i.e. sufficient to find the PR solution in the successful region, in the region we have numerically searched. Hence, we adopt this constant assumption below.  ρ = 0.45 for the left, middle, and right panels, respectively. The numerical results agree well with the black solid line representing the RS solution connected to the high temperature limit in all cases. The middle panels show the successful region for finding the PR solution, ρ 0 < ρ < ρ SP , and the vanishing x means that we actually find the PR solution. The right panels are in ρ SP < ρ, meaning that the search is trapped in the metastable state connected to high temperatures. The SA result follows the high-temperature branch and cannot reach the low-temperature one denoted by the blue dashed dotted line. Overall, the SA experiments demonstrate that our theoretical predictions are very precise, and the presence of phase transitions strongly degrades the SA's performance in finding the minimum-MSE configuration.
Noisy case Next, we show the results of the noisy case. The SA results for the strong noise case σ 2 ξ = 10 at α = 0.5 and ρ 0 = 0.2 are given in Fig. 13. As seen from the figure, the MSEs show good agreement to the analytical curve (black solid line) up to the transition points for the left and middle panels. An exceptional deviation is observed at low temperatures in the upper right panel, but this is considered to be a finite-size effect because the range of y in this region is very small and supposedly unreachable by N ≈ 100 systems. Hence, these behaviours are very consistent with the analytical predictions that the system's dynamic behaviour is affected by the RSB transitions and ceases to follow the equilibrium state.
The effect of the RSB on the reconstruction performance becomes much clearer by examining the achievable-limit values of the MSEs for a moderate noise case. Fig. 14 shows the plots against ρ of the limit values obtained at very low temperatures by the rapid SA with τ = 5 for σ 2 ξ = 0.1. The x values at middle ρ values are clearly dominated by the ones at the AT transition points rather than the ones at the EC points, implying that the system's search is trapped by local minima emerging at the transition points. An interesting outcome of this phenomenon is a better reconstruction of the planted signal x 0 . As seen from Fig. 14, the x values at the AT points are lower than the ones at the RS EC points, implying that the reconstruction performance of the local minima induced by the RSB is better than that of the minimumy configurationĉ by solving eq. (2). This means that the generalisation capability of the rapid SA is no worse than exactly solving eq. (2) in this case because the input MSE x is proportional to the generalisation error when each component of the design matrix and the noise is i.i.d. from the zero-mean Gaussian. This encourages the use of the presented formulation and algorithm for practical purposes, as reported in [27].

Conclusion
In this study, we have analytically provided an algorithmic limit of an 0 -based formulation of compressed sensing through evaluation of the entropy using statistical mechanical techniques. The results are mainly characterised by the ratio of number of observations α, the nonzerocomponents density of the inference and generative models ρ and ρ 0 , and the strengths of the signal and noise σ x and σ ξ . The entropy curves and the associated phase diagrams have been provided, and their implications to local search algorithms have been discussed. Quantitative analysis of the noisy cases has also been conducted. To validate the analytical computation, we have performed a careful MC simulation using the exchange MC method and the multi-histogram method, the results of which have exhibited fairly good agreement with the analytical results. To test the theoretical predictions on local search algorithms, we have also performed numerical experiments using the SA-based algorithm. The performance of the SA is well understood through the phase diagrams. Over a wide parameter region in the noiseless case, we have actually located the PR solution with reasonable computational cost of O(N 3 ) although, in other hard regions, the RSB and the RS first-order transitions prevent rapid convergence of the SA to the PR solution.
To overcome the problems caused by phase transitions, it may be interesting to tailor new algorithms based on MC methods. The idea of extended ensembles can be useful: It will be a promising future work to invent an algorithm relaxing the hard constraint on the nonzero com-ponents density, by introducing a "chemical potential" softly controlling the density. The idea of multi-canonical sampling such as the Wang-Landau algorithm [66] may also be an interesting direction. By extending the work in [51], it is also worth trying to take into account glassy natures in algorithms in certain ways.
Relaxing the i.i.d. random-matrix assumption on A is also an interesting problem. This is even practical because in an ideal situation of compressed sensing, a new observation should be conducted along with maximising the resultant information, implying that all rows of A should be orthogonal. Considering such orthogonal ensembles in the presented framework is a high-priority issue which should be done in near future.

Acknowledgement
These summations over c and x are now calculated over all the replicated variables {c a } n a=1 and {x aα } a=1,··· ,n,α=1,··· ,ν . Let us set The variable d is an extensive sum of random variables and can be expressed by an appropriate sum of Gaussian variables with a certain covariance. The covariance is expressed by Evaluating this full description is difficult in general. Instead, we consider more amenable subspaces which are described by the RS or RSB ansatz.

A.1 RS computation
In the RS ansatz, the dominant contribution is assumed to come from the following subspace: and hence the covariance matrix is described by four order parameters and one external parameter ρ 0 σ 2 x . The corresponding useful description of d aα µ is where u, v and z are i. i. d. from N (0, 1). The average with respect to A thus can be replaced by the average over u, v and z, yielding where I is the subshell (the state density) characterised by the above four order parameters such that and L describes where we merge two Gaussian variables (z, ξ) into z and where we keep only the linear term with respect to n at the first line and take the limit ν → 0 while keeping βν = µ at the last line and applying β(R − Q) = χ according to eq. (24). Summarising eqs. (78,80), we obtain . (81) Evaluation of I requires additional algebra. We break the delta functions by using the Fourier transform as follows The replica indices have been superscripts so far but we rewrite them as subscripts for visibility. Using the Gaussian integrals, we break the sum a<b into a single replica sum as eq a<b α,β cac b xaαx bβ = eq 2 ( a α caxaα) 2 − a ca( α xaα) 2 = Dze √q z a α caxaα−q 2 a ca( α xaα) Hence we can take ca=0,1 for each a independently where The sums α<β and ( α x α ) 2 can also be broken where and X 0 is obtained by insertingm = 0 in Xm. This relation means that the average over x 0 in eq. (83) is actually not needed, which comes from the absence of correlations among components of the design matrix A. This validates the use of the factorised prior (14).
Applying this rescaling and keeping only the linear term with respect to n, we get Combining eqs. (81,94) and using the saddle-point method yield eq. (27).

A.2 1RSB computation
In the 1RSB level, the overlap between c a and c b with different a, b = 1, · · · , n should be treated in a more involved manner. In the standard 1RSB construction, the n replicas are separated into n/τ blocks of the size τ . We may label the blocks by B = 1, · · · , n/τ , and the replica index a is represented by two indices as a = (B a , I a ) where I a denotes the component index inside the block. For simplicity, we identify the index set of those components with the block label B a , allowing us to represent them as I a ∈ B a . The overlap q aα bβ = (1/N ) i c a i c b i x aα i x bβ i is assumed to take two discriminative values depending on whether the replica indices a, b are in the same block or not. This can be written as Correspondingly, d aα µ can be expressed as where u, v, w, and z are i.i.d. from the normal distribution. Then, eq. (67) is rewritten as where h 2 (u, v, w, z) = R − Qu + Q − q 1 v + √ q 1 − q 0 w + ρ 0 σ 2 x + σ 2 ξ − 2m + q 0 z.
L 1RSB can be computed by recurring Gaussian integrations as in the RS case, and the result is The computation of I 1RSB can also be performed in parallel with the RS case. The delta functions of common variables with the RS case are broken in the same manner as eq. (82), and the ones of q 1 and q 0 are broken similarly to eq. (82d). These transforms yield where where we insert X BI = α x BIα . Repeating similar computations, in the leading order of n we finally get To derive this, when taking the limit ν → 0, we have rescaledq 1 andq 0 as ν 2q 1 →q 1 , Other tilde variables are rescaled in the same manner as eq. (93).