Sparse approximation based on a random overcomplete basis

We discuss a strategy of sparse approximation that is based on the use of an overcomplete basis, and evaluate its performance when a random matrix is used as this basis. A small combination of basis vectors is chosen from a given overcomplete basis, according to a given compression rate, such that they compactly represent the target data with as small a distortion as possible. As selection methods, we study the ℓ0- and ℓ1-based methods, which employ the exhaustive search and ℓ1-norm regularization techniques, respectively. The performance is assessed in terms of the trade-off relation between the representation distortion and the compression rate. First, we evaluate the performance analytically, in the case that the methods are carried out ideally, using methods of statistical mechanics. Our result clarifies that the ℓ0-based method greatly outperforms the ℓ1-based one. Second, we examine the practical performances of two well-known algorithms, orthogonal matching pursuit and approximate message passing, when they are used to execute the ℓ0- and ℓ1-based methods, respectively. Our examination shows that orthogonal matching pursuit achieves a much better performance than both approximate message passing and even the exact execution of the ℓ1-based method. However, regarding the ℓ0-based method, there is still room to design greedy algorithms more effective than orthogonal matching pursuit. Finally, we evaluate the performances of the algorithms when they are applied to image data compression.


I. INTRODUCTION
Information processing based on the sparseness of various data is an active area of research. This sparseness means that data are typically expressed by a small combination of non-zero components when a proper basis is used. The significance of sparseness for information processing had already begun to be noted when principal component analysis was invented in 1901 [1]. Low-rank approximation of a matrix is known to be a useful method of collaborative filtering for recommendation systems [2][3][4]. In neuroscience, the sparse-coding hypothesis has gradually been accepted as a way of elucidating visual and auditory systems [5][6][7][8][9][10]. Recent interest in information processing with sparse data has been triggered by compressed sensing, since it was demonstrated that ℓ1-norm minimization can give exact solutions in a reasonable time, under appropriate conditions [11][12][13][14].
In this study, we discuss sparse data processing from a different viewpoint, namely that of sparse approximation. Sparse approximation refers to the process of representing target data by a small number of non-zero elements, the purpose of which is to achieve a better trade-off relation between the representation distortion and the compression rate [15][16][17][18][19][20][21][22][23][24].
We adopt a strategy of sparse approximation that utilizes an overcomplete basis (OCB); an OCB is also called a frame in the field of signal processing. An OCB contains more basis vectors than the dimension of the target data, which means that a smaller and better-suited set of basis vectors may be chosen to compactly express the data. Therefore, in terms of the trade-off relation, the OCB-based strategy is expected to outperform naive strategies such as random projection.
For selecting basis vectors from an overcomplete basis, we discuss the ℓ0- and ℓ1-based methods, which employ the exhaustive search and ℓ1-norm regularization techniques, respectively. Our adoption of these methods is motivated by their application in compressed sensing [25,26]. Focusing on the trade-off relation, we evaluate the performance of sparse approximation from two different viewpoints. First, we theoretically analyze the ideal performance that is achieved when the ℓ0- and ℓ1-based methods are performed exactly, by using methods of statistical mechanics. We regard the distortion and the compression rate as the thermal averages of physical quantities derived from partition functions. In the large-system limit, these are assessed by the replica method and the saddle-point method [27,28]. In order to validate the results of our analysis, we extrapolate physical quantities in the limit from finite-size results obtained using the exchange Monte Carlo method [29,30] and quadratic programming. Second, we investigate the practical performance of the OCB-based strategy.
We examine the performances of two well-known algorithms, orthogonal matching pursuit [31,32] and approximate message passing [33], when they are employed to approximately execute the ℓ0- and ℓ1-based methods, respectively. We also apply the approximate algorithms to a task of image data compression and evaluate their performances, as a practical example.
The rest of this paper is organized as follows. In section II, we set up the problem of sparse approximation that we will focus on, and explain the ℓ0- and ℓ1-based methods and related work. In section III, we analyze the ideal performances of these methods, in terms of the trade-off relation. In section IV, we discuss the practical performance of the OCB-based strategy, and its application to image data. In section V, we conclude this paper.

II. PROBLEM SETTING
A. Sparse approximation using a random overcomplete basis

Given a data vector y ∈ R^M and a compression rate r, the purpose of sparse approximation is to obtain a compressed representation x ∈ R^N using a basis matrix A = (a_1, . . . , a_N) ∈ R^{M×N}, while keeping the representation distortion ǫ as small as possible.
The compression rate r is defined as the ratio of the number of non-zero components of x to the dimension of the data vector. That is,

r = ||x||₀ / M,  (1)

where ||·||₀ denotes the so-called ℓ0-norm of a vector. The ℓ0-norm represents the number of non-zero elements of a vector, defined as ||v||₀ = Σ_i |v_i|⁰, where |v_i|⁰ is equal to 0 (v_i = 0) or 1 (v_i ≠ 0). We measure the distortion using the mean squared error, as

ǫ = (1/M) ||y − Ax||₂²,  (2)

where ||·||₂ is the ℓ2-norm of a vector, defined by ||v||₂² = Σ_i v_i². Note that this representation distortion measures how closely a data vector y is described by a sparse representation x with a given basis A; it is different from the reconstruction error often used in the fields of signal processing and compressed sensing to measure the distance between an original sparse signal x_0 and an estimated sparse representation x̂. For our purpose of an analytical evaluation of ǫ, we consider the case where the elements of the data vector y are independently and identically distributed (i.i.d.) random variables drawn from the normal distribution with mean 0 and variance σ_y², denoted by N(0, σ_y²). The elements of the basis matrix A are also i.i.d. random variables from N(0, M⁻¹). Then, the matrix A is almost surely of rank min(M, N), and the distortion becomes a random variable.
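The two figures of merit are immediate to compute. A minimal numpy sketch for the Gaussian setting just described (function names and the example sizes are ours, not the paper's):

```python
import numpy as np

def compression_rate(x, M):
    """Compression rate r: number of non-zero components of x over the data dimension M."""
    return np.count_nonzero(x) / M

def distortion(y, A, x):
    """Representation distortion: mean squared error (1/M) * ||y - A x||_2^2."""
    return np.sum((y - A @ x) ** 2) / y.shape[0]

# The Gaussian setting of the analysis: y_i ~ N(0, sigma_y^2), A_ij ~ N(0, 1/M).
rng = np.random.default_rng(0)
M, N, sigma_y2 = 64, 128, 1.0
y = rng.normal(0.0, np.sqrt(sigma_y2), M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))

# For the all-zero representation the distortion is simply the empirical variance of y.
print(compression_rate(np.zeros(N), M))   # 0.0
print(distortion(y, A, np.zeros(N)))      # close to sigma_y^2 = 1 for large M
```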
If N = rM, the minimization of (2) is nothing but the method of least squares (LS), and the corresponding compressed vector is easily obtained as

x̂ = A⁺ y,  (3)

where A⁺ is the pseudoinverse of A, given by

A⁺ = (AᵀA)⁻¹ Aᵀ.  (4)

Let us call this the naive method, which is illustrated in figure 1 (a). In the large-size limit M → ∞, the corresponding distortion converges to

ǫ_naive = (1 − r) σ_y²  (5)

with probability one. In general, in the limit M → ∞ certain random variables, such as ǫ, have the so-called self-averaging property, and almost surely converge to their average values. This enables us to present a clear discussion, and hereafter we focus on this limit. On the other hand, for N > rM we have many options in choosing a combination of rM basis vectors from the matrix, as illustrated in figure 1 (b). If the chosen combination is more suitable for representing the data vector than one chosen randomly, then the distortion becomes smaller than ǫ_naive. This is the idea behind the OCB-based strategy.
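As a baseline, the naive method is a few lines of numpy. The sketch below (our notation) checks numerically that, for a tall random A with exactly N = rM columns, the LS distortion lies near (1 − r)σ_y², the value obtained by projecting y onto a random rM-dimensional column space:

```python
import numpy as np

rng = np.random.default_rng(1)
M, r, sigma_y2 = 2000, 0.5, 1.0
N = int(r * M)                            # naive method: no freedom of choice, N = rM

y = rng.normal(0.0, np.sqrt(sigma_y2), M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))

# Least squares: x_hat = A^+ y with the pseudoinverse of the tall matrix A.
x_hat = np.linalg.pinv(A) @ y
eps_naive = np.sum((y - A @ x_hat) ** 2) / M

print(eps_naive)   # fluctuates around (1 - r) * sigma_y2 = 0.5
```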
However, this strategy presents the problem of how to choose the combination of basis vectors. We investigate the performances of the ℓ0- and ℓ1-based methods.
B. Methods

ℓ0-based method
The basic idea of the ℓ0-based method is to minimize the distortion by choosing the best combination of rM basis (column) vectors from a given OCB. More generally, we would like to define the distortion as a function of the chosen combination of basis vectors, and to control it in a simple manner. This motivates us to introduce a binary vector c ∈ {1, 0}^N to store the information on whether each basis vector is chosen (c_i = 1) or not (c_i = 0). We also introduce a distortion labelled by c,

ǫ(c|y, A) = min_x (1/M) ||y − A(c ◦ x)||₂²,  (6)

where ◦ is the Hadamard product of two vectors, defined as (v ◦ w)_i = v_i w_i. In addition, we define an entropy function s(ǫ|y, A) to represent the number of configurations c that give a value of ǫ for the distortion, as follows:

s(ǫ|y, A) = (1/M) ln #{c : ǫ(c|y, A) = ǫ, ||c||₀ = rM},  (7)

where # denotes the number of elements of the set that follows it.
This entropy function is expected to be analytic and convex upward with respect to ǫ, and cannot be negative, by definition. A typical shape of the entropy is depicted in figure 2.
There are two zero points of the entropy function; the smaller and larger ones are denoted by ǫ₀ and ǫ₊, respectively. The smaller zero point ǫ₀ of the entropy function, s(ǫ₀) = 0, gives the minimum value of the distortion,

ǫ₀ = min_{c: ||c||₀ = rM} ǫ(c|y, A).  (8)

Hence, our original motivation for introducing the ℓ0-based method, to find the minimum distortion attained by the best combination of basis vectors, can be achieved through the evaluation of the entropy function. In addition, the evaluation of the entropy function is easier than the direct evaluation of ǫ₀; moreover, the entropy function provides more information about the space of the variables c, which can be useful for practical applications such as designing algorithms. Thus, the entropy function s(ǫ) is the primary object of our analysis of the ℓ0-based method. A similar analysis has been proposed for examining the weight space structure of multilayer perceptrons [34].
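For sizes small enough to enumerate every support, the ℓ0-based method can be executed literally, which is useful for checking intuition (the paper instead evaluates ǫ₀ via the entropy function). A brute-force sketch, with all names ours:

```python
import numpy as np
from itertools import combinations

def eps_c(y, A, support):
    """Distortion eps(c | y, A): least squares restricted to the chosen columns."""
    As = A[:, list(support)]
    x_s, *_ = np.linalg.lstsq(As, y, rcond=None)
    return np.sum((y - As @ x_s) ** 2) / y.shape[0]

def l0_exhaustive(y, A, k):
    """Best support of size k = rM by exhaustive search (combinatorial cost!)."""
    N = A.shape[1]
    return min(((eps_c(y, A, s), s) for s in combinations(range(N), k)),
               key=lambda t: t[0])

rng = np.random.default_rng(2)
M, N, k = 8, 16, 4                  # r = 0.5, alpha = M/N = 0.5; C(16, 4) = 1820 supports
y = rng.normal(0.0, 1.0, M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
eps0, best = l0_exhaustive(y, A, k)  # eps0: the minimum distortion over all supports
```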

ℓ1-based method
The ℓ0-based method is the most closely matched to the original idea of the OCB-based strategy. However, its algorithmic realization, searching over combinations of basis vectors, is computationally inefficient, because it requires an exponentially growing computational cost as the system size N increases. In practical situations, a method based on ℓ1-norm regularization can be employed instead of the ℓ0-based method. This motivates us to examine the following ℓ1-based method.
Our ℓ1-based method arises from the following minimization problem:

ξ̂ = argmin_x { (1/2) ||y − Ax||₂² + λ ||x||₁ },  (9)

where ||·||₁ is the ℓ1-norm of a vector, defined as ||v||₁ = Σ_i |v_i|, with the absolute value denoted by |·|. The solution of this minimization problem, ξ̂, provides useful information for finding the compressed vector we desire. This minimization problem is equivalent to the least absolute shrinkage and selection operator, also known as LASSO [35]. The main benefit of the approach represented by (9) is the computational ease of performing the minimization. As the objective function of (9) is convex, its minimization can be carried out exactly with a computational time of O(N³), using versatile algorithms of quadratic programming. Furthermore, the ℓ1-norm term in (9) has a sparsifying effect on ξ̂, and its coefficient λ is adjusted according to the compression rate. Namely, λ is chosen so that ||ξ̂||₀ / M = r. Our aim in the analysis of the ℓ1 case is to evaluate the distortion resulting from ξ̂, whose expression is given by

ǫ₁ = (1/M) ||y − A ξ̂||₂².  (10)

An inconvenience presented by this distortion is that it is not minimized on the set of basis vectors chosen by ξ̂, owing to the presence of the ℓ1-norm term. In order to remove this extra distortion, we determine the values of the non-zero components again by purely minimizing the distortion, after the support estimation of the compressed vector by the ℓ1-norm regularization. This procedure is described as follows:

ǫ_1^LS = ǫ(|ξ̂|⁰ | y, A) = min_x (1/M) ||y − A(|ξ̂|⁰ ◦ x)||₂²,  (11)

where |·|⁰ of a vector is defined by (|v|⁰)_i = |v_i|⁰. This can be carried out by the method of LS for the sub-matrix of A that is composed of the columns corresponding to |ξ̂_i|⁰ = 1. These two quantities, ǫ₁ and ǫ_1^LS, are the objects of our analysis in the ℓ1 case.
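The two-stage procedure, ℓ1 support estimation followed by LS re-fitting on the estimated support, can be sketched with a plain proximal-gradient (ISTA) solver standing in for quadratic programming; the 1/2 normalization of the objective and all names are our assumptions:

```python
import numpy as np

def ista_lasso(y, A, lam, n_iter=3000):
    """ISTA for min_x (1/2)||y - A x||_2^2 + lam * ||x||_1 (a simple LASSO solver)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L, with L the gradient Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))     # gradient step on the quadratic part
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return x

def ls_refit(y, A, xi):
    """Re-solve by LS on the support chosen by the ell_1 step (removes shrinkage bias)."""
    support = np.flatnonzero(xi)
    x_ls = np.zeros_like(xi)
    if support.size:
        x_ls[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
    return x_ls

rng = np.random.default_rng(3)
M, N = 32, 64
y = rng.normal(0.0, 1.0, M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))

xi = ista_lasso(y, A, lam=0.1)
eps_1 = np.sum((y - A @ xi) ** 2) / M
eps_1_ls = np.sum((y - A @ ls_refit(y, A, xi)) ** 2) / M   # never larger than eps_1
```

The re-fit can only lower the distortion, because the ℓ1 solution restricted to its own support remains a feasible candidate for the restricted LS problem.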

C. Related Work
The problem of sparse approximation has been studied widely in the fields of signal processing, statistics and information theory. Sparse approximation involves searching for an optimal small combination of given basis vectors, and it was proved to be NP-hard [15]. In our setting, we seek a linear combination of a given number of basis vectors to approximate a given signal with as small a representation distortion as possible [23]. This setting is also called N-term approximation [17,18,36]. As stated after equation (2), our purpose is to minimize the distortion in describing a given signal by a sparse representation.
Note again that this distortion is different from the reconstruction error used to measure the distance between an original sparse signal and an estimated signal from scarce data in compressed sensing. Our motivation is similar to that of rate-distortion theory for lossy data compression in information theory [37].
We investigate the performances of the ℓ0- and ℓ1-based methods in solving the sparse approximation problem. In the ℓ0 case, the exhaustive search is by definition the optimal method for obtaining the most suitable representation. A major contribution of this paper is the theoretical analysis of the exhaustive search method for the sparse approximation problem, by using methods of statistical mechanics.
In order to reduce the computational cost of the ℓ0-based method, several greedy algorithms have been proposed. Orthogonal matching pursuit (OMP) is a well-known greedy algorithm [31,32]. Approximation bounds for OMP have been proved and theoretically improved by previous studies [21-23, 38, 39]. On the other hand, ℓ1-based methods relying on convex relaxation, such as basis pursuit [40] and LASSO [35], are known to be useful. The problem of sparse approximation allows a distortion in the compressed representation, though it should be small, and we evaluate the performance of the method of ℓ1-norm regularization, which is equivalent to LASSO. In this paper, we are also interested in comparing greedy algorithms with the convex-relaxation approach.

III. ANALYSIS OF IDEAL PERFORMANCE
A. Analytical treatment in the limit M → ∞

We investigate the limit M → ∞, as stated above. For this purpose, we employ some tools of statistical mechanics, which provide useful assistance in investigating this limit. According to the terminology of statistical mechanics, we call the limit M → ∞ the thermodynamic limit, and the average over y and A the configurational average, which is denoted by [·]_{y,A}.
In taking the limit M → ∞, the aspect ratio of the basis matrix, α = M/N, is fixed.

ℓ0-based method
A versatile technique of statistical mechanics is to introduce a generating function Z of an energy function H, called a partition function. This defines a canonical distribution p.
In the ℓ0 case, we define the energy function, partition function, and canonical distribution, respectively, as follows: where d^c x_i is equal to dx_i (c_i = 1) or 1 (c_i = 0). The parameter β denotes the inverse temperature corresponding to the method of LS, and the limit β → +∞ will be taken in accordance with the execution of the method. The parameter µ denotes the inverse temperature corresponding to the support estimation, and plays an important role in combining the entropy and the distortion values to depict the entropy curve shown in figure 2. The energy function is related to the distortion of a given basis-vector choice c as follows: The cumulant-generating function φ₀(µ|y, A) is obtained from Z₀ by

φ₀(µ|y, A) = (1/M) ln Z₀(µ|y, A),  (16)

and is connected to the entropy (7) by the Legendre transformation in the large-M limit, as

φ₀(µ) = max_ǫ { s(ǫ) − µǫ }.  (17)

The maximization problem of (17) must be solved on the well-defined region of s, which requires appropriate bounds, namely the minimum value ǫ₀ and the maximum value ǫ₊ of the distortion. Overall, we can calculate the object of our analysis, s(ǫ), through the inverse Legendre transformation, once we have obtained φ₀. Therefore, we turn our attention to the calculation of φ₀.
The cumulant-generating function has the self-averaging property, as does the entropy, and we assess its configurational average, given by We employ the replica method in order to calculate this average; a detailed analysis is provided in Appendix A. Though it is possible that the correct solution to the ℓ0-based method breaks replica symmetry (RS), the result under the RS ansatz is given by where extr_Θ{·} denotes the operation of extremization with respect to Θ, with Θ̂₀ = {Q, χ, q, r̂, Q̂, χ̂, q̂}. By applying the extremization condition, we obtain the following equations of state (EOSs): where we write ∆ = Q − q. From the EOSs, we obtain some simple and general relations, which we summarize here for later convenience: The relation involving the entropy, (17), enables us to employ a convenient parametric form of ǫ(µ) and s(µ) = s(ǫ(µ)), and (21,22) allow us to simplify ǫ(µ), as The explicit form of s(µ) is not enlightening, and therefore we omit it. As the value of µ is increased from µ = 0, the point (ǫ, s) moves along the entropy curve from the summit (µ = 0) in the direction of decreasing distortion (µ > 0), as shown in figure 2. When the entropy curve crosses the zero-entropy line at µ = µ₀, the minimum distortion is given by Here, we make a technical remark on the derivation of (19). In contrast to the usual prescription of the replica method, we require two different replica numbers for the present analysis, because we have two different integration variables, x and c, in the calculation of φ₀. Using (16,18), and introducing a variable ν = µ/β, we can rewrite φ₀(µ) as In the last line, we use the replica identity [ln X]_{y,A} = lim_{n→0} (1/n) ln [X^n]_{y,A}. We identify n and ν as the two replica numbers, and assume that they are natural numbers, which enables us to expand the powers and to calculate the configurational average.
The remaining calculations follow the usual procedure of the replica method, and we assume the RS ansatz for the order parameters. Our present framework for calculating φ₀ is actually similar to the one-step replica-symmetry-breaking (1RSB) ansatz. In this identification, ν plays the role of the 1RSB breaking parameter (usually written as m), and each configuration of c corresponds to a pure state in the 1RSB free-energy landscape; the entropy can then be regarded as the complexity.
The analytical results obtained on the basis of the RS assumption will be justified later, through a comparison with numerical calculations.
2. ℓ1-based method

a. Derivation of ǫ₁

Similarly to the case of the ℓ0-based method, the energy function, partition function, and canonical distribution of the ℓ1 case are defined respectively as The parameter µ denotes the inverse temperature corresponding to the support estimation by the method of ℓ1-norm regularization. The parameter κ is an auxiliary variable introduced to analyze the compression rate r, and the limit κ → 0 is taken at the end. The energy function H₁ is exactly the object minimized in (9). We also introduce the averaged free-energy density, given by which plays the role of the cumulant-generating function, as φ₀ does in the ℓ0 case.
In the limit µ → ∞, the minimizer of the energy function becomes dominant in p₁, and we focus on this limit. Any quantity of interest can be calculated from f₁. For example, the compression rate r and the distortion ǫ₁ are calculated as An analytically compact form of f₁ is assessed by using the replica method in the limit M → ∞, through the replica identity, as As in the ℓ0 case, we assume the replica-symmetric solution. The details of the necessary calculations are presented in Appendix B. The result is given by where erfc(·) is the complementary error function. The extremization condition gives the following EOSs for the present case: By using (31,32), we obtain In addition, a simple formula is derived from the EOSs of (35) in the limit of κ → 0, and a useful relation, which is similar to (22d), is provided by (35,36).
b. Derivation of ǫ_1^LS

We also evaluate ǫ_1^LS, as defined in (11). The computations are rather technical, and we defer the details to Appendix B. Here, we present an outline of the analysis and the result.
Again, we use the energy function defined in the ℓ0 case, but here the argument is |ξ|⁰, determined by p₁(ξ). Thus, we obtain Since the vector ξ is drawn from p₁, we calculate the average value of (1/M)H₀(|ξ|⁰) over p₁, in addition to the configurational average. Taking the limit µ → ∞ and then β → ∞, we obtain the desired distortion ǫ_1^LS as follows: By utilizing the replica method again, we can calculate this. We defer the details of the calculations to Appendix B, and here write down the resultant formula: One point to remark on is that we should not take the extremization condition with respect to Θ̂₁ = {P, χ_p, P̂, χ̂_p} in this expression. Instead, we should substitute the extremizer of (34) into it. Applying the extremization condition with respect to Θ̂_LS gives From the EOSs, we can obtain the following simple relations: We now make some comments regarding the derivation of (42). In order to calculate the configurational average, we are required to deal with two different factors, Z₁ in p₁ = (1/Z₁)e^(−µH₁), and the logarithm in H₀. Correspondingly, as in the ℓ0 case, we introduce replicas of two different kinds: n replicas to handle 1/Z₁, and ν replicas to handle the logarithm. Using them, we can rewrite (41) as It is now possible to calculate the configurational average by assuming that n and ν are natural numbers.

ℓ0-based method

We perform numerical simulations on finite-size systems using the exchange Monte Carlo method [29,30], and then estimate the cumulant-generating function φ₀ using the multi-histogram method [42].
In all simulations, we set α = 0.5 and σ 2 y = 1. We treat two values of r equal to 0.
The asymptotic form is based on Stirling's formula and is exact at µ = 0, which motivates us to use this form even for µ ≠ 0. The cumulant-generating function and entropy density in

ℓ1-based method
Similarly to the case of the ℓ0-based method, we examine the analytical results of the ℓ1-based method by performing numerical simulations on finite-size systems. We carry out the ℓ1-norm regularization using quadratic programming, and evaluate the distortion before the method of LS, ǫ₁; the distortion after the method of LS, ǫ_1^LS; and the compression rate r.
The values of α and σ_y² are fixed as α = 0.5 and σ_y² = 1 for all simulations. We treat two values of λ, equal to 1 and 2. We calculate (9) and (11).

FIG. 4. Relation between the rate and the regularization coefficient in the ℓ1-based method.

C. Comparison in the trade-off relation
We compare the ideal performance in the M → ∞ limit of the different methods in terms of the trade-off relation between the representation distortion and the compression rate. Figure 5(a) shows the trade-off relations in the case of α = 0.5. We see that both of the OCB-based methods achieve a better trade-off relation than the naive one. Within the OCB-based strategy, the ℓ0-based method significantly outperforms the ℓ1-based one, even if the method of LS is operated after carrying out support estimation by the ℓ1-norm regularization. We attribute the inferiority of the ℓ1-based method to the regularization term. Indeed, as shown in figure 5(b), the regularization term is necessary to decrease the rate, but it distorts the original purpose of minimizing the distortion, as clearly seen from (27). It is also important to know the behavior in the limit α → 0, or more quantitatively, how ǫ is scaled by α in the small-α limit.
Deferring the detailed calculations to Appendices A 2 and B 2, here we summarize our analytical results on the behavior of ǫ in the limit α → 0. The asymptotic behaviors of ǫ₀ and ǫ_1^LS are examined using numerical solutions of the corresponding EOSs, (21,43), in figure 7. Our analytic formulas show excellent agreement with the numerical results.
We stress the consequences of (47)-(49). First, they give a firm indication that it is reasonable to apply the method of LS after the ℓ1-norm regularization, which is heuristically employed in related problems, such as compressed sensing, in practical situations. The difference between (48) and (49) indicates that the method of LS actually diminishes the distortion, and even eliminates it in the ideal limit α → 0, which never happens with the use of ℓ1-norm regularization alone. Second, (47) provides a general bound for the computational cost of searching for the appropriate basis vectors. From (47), given a target value ǭ of the distortion and a data vector of length M, the required size N_req(ǭ, M) of the basis matrix to achieve this distortion value is scaled as This grows in a polynomial manner as the target distortion value ǭ decreases, and the exponent of the polynomial becomes more negative as the compression rate r decreases. This quantitative information will provide a theoretical basis for designing algorithms. Finally, (49) manifests the limitation of the ℓ1-based method. The size N_req required to achieve the target distortion ǭ in this case is scaled as This grows exponentially as ǭ decreases, which is considered to be reasonable: if it were polynomial, versatile algorithms exactly solving the ℓ1-norm regularization could be applied to solve the problem with a computational cost of polynomial order in the system size and the precision, which is believed not to be possible. However, (51) can still be useful, because it provides a quantitative comparison between the data size M and the acceptable distortion ǭ in a unified manner.

A. Algorithms and their performances
A lot of computational time is required to conduct the exhaustive search used in the ℓ0-based method. However, certain greedy algorithms are considered to work well for practical applications. Orthogonal matching pursuit (OMP, figure 8) is a greedy algorithm that may be suitable for the present purpose [31,32]. OMP only requires a computational time of order O(M⁴) for the current purpose. We compare the performance of OMP with the ideal performances of both the ℓ0- and ℓ1-based methods.
In addition to OMP, we also examine approximate message passing (AMP), as a representative algorithm for carrying out the ℓ1-norm regularization. From the viewpoint of quadratic programming, ℓ1-norm regularization can be solved exactly using versatile algorithms, which require a computational time of order O(M³). In contrast, AMP only requires a computational time of order O(M²) per update. Despite the low computational cost, AMP is known to be able to reproduce the results of those versatile algorithms in certain reasonable situations [33].
Input: a data vector y, a basis matrix A, a rate r. Initialization: Iteration: repeat from n = 1 until n = rM: Output: a compressed vector x̂ = x^(rM).
FIG. 8. The procedure of OMP. ∅ is the empty set. supp(·) is the support set.
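The OMP procedure of figure 8 can be transcribed directly; the sketch below (variable names ours) re-fits all selected coefficients by LS at every step:

```python
import numpy as np

def omp(y, A, k):
    """Orthogonal matching pursuit for a target support size k = rM.

    Each iteration adds the column most correlated with the residual,
    then re-fits all selected coefficients by least squares."""
    support, residual = [], y.copy()
    for _ in range(k):
        corr = np.abs(A.T @ residual)
        corr[support] = -np.inf                 # never pick the same column twice
        support.append(int(np.argmax(corr)))
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ x_s
    x = np.zeros(A.shape[1])
    x[support] = x_s
    return x

rng = np.random.default_rng(4)
M, N = 64, 128
y = rng.normal(0.0, 1.0, M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
x_omp = omp(y, A, k=M // 2)                     # r = 0.5
eps_omp = np.sum((y - A @ x_omp) ** 2) / M
```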
Input: a data vector y, a basis matrix A, a regularization coefficient λ, a tuning parameter δ. Initialization: Iteration: repeat until convergence at n = n̄: 1 + χ^(n−1), r^(n) = (1 − Q̂) r^(n−1) + Q̂ (y − A x^(n−1)), h = A^T r^(n) + Q̂ x^(n−1). Output: a compressed vector x̂ = x^(n̄).

FIG. 9. The procedure of AMP.

The present case, where the basis matrix A and the data vector y are generated from i.i.d. normal distributions, is expected to be one such situation. Hence, we can fairly compare the result of AMP with the ideal performance of the ℓ1-based method, and therefore with that of OMP.
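An AMP-style sketch for the ℓ1 problem with an i.i.d. Gaussian basis matrix is given below. We use a fixed-soft-threshold form with an Onsager correction; the threshold θ stands in for the (λ, χ, Q̂) bookkeeping of the pseudocode above, with an effective regularization λ = θ(1 − ||x̂||₀/M) at a fixed point. This is our simplification, not the paper's exact update, and convergence is not guaranteed without tuning:

```python
import numpy as np

def soft(u, theta):
    """Soft-threshold function eta(u; theta) = sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp_soft(y, A, theta, n_iter=300):
    """AMP iteration with a fixed soft threshold, for A with i.i.d. N(0, 1/M) entries.

    At a fixed point this solves the ell_1-regularized problem with an effective
    lambda = theta * (1 - ||x||_0 / M); theta is tuned to hit the target rate."""
    M, N = A.shape
    x, z = np.zeros(N), y.copy()
    for _ in range(n_iter):
        x = soft(x + A.T @ z, theta)
        z = y - A @ x + z * (np.count_nonzero(x) / M)   # Onsager correction term
    return x

rng = np.random.default_rng(5)
M, N = 128, 256
y = rng.normal(0.0, 1.0, M)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
x_amp = amp_soft(y, A, theta=1.3)
eps_amp = np.sum((y - A @ x_amp) ** 2) / M
```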
We evaluate the performances of OMP and AMP when they are employed for sparse approximation with the OCB-based strategy. We examine the case with σ 2 y = 1 and α = 0.5. Figure 10 presents the results of the performance evaluations of OMP and AMP.

B. Application to image data
We investigate the performance of sparse approximation when it is applied to a task of image data compression. We compress image data composed of 256 × 256 pixels. The experimental procedure of compression is as follows. First, the image data are normalized so as to set the mean and variance to 0 and 1, respectively. Next, the 256 × 256 pixels are randomly permuted, in order to obtain 1024 column vectors whose dimension is 64. Following these operations, the data can be regarded as random numbers with a mean and variance of 0 and 1, which brings the properties of the data close to the situation that we have already studied theoretically and numerically. Finally, setting r = 0.5, we compress each of the column vectors into a compressed vector by using a 64 × 128 random matrix, namely α = 0.5. We examine the performances of OMP and AMP. When applying AMP, we set the regularization coefficient to 0.65, so that r ≈ 0.5, and the method of LS is operated after the support estimation by the ℓ1-norm regularization. The results of the experiments are presented in figure 11. Although OMP requires a computational time several times larger than that of AMP, OMP outperforms AMP in terms of appearance and the peak signal-to-noise ratio (PSNR), defined as

PSNR = 10 log₁₀ [ 255² / ( (1/N) Σ_{ij} (I_ij − Î_ij)² ) ],

where I = {I_ij} and Î = {Î_ij} represent an original image and a compressed image, respectively, and N is the number of image pixels.
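The experimental pipeline (normalize, permute, blockify, compress each column, undo the permutation, score by PSNR) can be sketched as follows; we use a small random array as a stand-in for the 256 × 256 image, and a compact OMP as the per-column compressor (all names ours):

```python
import numpy as np

def omp(y, A, k):
    """Compact OMP used as the per-column compressor."""
    support, res = [], y.copy()
    for _ in range(k):
        c = np.abs(A.T @ res)
        c[support] = -np.inf
        support.append(int(np.argmax(c)))
        xs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        res = y - A[:, support] @ xs
    x = np.zeros(A.shape[1])
    x[support] = xs
    return x

def psnr(I, I_hat, peak=255.0):
    """Peak signal-to-noise ratio: 10 log10(peak^2 / MSE)."""
    return 10.0 * np.log10(peak ** 2 / np.mean((I - I_hat) ** 2))

rng = np.random.default_rng(6)
img = rng.integers(0, 256, (16, 16)).astype(float)   # stand-in for a 256x256 image

# 1. Normalize to zero mean and unit variance.
mu, sd = img.mean(), img.std()
z = (img.flatten() - mu) / sd
# 2. Randomly permute the pixels and reshape into 64-dimensional columns.
perm = rng.permutation(z.size)
cols = z[perm].reshape(-1, 64).T                     # here, 4 columns of dimension 64
# 3. Compress each column with a 64 x 128 random basis at r = 0.5 (alpha = 0.5).
M, N = 64, 128
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
rec = np.column_stack([A @ omp(cols[:, j], A, M // 2) for j in range(cols.shape[1])])
# 4. Undo the permutation and the normalization, then score by PSNR.
z_hat = np.empty_like(z)
z_hat[perm] = rec.T.flatten()
img_hat = (z_hat * sd + mu).reshape(img.shape)
print(psnr(img, img_hat))
```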
If the scope of application is limited to image data compression, more suitable bases, such as a discrete wavelet transform, will achieve much better results in both performance and computational time [43,44]. However, in general contexts it is not easy to find a proper basis for sparse approximation in advance. A solution to this problem is to use blind compressed sensing and related techniques such as dictionary learning [45][46][47], but their computational costs are rather high. Our OCB-based strategy may overcome this difficulty, because it avoids learning the dictionary by preparing many candidate basis vectors and choosing a suitable combination. Our theoretical analysis and numerical experiments positively support this possibility.

V. CONCLUSION
In the present paper, sparse-data processing has been discussed from the viewpoint of sparse approximation. We have focused on a strategy of sparse approximation that is based on a random OCB, and have discussed the use of the ℓ0- and ℓ1-based methods. We have analyzed the ideal performances of these methods in the large-system limit in a statistical-mechanical manner, which has been validated by numerical simulations on finite-size systems and their extrapolation to the infinite-size limit. Our results have indicated that the ℓ0-based method outperforms the naive and ℓ1-based methods in terms of the trade-off relation between the representation distortion and the compression rate. A notable result is that an arbitrarily small distortion is achievable for any finite fixed value of the compression rate, by increasing the degree of overcompleteness, for both the ℓ0- and ℓ1-based methods. This result allows us to determine both the theoretical limit of the OCB-based strategy and the limit for practical algorithms based on the ℓ1 regularization. In addition, it provides a firm basis for the use of the method of LS after the ℓ1 regularization, which is frequently applied in related problems, such as compressed sensing, in practical situations.
In addition to the ideal performance analyzed in section III, we also investigated the practical performance of our strategy in section IV. We evaluated the performances of OMP and AMP as algorithms to approximately perform the ℓ 0 -and ℓ 1 -based methods, respectively.
Our evaluation showed that OMP surpasses both AMP and the exact execution of the ℓ1-based method, in terms of the trade-off relation. This suggests that greedy algorithms are more suitable for sparse approximation using our strategy than convex-relaxation algorithms, although there is still room to design greedy algorithms more effective than OMP. We are currently undertaking further research in this direction.
We considered the application of our method to image data compression, as a practical example, and evaluated its performance when OMP and AMP are utilized. OMP outperforms AMP in appearance and PSNR, although OMP requires a computational time that is several times larger. In order to efficiently decrease the computational time of our strategy, it is important to find a proper basis. This suggests the use of some prior knowledge in constructing the overcomplete basis. Some further possibilities, such as combining our methods with dictionary learning, are still open, and would be interesting to address in future work.
2. The limit α → 0 in the ℓ0 case

We examine the behavior of the zero point of the entropy, ǫ₀, in the limit of a large basis matrix, α → 0. The parameter µ₀ corresponding to the zero point ǫ₀ can be formally written using (23,24), as A numerical calculation indicates that µ₀ → ∞ as α → 0, while Q̂, q̂, Q, q, χ ∼ O(1) are kept finite. We will determine the scalings of the relevant variables for α → 0 so as to agree with these observations. A crucial observation from (21d) is that the factor Y should vanish, in order to cancel the vanishing α, yielding where we introduce an exponent ρ controlling the divergence speed of r̂ and µ₀. Since we assume the divergence of µ₀, ρ must be larger than unity. The value of ρ is determined by solving (A7) in a self-consistent manner. The scaling of the remaining order parameter χ̂ is determined by Now, we know all of the scalings of the order parameters, and can reduce (A7) to its dominant part, as 2ρ r ln α + ln(1 + χ + µ₀∆).
Then, through the formula we obtain (42).
From (36) and the asymptotic formula of the complementary error function erfc(·), we see that in the limit α → 0 we have which is realized by controlling λ as O(√|ln α|). Using these scalings, and the asymptotic expansion of the complementary error function for large θ in (35d), we obtain By inserting these scalings into (38), we obtain (48).
The asymptotic form of ǫ LS 1 can be similarly obtained. Following some lengthy but straightforward calculations, we obtain By substituting these scalings into (44c), we obtain (49).