On the universality of noiseless linear estimation with respect to the measurement matrix

In a noiseless linear estimation problem, the goal is to reconstruct a vector from the knowledge of its linear projections . There have been many theoretical works concentrating on the case where the matrix is a random i.i.d. one, but a number of heuristic evidence suggests that many of these results are universal and extend well beyond this restricted case. Here we revisit this problem through the prism of development of message passing methods, and consider not only the universality of the -transition, as previously addressed, but also the one of the optimal Bayesian reconstruction. We observed that the universality extends to the Bayes-optimal minimum mean-squared (MMSE) error, and to a range of structured matrices.


Introduction
The problem of recovering a signal through the knowledge of its linear projections is ubiquitous in modern information theory, statistics and machine learning. In particular, many applications require to reconstruct an unknown n−dimensional signal vector x * from the linear projections where y is a m-dimensional vector, and Φ is a m × n random matrix. For instance, if x * is sparse, this task of estimating the signal from its linear random projections is at the roots of compressed sensing [1]. A fundamental question in the field is how much the algorithmic and the information theoretic performance depends on the choice of the random matrix Φ.
In the present letter, we concentrate on the noiseless and asymptotic, large n, regime with a fixed value α = m/n. We consider x * to be k-sparse, i.e. to have only k non-zero values, and we shall work in the limit where n → ∞, k → ∞, and a finite value of ρ = k/n. In such setting, a classical result is the following: for random matrices Φ with independent standard Gaussian entries, the (convex) reconstruction with 1 penalty displays a precisely determined phase transition. For a certain region in the (α, ρ)-phase diagram, it typically finds back the vector x * , being the sparsest solution, whereas outside that region, it typically fails. The obundary between these two regions is called the Donoho-Tanner line [2]. It has been shown empirically that the very same phase transition location seems to hold for a wider range of random matrix ensembles, see e.g. [3,4], suggesting a large universality of the Donoho-Tanner phase transitions. Another line of work showed that the convex 1 reconstruction problem can be treated through conic geometry, and the success probability of signal recovery only depends on a geometric number characterizing a subcone (statistical dimension or Gaussian width) [5,6].
Here we investigate the universality of the phase transition not only for the 1 transition, but also to the performance of the optimal Bayesian reconstruction. We analyze this question through the prism of information theory, message passing methods, and random matrix theory. We shall see that the universality indeed extends to a more generic set of properties than the 1 transition, such as the minimum mean-squared (MMSE) error or the easy-hard phase transition for optimal Bayesian learning, and empirically to structured matrices such as the one appearing in [7,8].
We note that investigation of universality are very common to physics problems, and understanding how large is the class of model for which a given result applied is a very fundamental question. The message-passing-based algorithm that we investigate in this paper to demonstrate the universality also has their origin in pysics works, such as [9].

A short review of results for i.i.d. random matrices
A first well-understood case of universality holds for random matrices Φ where all the elements are generated i.i.d. from a well-behaved distribution -with zero mean and unit variance-which all exhibit the same transitions as Gaussian random matrices. This is known for multiple retrieval problems:

1 recovery
Consider for instance the Donoho-Tanner line [2] that regulates the 1 recovery. Thanks to the approximate message passing solver (see below) that has been shown to be universal with respect to all i.i.d. distributions with finite moments [10,11], we know that the Donoho-Tanner phase transition is the same for all such random matrices.

Information theoretic optimal reconstruction
There has been a considerable amount of work in the information theory community on the computation of the mutual information and on the MMSE for problems such as (1) with Gaussian matrices. In particular, following the replica method from statistical physics (the Tanaka formula [12]), a heuristic formula has been postulated in different situations, see e.g. [13][14][15][16]. This heuristic replica result has been recently rigorously proven in a series of papers [17,17,18]. In a more recent proof [19], it has been shown, again, that the formula is not specific to Gaussian i.i.d. matrices, but that any matrix with i.i.d elements of unit variance and zero mean leads to the same exact result for the mutual information and the MMSE.

Hard phase for Bayesian decoders
A third interesting point is to ask about tractacle decoders that aim to perform the optimal Bayesian estimation, i.e. with a perfect prior knowledge on the distribution of x * . For simplicity, consider for instance the case where each element of x * has been sampled from a Gauss-Bernoulli distribution: In this case, the best known solver is again AMP, using a Bayesian decoder (instead of the soft thresholding function for 1 recovery) [14,15,20,21]. Interestingly, it shares with the 1 recovery a similar phase transition: for a certain region in the (α, ρ) plane it typically finds back the vector x * , whereas outside that region it fails. We shall denote the limit between these regions the "Bayesian hard-phase" transition. The "Bayesian hard-phase" line, that has been precisely computed in [14,15] is always better than the Donoho-Tanner line (as it should, since it exploits additional information). Once more, the universality of AMP shows that this phase transition is not restricted to Gaussian matrices, but extends as well to all (well normalized) i.i.d. matrices.
The fact that these three properties (the 1 , the hard-phase line, as well as the MMSE) are universal for all i.i.d. matrices makes the case for Gaussian computations, as done in theoretical computation, stronger. We shall see that this universality extends well beyond these simple cases.

Random rotationally invariant matrices
Moving away from the well-known i.i.d. examples, we start by considering a much larger set of random matrices defined through their singular value decomposition (SVD): any real matrix Φ can be decomposed into Φ = U ΣV , with U and V orthogonal matrices, and Σ's elements being Φ's singular values. We shall look at the left rotationally invariant random matrix ensemble: these are matrices Φ that can be written as with an arbitrary rotation matrix U and singular values Σ, but where the matrix V has been randomly (and independently of Σ and U ) generated from the Haar measure (that is, uniformly from all possible rotations).
When the singular values are different from zero, it is straightforward to justify the universality property for matrices from this subclass. We start by the definition of the problem: we wish to find x such that If m ≤ n, then Σ is written as Σ = Σ 0 and we define Multiplying (2) on both sides by U T , and then by Σ inv ; one reaches whereṼ is a m × n matrix composed of the first m lines of V . If instead m > n, Σ is written as In both cases, we thus see that the problem has been transformed -in a constructive way-into a standard linear system with the sensing matrixṼ when m ≤ n being a (sub-sampled) random rotation one, or sensing matrix V when m > n. This shows that all rotationally invariant matrices, which satisfy U and Σ's independence on V , can be transformed the same way and are in the same universality class as far as noiseless linear recovery is concerned, i.e. they will display the same phase transitions. Since Gaussian i.i.d. matrices belong among random rotationally invariant matrices (in this case Σ follows the Marcenko-Pastur law [22]) this means that all the information theoretic rigorous results (such as phase transitions and MMSE value) with zero noise for random Gaussian matrices applies verbatim to all rotationally invariant ensemble, as long as the SVD's matrices U and Σ are independent of V . This is a very strong universality, that applies to the three cases (1, 2, 3) from sec. 2. Note that the universality of the Donoho-Tanner line with rotationally invariant matrices was already hinted by the replica method [23].
Notice, however, that the above construction depends crucially on the fact that we consider here noiseless measurements. It would not work if an additional Gaussian noise were added in eq. (1): in this case, the transformation would make the i.i.d. Gaussian noise a correlated one. Indeed, the replica formula for noisy measurements underlines that the MMSE depends on the precise set of matrices in noisy reconstruction [13,24] (this formula is not yet fully rigorous, but see [25] for a proof in a restricted setting). Any differences, however, must go to zero in the noiseless limit.

Approximate Message Passing
Having discussed the universality with respect to random rotationally invariant matrices, we now wish to discuss its effect on specific solvers, concretely the message passing algorithms.

AMP
We first consider the original approximate message passing (AMP) [26] to compute the phase transition between the phase where the algorithm reconstructs x * perfectly, and the one where reconstruction may be possible but is not achieved by the algorithm. AMP is an iterative algorihm that follows: where t is the iteration index, x t is the current estimate of x * , z t the current residual, · is an averaged sum of components, and η t is a prior-dependent threshold function applied component-wise (the soft thresholding for 1 , or the Bayesian decoder [14,15]). One of the most interesting features of AMP is that, if Φ is a Gaussian i.i.d. matrix, its mean squared error (MSE) σ t can be tracked accurately by the state evolution formalism [10,11,26]. State evolution is a relatively simple recursive equation: where the expectation is with respect to independent random variables Z ∼ N (0, 1) and X, whose distribution coincides with the empirical distribution of the entries of x * . Analyzing the evolution of this equation for the 1 decoder yields the Donoho-Tanner line [26], while using the Bayesian decoder it yields the hard-phase line for Bayesian decoding [14]. It would be interesting to use AMP for rotationally invariant matrices. In order to do this, we follow the construction of sec. 3: starting from equation (3) we then multiply by Σ 0 , a m × m diagonal matrix with singular values sampled from Marcenko-Pastur law (singular values of a Gaussian i.i.d. matrix ‡), and U 0 a m × m Haar-generated orthogonal matrix, thus ensuring that Σ 0 and U 0 are generated independently of V : After this transformation, Φ = U 0 Σ 0Ṽ is a random matrix that belongs to an ensemble very close to the Gaussian i.i.d. matrices ensemble. In fact, a recent work showed that ‡ The singular values of a Gaussian matrix are correlated, so in fact we may want to generate Σ 0 by first generating a random Gaussian matrix, and then calculating its singular values.
AMP applied to a Gaussian matrix follows the same state evolution as matrices such as Φ where U 0 ,Ṽ are uniform orthogonal matrices and Σ 0 diagonal's elements are singular values sampled from the Marcenko-Pastur law [27]. Combining this result with the matrix transformation, we have thus constructively mapped the noiseless reconstruction problem back to the well-understood noiseless compressed sensing case for a Gaussian i.i.d. matrix, where we can safely apply the algorithm, and its state evolution. In the section 5.2, we apply this matrix transformation for numerical experiments using AMP.

Vector-AMP
While the transformation trick allows to make AMP work with random rotationally invariant matrices, another alternative is to work directly with a dedicated solver. To this means, different but related approaches were proposed [24,28], in particular, using the general expectation-propagation (EP) [29,30] scheme. Ma and Ping proposed a variation of EP called OAMP [31] specially adapted to rotation matrices. Rangan, Schniter and Fletcher introduced a similar approach called VAMP [32] and proved that it follows state evolution equations corresponding to the fixed point of the replica potential [13,24,25]. The multi-layer AMP algorithm of [33] also display the same fixed point.
We shall concentrate here on the VAMP (Vector-AMP) approach, and for a moment, put back a small additional random Gaussian i.i.d. noise of variance ∆ in the measurement in eq. (1) as it is needed for stating the algorithm. VAMP then consists in the following fixed-point iteration: where we denote by E t ,r the expectation w.r.t. the tilted distributions Q t ,r (x) ∝ P ,r (x)Q t ,r (x), and by Var t r, (x) the variance of these distributions. Here, where, as for AMP, we define the denoiser that yields the estimates of x by z(u, ρ) = dxP r (x)e − 1 2 ρx 2 +ux , Again, the performance of the recursion can be analyzed rigorously through the state evolution [32]. For simplicity, let us concentrate on the Bayes optimal case in which case the state evolution can be closed on the variables (see [32]): by writing where the expectation is above the distribution of the singular values Σ of the matrix Φ, and where we recognize the Stieljes transform S X (r) = E [1/X − r].
Though this transform, we see that the performance depends crucially on the distribution of eigenvalues. Let us now go back on the noiseless limit when ∆ → 0 and analyze how the universality shows up. Consider again the Stieltjes transform: out of the n singular values of the n × n matrix Φ T Φ, we shall have (1 − α)n of them to be zero (assuming α < 1) while the rest are positive (since m < n). In this case, the limit r → 0 of the Stieltjes transform will behave as S X (r) ≈ −(1 − α)/r so that Again, we see that all the complicated dependence on the spectrum of the matrix Φ has been eliminated. This is a direct, alternative, proof that VAMP will also yield universal results in the zero noise limit for the Bayesian reconstruction. Given that VAMP has the same fixed point as the replica mutual information [13,25], this argument applies to the replica prediction for the MMSE as well.

Structured matrices
We now move to very structured matrices, in order to test the universality as well as the quality and the prediction of the state evolution out of its comfort zone. In order to do so, we have considered different matrix ensembles:

Tested ensembled of matrices
Discrete cosine transform matrices The first ensemble we consider consists in Fourierlike matrices. A n × n discrete cosine transform (DCT) matrix Y is defined by: where j, k ∈ 0, n − 1 , 0 = 1/ √ 2, i = 1 for i = 1, ..., n − 1. We used a sub-sampled version of these matrices in which we picked some rows randomly.   Random features maps Finally, we wanted to consider here random features maps (RFM) as encountered in nonlinear regression problems. In such settings, a random features matrix Φ = f (W X) is obtained from the raw data matrix X by means of a random projection matrix W and a pointwise nonlinear activation f . Kernel regression models, nonlinear in the original data X, can then be approximately but efficiently solved by the linear estimation problem (1), with an appropriate choice for f and the Wdistribution [34]. Such matrices, that can be seen as the ouput of a neuron with random weights, have been investigated in particular in the context of neural networks [7,8]. Indeed, in neural networks configurations with random weights play an important role as they define the initial loss landscape. They are also fundamental in the random kitchen sinks algorithm in machine learning [34] and it is thus of interest to test our understanding of linear reconstructions with AMP and VAMP in this case.
In what follows we will test random features matrices where both W and X are random Gaussian i.i.d. matrices.

Numerical results
We provide the codes used to generate the data on github in the repo http://sphinxteam/Universality-CS-2019. To generate Figure 1 and 2, we ran VAMP 50 times on 50 × 50 points spanning the (α,ρ)-space, and computed the average meansquared error (MSE) between the signal x * and the reconstructed configuration x. The MSE is represented with a color bar (white means perfect reconstruction). For a DCT and a Hadamard matrix, we observe a phase transition in the Bayes-optimal case that matches the theoretical transition for Gaussian i.i.d. matrices. We also ran VAMP for the 1 reconstruction problem. Averaging on 20 executions (or 50 for small α where finite-size effects are more important), we recover again a phase transition matching the theoretical Donoho-Tanner line for Gaussian i.i.d. matrices [3]. Besides, we compared the MSE obtained by VAMP at each point of the phase diagram for different matrices. In figures 3 and 4, we plot the MSE averaged on 20 executions for ρ fixed and α ranging between 0 and 1. We get the same error in reconstruction for all matrices, following the MSE for Gaussian i.i.d. matrix for ρ = 0.25, 0.5 and 0.75. We also checked that AMP, provided one uses the trick eq. (7), reproduce these results as well: indeed the two algorithms returned extremely similar results.

Discussion
Figures of the previous section perfectly illustrate our main point: the universality in noiseless compressed sensing is not limited to the 1 -type reconstruction as in [3,4], but extends to other quantities and estimators, such as the hard-phase line in Bayesian reconstruction, and the MMSE. Besides, it is not limited to random orthogonal matrices, but empirically extends to Fourier-type matrices and to the random features maps currently studied in machine learning. It is an open question to extend the proof of state evolution to these challenging matrices. It would be interesting to find a good creterion to identify which matrices satisfy this universality and which do not; this is something that we are yet unable to predict in advance. An example of structured matrices that do not seem to follow these universal phase transitions is given by Haar wavelet matrices, which can be defined recursively by: where I k is the identity matrix of size k and ⊗ is the Kronecker product. In fact, VAMP even fails to converge for these matrices. Investigating this behavior is an interesting direction of research.