Fundamental limits in structured principal component analysis and how to reach them

Significance The assumption of unstructured “noise,” i.e., that the complement of what is considered an interesting “signal” in a certain dataset is pure randomness, has pervaded analytical studies in high-dimensional inference. However, this hypothesis is often too simplistic to capture realistic scenarios. We thus need to understand the role of the noise structure, namely the presence of correlations within it. We address this problem in spiked matrix models and provide characterization of the information-theoretic limits to inference. We also propose a message-passing algorithm which is observed through simulations to saturate these limits by optimally capturing the noise statistical dependencies. The resulting picture shows that both signal and noise structure should be exploited by algorithms in order to produce better signal estimation.

The success of inference and learning algorithms depends strongly on the structure of the high-dimensional noisy data they process. Consequently, quantifying how this structure helps algorithms to overcome the curse of dimensionality has become a central topic in statistics and machine learning. Classical examples include sparsity in compressed sensing (1), low-rank structure in matrix recovery (2), or community structure in community detection (3). In all these models, structure is usually assumed only at the signal's level. But the decomposition of the data into "signal" (the component considered of interest) and "noise" (the rest) is often arbitrary and application dependent. For example, in the classification of "dogs/cats," the training images contain a lot of information unrelated to dogs and cats-e.g., on the notions of "inside/outside," "day/night," etc. Yet, this highly structured potential source of information is discarded as random noise (independent, Gaussian, etc.). Most of the research effort has thus focused on understanding how the signal structure alone helps inferring it. In contrast, much less is known about the role of the noise structure and how to exploit it to improve inference.
Given their ubiquitous appearance in the statistics literature, spiked matrix models, which were originally formulated as models for probabilistic principal component analysis (PCA) (4), are now a paradigm in high-dimensional inference. Thanks to their universality features, they, and their generalizations, find numerous applications in other central problems, including community detection (3), group synchronization (5), and submatrix localization or high-dimensional clustering (6). They thus offer the perfect benchmark to quantify the influence of noise structure. In this paper, we focus on the following estimation problem: A statistician needs to extract a rank-one matrix (the spike) P * := X * X * , X * ∈ R N , from the data with "noise" Z and signal-to-noise ratio (SNR) λ ≥ 0. The spectral properties of finite rank perturbations of large random matrices like Eq. 1 were intensively investigated in random matrix theory (see, e.g., refs. [7][8][9], showing the presence of a threshold phenomenon coined BBP transition (in reference to the authors of ref. 7): When λ is large enough, the top eigenvalue of Y detaches from the bulk

Significance
The assumption of unstructured "noise," i.e., that the complement of what is considered an interesting "signal" in a certain dataset is pure randomness, has pervaded analytical studies in high-dimensional inference. However, this hypothesis is often too simplistic to capture realistic scenarios. We thus need to understand the role of the noise structure, namely the presence of correlations within it. We address this problem in spiked matrix models and provide characterization of the information-theoretic limits to inference. We also propose a message-passing algorithm which is observed through simulations to saturate these limits by optimally capturing the noise statistical dependencies. The resulting picture shows that both signal and noise structure should be exploited by algorithms in order to produce better signal estimation. of eigenvalues. Its corresponding eigenvector has then a nontrivial projection onto the sought ground truth X * and can be used as its estimator. The problem has also been approached from the angle of Bayesian inference (10)(11)(12)(13). In particular, besides the previous spectral estimator, there exists a whole family of iterative algorithms, known as approximate message passing (AMP), that can be tailored to take further advantage of prior structural information about the signal and noise. AMP algorithms were first proposed for estimation in linear models (14,15) but have since been applied to a range of statistical estimation problems, including generalized linear models (16,17) and lowrank matrix estimation (11,18). An attractive feature of AMP is that its performance in the high-dimensional limit can often be characterized by a succinct recursion called state evolution (19,20). Using the state evolution analysis, it has been proved that AMP achieves Bayes-optimal performance for some models (11,16,18), and a conjecture posits that for a wide range of estimation problems, AMP is optimal among polynomial-time algorithms (21).
The references mentioned above rely on the assumption of independent and identically distributed (i.i.d.) noise, often taken Gaussian Z ij = Z ji ∼ N (0, 1), under which Eq. 1 is the wellknown spiked Wigner model (4). This independence, or "absence of structure," in the noise simplifies greatly the analysis. In order to relax this property, we may seek inspiration from the statistical physics literature on disordered systems. An idea that was first brought forth in refs. 22 and 23 for the Sherrington-Kirkpatrick model, and later imported also in high-dimensional inference (24,25), is that of giving an inhomogeneous variance profile to the noise matrix elements; we mention that this idea in inference is similar to the earlier definition of "spatially coupled systems" (26,27) in coding theory, see ref. 12 for its use in the present context. The procedure makes the (Z ij ) no longer identically distributed, but it leaves them independent. This is an important step toward more structure in the noise. Yet, the independence assumption is a rather strong one. In fact, ref. 25 showed that a broad class of observation models, as long as the independence assumption holds, are information-theoretically equivalent to one with independent Gaussian noise.
One way to go beyond is to consider noises belonging to the wider class of rotationally invariant matrices. Since the appearance of the seminal studies (28)(29)(30), there has been a remarkable development in this direction, as evidenced by the rapidly growing number of papers on spin glasses (31)(32)(33) and inference (34)(35)(36)(37) that take into account structured disorder, including the present one. Indeed, we hereby consider a spiked model in which the noise Z is drawn from an orthogonal matrix ensemble different from the Gaussian orthogonal ensemble (the only one with independent entries). Intuitively, the presence of dependencies in the noise should be an advantage for an algorithm sharp enough to see patterns within it and use them to retrieve the sought low-rank matrix. Going in that direction, Fan (35) proposed a version of AMP designed for rotationally invariant noises (using earlier ideas of refs. 31 and 32). Furthermore, in a recent work (38), part of the authors analyzed a Bayes estimator and an AMP, both assuming Gaussian noise, whereas the actual noise in the data was drawn from a generic orthogonal matrix ensemble. However, besides intuition and the mentioned studies, to the best of our knowledge, there is little theoretical understanding of the true role played by noise structure in spiked matrix estimation and more generically in inference. In particular, prior to our work, there was no theoretical prediction of optimal performance to benchmark practical inference algorithms.

Setting and Main Results
Our analysis focuses on two types of signal's distributions: the factorized prior dP X (x) = i≤N dP X (x i ) and a uniform prior measure over the N -dimensional sphere of radius √ N . By convention, x 2 dP X (x) = 1, which amounts to rescale λ. The noise matrix Z is drawn from a trace random matrix ensemble, defined by a certain potential V : R → R. V is extended to matrices as follows: if A = diag(a 1 , . . . , a N ) then V (A) = diag(V (a 1 ), . . . , V (a N )). For real symmetric matrices M = UAU , with U orthogonal, V (M) = UV (A)U . With these notations, we can write the density of the trace ensemble (with normalization constant C V ) as Instances of such ensembles have a spectral decomposition Z = ODO , with O uniformly distributed over N × N orthogonal matrices. The distribution of the eigenvalues in the diagonal matrix D, which is independent of O, can be explicitly written, see SI Appendix, section 1.2. Only the special case V (x) = x 2 /(2σ ), corresponding to the Gaussian orthogonal ensemble, induces independent (Gaussian distributed) matrix entries. Any other potential generates dependencies among matrix elements and thus structure. For example, if we take V (x) = x 4 /4, the probability density would be proportional to exp(− N 8 Z ij Z jk Z kl Z li ), which is clearly not factorizable over matrix entries.
Analyzing the model for a generic potential V is possible through the methodology presented in this paper. Indeed, as discussed in SI Appendix, A, this can be done by studying the inference problem whose noise's potential is a polynomial approximation of V . However, if we take a generic polynomial potential V , the higher the order, the more technical and cumbersome our derivations become. Therefore, for the sake of clarity, we focus on a concrete example of nontrivial correction to i.i.d. noise: the quartic matrix potential V (x) = µx 2 /2+γ x 4 /4, where µ and γ are two nonnegative real numbers (39). We could have also considered a nonsymmetric potential with a cubic term too, but for simplicity, we restrict ourselves to that case as symmetry slightly simplifies the computations. The noise Z drawn from the quartic matrix ensemble has a known N → ∞ asymptotic eigenvalue distribution (40) where a 2 := ( µ 2 + 12γ − µ)/(6γ ). In order to have a coherent definition of SNR, we also fix When µ = 1, γ (1) = 0 and we recover the pure Wigner case. On the contrary, (µ = 0, γ (0) = 16/27) corresponds to a purely quartic case with unit variance, the "most structured" ensemble in this class. Therefore, µ allows us to interpolate between unstructured and structured noise ensembles.
We emphasize that, although this model may seem rather academic at first sight, we will see that our main assumption, that is, the rotational invariance of the noise, turns out to yield a theory which accurately predicts the empirical performance of algorithms for inference of low-rank matrices hidden in noise coming from real datasets from various application domains. This is probably a consequence of strong universality properties, yet to be understood from a theoretical perspective, along the lines of refs. 41 and 42. We thus argue that our assumptions are in fact rather mild, making our inference algorithms relevant for potential future applications.
We now introduce the Bayesian framework we are going to analyze. Let P := xx . The posterior measure reads

[4]
The evidence P Y (Y) is simply the integral of the numerator. We stress that the prior P X and the likelihood P Y |X match respectively the distribution of the signal and the noise density P Z , and λ is known. Therefore, we are in the Bayes-optimal setting. Studying the limits of inference in this setting draws a fundamental line between what is information-theoretically possible and what is not in terms of performance of inference.
A main object of interest is the free entropy, which is minus the Shannon entropy of the data: It is related to the mutual information between signal and data through the identity . The relevance of the latter is extensively discussed in SI Appendix, section 1.4. Using the form of the observation model in Eq. 1, it reads where the Hamiltonian linked to the partition function Z is In this way, the problem is mapped onto a statistical mechanics model with "quenched randomness" Z, X * and "spins" x with Gibbs-Boltzmann distribution associated with this Hamiltonian (i.e., the posterior). This Hamiltonian is tricky to directly deal with, so a key point will be to "convert" it into a more tractable quadratic form; see Section 1 and SI Appendix, section 3.1.

Result 1: Information-Theoretical Limits.
Our first result is a variational formula for the mutual information via the celebrated replica method (43) outlined in Section 1: If we let * := argmax{f ρ ( ) : ∈ R 13 , ∇f ρ ( ) = 0}, then we have the following low-dimensional expression for the mutual information between hidden spike and the data: The argmax is selected and not the argmin as f ρ is a free entropy (i.e., minus free energy, the free energy being minimized in physics). f ρ and its derivation are reported in SI Appendix, section 3.2. The 13 coupled fixed point equations coming from ∇f ρ = 0 will reduce to only 2 (SI Appendix, Eqs. 79-84) thanks to special symmetries inherent to the Bayes-optimal nature of our analysis. One of the two remaining order parameters, denoted m 2 and called (squared) "magnetization," quantifies the asymptotic trace inner product between the minimum mean-square error (MMSE) estimator dP X |Y (x | Y)xx and the spike X * X * . It allows us to compute the MMSE as with m solving the aforementioned system of equations. The above results hold for a factorized prior P ⊗N X . Nevertheless, if X * is uniformly distributed on the sphere, a variational formula analogous to Eq. 7 can still be derived, as shown in SI Appendix, section 3.6, and the related MMSE computed. Analytical arguments and numerical experiments show that the latter can be achieved using the naive spectral estimator C of P * obtained from the principal eigenvector = (Y) of Y properly rescaled by a certain factor C(λ, ρ); see ref. 9.
Result 3a: Optimal Preprocessing of the Data. Instead of using an AMP with iterates based on Y, we introduce a preprocessing procedure driven by the AdaTAP formalism (32). The end result is an effective quadratic model (i.e., with only pairwise interactions) which is "equivalent" (in a proper sense described below) to the original one, with coupling matrix [9] This model being quadratic is now solvable using AdaTAP/AMP and possesses the same thermodynamic properties (free entropy, phase transitions, etc.) as well as the same marginal means and variances as the model in Eq. 4 when N → ∞ (and thus equivalent for our purposes). Therefore, to approximate the MMSE estimator, one can simply "preprocess" Y by applying J (Y) and then efficiently compute the marginals of the resulting quadratic model by AdaTAP/AMP, see next section. AdaTAP allows us to parametrize the free entropy (i.e., log-partition function) of a model with quadratic Hamiltonian, for a given instance of the interaction matrix, in terms of O(N ) order parameters, some of which correspond to the sought marginal means ( x i ) i≤N and associated variances. The extremization w.r.t. them yields equations that can be solved iteratively and identified with an AMP algorithm. However, the Hamiltonian in Eq. 6 is not quadratic in x but can be made so by fixing certain order parameters as outlined in Section 1. The resulting coupling matrix depends on Y and on the fixed order parameters, whose values are constrained by Bayes optimality (SI Appendix, section 1.4). Using these values, for an initial quartic V (x) = µx 2 /2 + γ x 4 /4, we get the above interaction matrix in Eq. 9 (SI Appendix, sections 5.1 and 5.2). The "cleaning effect" of J (Y) is illustrated in Fig. 1. In general, for a (K + 1)-order polynomial matrix potential, the preprocessed matrix is a polynomial J (Y) = k≤K c k Y k , with (c k ) k≤K depending on V . For example, for V (x) = ξ x 6 /6 (with ξ = 27/80 to select unit variance), the preprocessing (derived similarly to the quartic case, see SI Appendix, section 5.3) is it has an effect similar to that in Fig. 1. We point out that the statistics of the noise could be only partially known. This issue can be overcome by learning the (c k ) from the data; see SI Appendix, B.
Result 3b: Bayes-Optimal AMP. First, we show in SI Appendix, section 4 that existing AMPs (35,36) do not saturate the MMSE predicted by Eq. 8. We provide a replica-based theory showing that despite these existing AMPs are aware of the noise structure/statistics, they nevertheless make an implicit mismatched assumption of i.i.d. Gaussian noise: The noise structure is "only" exploited to enforce convergence despite the mismatch, rather than as a source of greater statistical accuracy, in contrast to the proposed AMP we explain now. To cure this issue, we employ the preprocessed J (Y) in AMP, which leads to our (conjectural) Bayes-optimal approximate message-passing (BAMP) algorithm with recursion with g t+1 applied component-wise. For simplicity, we assume to have access to an initialization u 1 ∈ R N independent of the noise Z and with a strictly positive correlation with X * , i.e., This requirement is rather standard in the analysis of AMP algorithms (16,35,44). However, as having access to such an initialization is often impractical, recent work (18,36,45) has designed AMPs initialized with the top eigenvector (Y).
By carefully choosing the Onsager coefficients {c t,j } j∈[t] , we rigorously obtain BAMP's state evolution characterization.

Theorem 1 (State evolution of BAMP). Let J (Y) = i≤K c i Y i .
Consider the AMP of Eq. 10 initialized as Eq. 11, with Onsager coefficients {c t,j } j∈ [t] given in SI Appendix, section 6.2, and where (g t+1 ) t≥1 are C 1 and Lipschitz. Then, the following limit holds almost surely for any order 2 pseudo-Lipschitz function * ψ : R 2t+2 → R and t ≥ 1: Here, for i ∈ [t], U i+1 = g i+1 (F t ) and (F 1 , . . . , F t ) = t X * + (W 1 , . . . , W t ), with (W i ) i≤t a multivariate Gaussian vector whose covariance as well as t are given in SI Appendix, section 6.2.
Eq. 12 provides a high-dimensional characterization of our proposed BAMP. A suitable choice of ψ readily gives the MSE of the BAMP iterates. We also note that our result is equivalent to the almost sure convergence in Wasserstein-2 distance of the joint empirical distribution of (u 1 , . . . , u t+1 , f 1 , . . . , f t , X * ) to (U 1 , . . . , U t+1 , F 1 , . . . , F t , X * ), see corollary 7.21 of ref. 44.
We emphasize that our BAMP algorithm is not the usual AMP of ref. 35, where the data matrix Y are just replaced by the preprocessed matrix J (Y). Indeed, tuning the Onsager coefficients {c t,i } entering BAMP requires a type of "multistage" state evolution recursion which is completely different from the one in ref. 35. The acronym we introduce stresses this crucial distinction. While our replica prediction for the MMSE is nonrigorous, the state evolution analysis of BAMP is rigorous.
In Section 2, we show that BAMP improves over the AMP in ref.
35 by comparing their fixed points. This improvement is thus a rigorous conclusion, while the conjecture is that BAMP saturates the Bayes-optimal performance. Finally, the "multistage" state evolution of BAMP suggests a choice of the denoisers in the AMP of ref. 35, which differs from the greedy strategy of ref. 36 (i.e., picking the full posterior mean denoiser at every iteration). The numerical results of Section 2 also show that this denoiser selection-motivated by BAMPmeets the BAMP performance and, hence, the replica prediction of the Bayes-optimal error.

Numerical Results and Discussion
BAMP vs the Replica Prediction. The Left plot of Fig. 2 considers the quartic ensemble for µ = 0, and the right one refers to the pure power six potential. The signal X * has a Rademacher prior X * i ∼ 1 2 (δ 1 + δ −1 ). The estimators of the spike X * X * are compared in terms of the MSE achieved at the fixed point, as a function of the SNR λ. All algorithms are run for N = 8,000, they are initialized with u 1 that satisfies Eq. 11, and the results are averaged over 50 trials; the state evolution recursions and the replica prediction are for N → ∞. In SI Appendix, section 7.2, we provide additional numerical results for a sparse Rademacher prior, which display a similar qualitative behavior.
We observe that all algorithms converge rapidly: 10 iterations are sufficient to reach the corresponding fixed points. A few remarks concerning the results displayed in Fig. 2 are now in order. First, in all settings, the fixed point of the BAMP state evolution (red) matches the replica prediction (black). This is a strong numerical evidence supporting our conjecture that the proposed BAMP algorithm is Bayes-optimal. These theoretical curves for N → ∞ are also remarkably close to the MSE achieved by the BAMP algorithm at N = 8,000.
Second, there is a clear performance gap between our proposed BAMP (red) and the existing AMP algorithms (35,36) (singlestep denoiser in blue, and multistep in ochre). For V (x) = ξ x 6 /6, the gap is even more evident. As predicted by our theory, the gap is reduced when µ approaches 1 with all curves collapsing for µ = 1; see SI Appendix, section 7.2.
. Finally, (v) (green triangles) performance of BAMP when the uniformly distributed matrix O (appearing in the spectral decomposition of the noise Z) is replaced by the product of the Hadamard-Walsh matrix and a diagonal matrix with i.i.d. Rademacher entries as in ref. 42. In the smaller plots in the Top-Right corner, we report the performance of AMP-AP (blue) and of BAMP for our universality experiments involving the CIFAR-10 "plane" class (purple) and the "muscle skeletal" GTEx dataset (orange).
Thirdly, we consider a choice of denoisers in the AMP of ref. 35, which is motivated by our BAMP: If the potential has degree K , every K -th nonlinearity is the full memory posterior mean denoiser, and all the other denoisers are chosen to be the identity. The algorithm is dubbed AMP with alternating posteriors (AMP-AP), and its connection to BAMP is discussed at the end of the 1 section. As evident from the smaller plots in the top right corner, AMP-AP (blue) matches the performance of BAMP and of the replica prediction as well.
Last, BAMP is numerically unstable for low SNR. For the quartic potential and λ = 2.3, 5 out of 50 trials do not reach the state evolution fixed point (and are thus discarded). Furthermore, BAMP's state evolution detaches from the replica prediction as the SNR gets smaller. Considering an initialization closer to the fixed point mitigates the issue. This instability is likely due to the fact that BAMP's state evolution corresponds to an auxiliary AMP that multiplies the number of iterations (see the 1 section) and which thus amplify errors.
Universality of the Rotational Invariance Assumption. We believe that our results apply beyond the rotational invariance assumption to cases where the eigenbasis of the noise is invariant under more restrictive transformations (such as permutations), or even "quasideterministic." This intuition comes from recent studies (41,42) showing that, when AMP or its linearized version are used, the class of rotationally invariant matrices leads to the same performance as a much broader class of matrices (with same spectral density). While the existing literature considers a setting different than ours, this still suggests that our predictions should remain true more generally. To confirm this, we plot in Fig. 2 the performance of BAMP when the uniformly distributed matrix O (i.e., the noise Z eigenbasis) is replaced by i) the product of the Hadamard-Walsh matrix and a diagonal matrix with i.i.d. Rademacher entries, as in ref. 42 (green squares), or ii) the eigenbasis of the covariance matrix for two popular datasets in computer vision and quantitative genetics, i.e., the CIFAR-10 (46) "plane" class and the "muscle skeletal" GTEx dataset (47,48) (purple and orange markers, respectively, in the Top-Right plots). The excellent match clearly supports the universality of our predictions. Additional validations are contained in SI Appendix, section 7.2. These results can be understood from the fact that any eigenbasis O is typical w.r.t. the Haar measure, so for a fixed instance, as long as O is sufficiently independent of the eigenvalues, the universality should hold. This suggests that, in practice, our rotational invariance assumption effectively corresponds to assuming decoupling between eigenbasis and eigenvectors.

Methods
Outline of the Replica Computation. The starting point of the replica method is the "replica trick" lim N→∞ E ln Z(Y)/N = lim n→0 lim N→∞ ln EZ n (Y)/(Nn) that implicitly assumes the commutation of the n, N limits. Another key assumption is to consider n ∈ N in the computation and then assume an analytic continuation to n close to 0 + . The expectation is with respect to Y or equivalently the independent O, X * ; concerning D, we only need that its empirical eigenvalue distribution converges weakly to ρ and that it has asymptotically no outliers. When computing Z n , we get multiple integrals over (x ) 0≤ ≤n , with x 0 ≡ X * , and a sum of n Hamiltonians as in Eq. 6 in the exponential. Expanding the exponent, we identify some order parameters: After fixing these using the Fourier representation of the Dirac delta function, the replicated partition function reads where H N ( ,ˆ , x; x 0 , Z) := Nh( ,ˆ ) + x J 1 ( ,ˆ , Z)x + x J 0 ( ,ˆ , Z)x 0 , and := (v , M (1) , κ , m ) withˆ being the Fourier conjugate. The definitions of (h, J 1 , J 0 ) can be found in SI Appendix, section 3.1. This point is crucial as it allows us to write the n Hamiltonians (one per x ) as at most quadratic functions of x . Due to the quartic nature of the potential, the original H N would instead have quartic interactions, or higher-order ones for polynomial V of degree greater than four. Yet, by identifying the proper order parameters, a similar reduction to effective quadratic Hamiltonians would still be possible.
In EZ n , the replicas are coupled in the system only through the expectation over the quenched noise, that can be rewritten as an expectation over the Haar distributed noise eigenbasis E O . The entire computation then boils down to the evaluation of an inhomogeneous log-spherical integral that we introduced and defined as follows: Let the matrices C = diag((C i, ) i≤N ), C i = (C i, ) , ≤n , and vectors h = (h i, ) i≤N , h i = (h i, ) ≤n all having bounded entries uniformly in N. The sequence (h i ∈ R n , C i ∈ R n×n ) i≤N is assumed to have an empirical law tending to that of the random variable (h ∈ R n , C ∈ R n×n ). The inhomogeneous log-spherical integral is defined as Its limit depends only on the law of (C, h) and on the overlaps q := 1 N x x , ≤ , that we need to fix with additional Dirac deltas in addition to the previous order parameters. We find that lim N→∞ I N is expressed by a variational formula; see SI Appendix, section 2.1. This integral is a natural generalization of the standard spherical integral (49) and thus may have an interest beyond the present model, in particular, in random matrix theory or spin glasses. The final ingredient is a replica symmetric ansatz, justified by the strong concentration-of-measure effects taking place in the Bayes-optimal setting (50,51). It amounts to assume that all order parameters entering the model are independent of the replica index . Finally, a saddle point yields an extremization over R 13 of an effective action. Eqs. 7 and 8 follow directly.
Concerning the reduction from 13 to 2 order parameters (saddle point equations): This is possible thanks to a symmetry arising as a consequence of the Bayes rule which is specific to the Bayes-optimal setting and often called Nishimori identity. It allows to "interchange" the ground-truth signal X * with a sample x from the posterior Eq. 4 inside joint expectations over the posterior and data (see, e.g., ref. 51) and as a consequence to automatically fix the value of most order parameters.
Auxiliary AMP and Onsager Coefficients. The Onsager coefficients {c t,i } i∈[t],t≥1 are designed so that, conditioned on the signal, the empirical distribution of the iterate f t is Gaussian, namely (f 1 , . . . , f t ) W 2 −→ (F 1 , . . . , F t ) := t X * + W t , with W t ∼ N (0, t ) for some mean vector t and covariance matrix t . For the AMP in ref. 35, this condition is enforced via the reduction to an auxiliary AMP, which also allows to track the iterates of the original algorithm and yields the state evolution parameters, such as t and t above. This reduction crucially relies on splitting the matrix Y that multiplies the iterate u t , into the rank-one signal plus the noise matrix. In contrast, in Eq. 10, the iterate is multiplied by the preprocessed matrix J(Y), which cannot be directly split in a similar fashion. Hence, we track all the contributions (Y k u t ) k≤K , so that we can split them as Y k u t = Y Y k−1 u t = λ N X * X * , Y k−1 u t + ZY k−1 u t . The key idea is to map the first T iterations of Eq. 10 to the first K × T iterations of an auxiliary AMP with iterates (z t ,ũ t ) t∈ [KT] and denoisers {h t+1 } t∈[KT] , z t = Zũ t − i≤tb t,iũ i ,ũ t+1 =h t+1 (z 1 , . . . ,z t , u 1 , X * ), [13] whose state evolution can instead be deduced from ref. 35  More specifically, for t ∈ [T] and ∈ {2, . . . , K}, the denoiserh K(t−1)+ givingũ K(t−1)+ is a linear combination of past iteratesũ 1 , . . . ,ũ K(t−1)+ −1 and ofz K(t−1)+ −1 ; furthermore, the coefficients of these linear combinations are chosen to ensure thatũ K(t−1)+ ≈ Y −1 u t . Hence, by using Eq. 13 with Kt in place of t, one gets (Y u t ) ∈[K] fromz Kt and (ũ K(t−1)+ ) ∈{2,...,K} (up to an o N (1)). Thus, J(Y)u t can be expressed as a linear combination of (ũ 1 , . . . ,ũ Kt ,z Kt ), which in turn is a linear combination of i) the past iterates {u i } i∈ [t] , ii) the signal X * , plus iii) independent Gaussian noise. By inspecting the coefficients of this linear combination, one deduces a) the Onsager coefficients {c t,i } i∈[t],t≥1 (as the coefficients multiplying the past iterates {u i } i∈[t] ), b) the mean µ t (as the coefficient multiplying the signal X * ), and c) the covariance matrix t (as the covariance matrix of the remaining noise terms). Finally, by makingh Kt+1 depend on g t+1 , we enforce that u Kt+1 ≈ u t+1 . The description of the auxiliary AMP is deferred to SI Appendix, C.1, and its state evolution follows in SI Appendix, C.2.
In summary, the derivation of BAMP's Onsager coefficients involves approximating {Y k u t } k≤K−1 . This suggests an alternative choice of denoisers leading to the algorithm dubbed AMP-AP: For each batch of K iterations, we pick linear denoisers in the first K − 1 of them, as this allows to construct {Y k u t } k≤K−1 ; then, at the K-th iteration, we pick the posterior mean using all the past iterates, as this-in principle-allows one to assemble the vectors {Y k u t } k≤K−1 to obtain J(Y)u t as in BAMP. We note that AMP-AP does not require the coefficients of the polynomial J(Y), but it rather leaves to the posterior mean denoiser to learn them from the data. As such, it provides an efficient alternative to our proposed BAMP.