Bryan's Maximum Entropy Method -- diagnosis of a flawed argument and its remedy

The Maximum Entropy Method (MEM) is a popular data analysis technique, based on Bayesian inference, which has found various applications in the research literature. While the MEM itself is well grounded in statistics, I argue that its state-of-the-art implementation, suggested originally by Bryan, artificially restricts its solution space. This restriction leads to a systematic error often unaccounted for in contemporary MEM studies. Since previously published arguments on the shortcoming of Bryan's MEM have recently been questioned in arXiv:2001.10205, this paper will carefully revisit Bryan's train of thought, point out its flaw in applying linear algebra arguments to an inherently non-linear problem and suggest possible ways to overcome it.


I. INTRODUCTION
Ill-posed inverse problems represent a major challenge to research in many disciplines. They arise e.g. when one wishes to reconstruct the shape of an astronomical object after its light has passed through a turbulent atmosphere and an imperfect telescope, or when we image the interior of a patient's skull during a CAT scan. In a more theoretical setting, extracting spectral functions from numerical simulations of strongly correlated quantum fields constitutes another example. The common difficulty among these tasks lies in the fact that we do not have direct access to the quantity of interest (from here on referred to as ρ) but instead only to a distorted representation of it, measured in our experiment (from here on denoted by D). Extracting ρ from D in general requires us to solve an inherently non-linear optimization problem, which we construct and discuss in detail below.
Let us consider the common class of inverse problems, where the quantity of interest ρ and the measured data D are related via an integral convolution

  D̃(τ) = ∫_{ω_min}^{ω_max} dω K(τ, ω) ρ(ω),   0 ≤ τ ≤ τ_max,   (1)

with a kernel function K(τ, ω). For the sake of simplicity let us assume (as is often the case in practice) that the function K is exactly known. The task at hand is to estimate the function ρ that underlies the observed D. The ill-posedness (and ill-conditioning) of this task is readily spotted if we acknowledge that our data come in the form of N_τ discrete estimates D_i = D̃(τ_i) + η of the true function D̃, where η denotes a source of noise. In addition, we need to approximate the integral in some form for numerical treatment. In its simplest form, writing it as a sum over N_ω bins, we obtain

  D_i^ρ = Σ_{l=1}^{N_ω} Δω_l K(τ_i, ω_l) ρ_l,   i = 1 … N_τ.   (2)

At this point we are asked to estimate N_ω ≫ N_τ optimal parameters ρ_l from N_τ data points, which themselves carry uncertainty. A naive χ² fit of the ρ_l's is of no use, since it would produce an infinite number of degenerate solutions, which all reproduce the set of D_i's within their errorbars. Only if we introduce an appropriate regularization can the problem be made well-posed, and it is this regularization which in general introduces the non-linearities.
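The discretized forward problem (2) and its ill-conditioning can be sketched in a few lines. The exponential kernel, the grids and the mock spectral function below are illustrative assumptions chosen only for this sketch, not the example studied later in this paper.

```python
import numpy as np

# Sketch of the discretized forward problem (2); kernel, grids and
# mock spectral function are assumptions for illustration only.
N_omega, N_tau = 1000, 16                 # N_omega >> N_tau
omega = np.linspace(1e-3, 50.0, N_omega)  # frequency grid omega_l
tau = np.linspace(0.0, 1.5, N_tau)        # "measurement" grid tau_i
d_omega = omega[1] - omega[0]             # uniform bin width

K = np.exp(-np.outer(tau, omega))         # kernel matrix K_il
rho = np.exp(-0.5 * (omega - 5.0) ** 2)   # mock spectral function rho_l
D = d_omega * K @ rho                     # synthetic data D_i^rho as in (2)

# Ill-conditioning: the singular values of K decay rapidly, so many
# different sets of rho_l reproduce D within any finite errorbar.
sv = np.linalg.svd(K, compute_uv=False)
print(D.shape, sv[0] / sv[-1])            # condition number is huge
```

The rapid decay of the singular values is precisely what makes a naive χ² fit degenerate in practice.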
Bayesian inference represents one way to regularize the inversion task. It provides a systematic procedure for incorporating additional (so-called prior) knowledge to that effect. Bayes' theorem states that the posterior probability P[ρ|D, I] for some set of parameters ρ_l to be the true solution of the inversion problem is given by the product of the likelihood probability P[D|ρ, I] and the prior probability P[ρ|I],

  P[ρ|D, I] = P[D|ρ, I] P[ρ|I] / P[D|I].   (3)

The ρ-independent normalization P[D|I] is often referred to as the evidence. Assuming that the noise η is Gaussian, we may write the likelihood as

  P[D|ρ, I] ∝ exp(−L),   (4)
  L = (1/2) Σ_{i,j=1}^{N_τ} (D_i − D_i^ρ) C_{ij}^{−1} (D_j − D_j^ρ),   (5)

where C_ij is the unbiased covariance matrix of the measured data with respect to the true mean and D_i^ρ refers to the synthetic data that one obtains by inserting the current set of ρ_l parameters into (2). A χ² fit would simply return one of the many degenerate extrema of L, hence being referred to as a maximum likelihood fit. The important ingredient of Bayes' theorem is the presence of the prior probability, often expressed in terms of a regulator functional S,

  P[ρ|I] ∝ exp(S).   (6)

It is here where pertinent domain knowledge can be encoded. For the study of intensity profiles of astronomical objects and hadronic spectral functions it is e.g. a priori known that the values of ρ must be positive. Depending on which type of information one wishes to incorporate, the explicit form of S will be different. Once both the likelihood and prior probability are set, we may search for the maximum a posteriori (MAP) point estimate of the ρ_l's via

  ρ_Bayes = argmax_ρ P[ρ|D, I],   (7)

which constitutes the most probable parameters given our data and prior knowledge. Note that since the exponential function is monotonous, instead of finding the extremum of P[ρ|D, I], in practice one often considers the extremum of L − S directly.
Let us have a look at three choices of regulators from the literature: the historic Tikhonov regulator [1] (1943-), the Shannon-Jaynes entropy deployed in the MEM [2,3] (1986-) and the more recent BR regulator [4] (2013-):

  S_Tik = − ∫ dω (α(ω)/2) ( ρ(ω) − m(ω) )²,   (8)
  S_MEM = ∫ dω α(ω) [ ρ(ω) − m(ω) − ρ(ω) log( ρ(ω)/m(ω) ) ],   (9)
  S_BR = ∫ dω α(ω) [ 1 − ρ(ω)/m(ω) + log( ρ(ω)/m(ω) ) ].   (10)

All three regulators contain two sets of so-called hyperparameters, m and α. The former is referred to as the default model, which is closely related to the mean of the prior distribution, while the confidence parameter α determines the spread of P[ρ|I]. Both S_MEM and S_BR enforce positivity of the function ρ, due to the logarithm. This logarithm is also responsible for the numerical optimization problem (7) becoming genuinely non-linear. Note that all three functionals are concave, which (as proven e.g. in [5]) guarantees that if an extremum of P[ρ|D, I] exists it is unique. I.e. within the N_ω-dimensional solution space spanned by the discretized parameters ρ_l, in case a Bayesian solution exists, we will be able to locate it with standard numerical methods in a straightforward fashion.
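The qualitative differences between the three regulators near ρ = 0 are easy to verify numerically. The sketch below assumes the integrand forms quoted above, with scalar hyperparameters α and m:

```python
import numpy as np

# Densities of the three regulators at fixed alpha and m (assumed
# scalar here); all vanish at rho = m and are <= 0 elsewhere.
def s_tikhonov(rho, m=1.0, alpha=0.1):
    return -0.5 * alpha * (rho - m) ** 2

def s_mem(rho, m=1.0, alpha=0.1):          # Shannon-Jaynes
    return alpha * (rho - m - rho * np.log(rho / m))

def s_br(rho, m=1.0, alpha=0.1):
    return alpha * (1.0 - rho / m + np.log(rho / m))

# Behavior near rho = 0: Shannon-Jaynes flattens off to the finite
# value -alpha*m (the "flat direction"), while BR diverges to -inf.
print(s_mem(1e-8), s_br(1e-8))
```

Evaluating at ever smaller ρ shows the Shannon-Jaynes density saturating while the BR density keeps falling logarithmically, which is the behavior discussed around Fig. 3 below.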

A. Tikhonov regularization
Let us have a look at the consequences of choosing a particular regularization. The Tikhonov choice amounts to a Gaussian prior probability, which allows ρ to take on both positive and negative values. The default model m_l denotes the value for ρ_l which was most probable before the arrival of the measured data D (e.g. from a previous experiment), and α_l represents our confidence in the prior knowledge (e.g. the uncertainty of the previous experiment).
Since both (5) and (8) are at most quadratic in ρ_l, taking the derivative in (7) leads to a set of linear equations that need to be solved to compute the Bayesian optimal solution ρ_Bayes. It is this fully linear scenario from which most intuition is derived when it comes to the solution space of the inversion task. Indeed (taking for simplicity a uniform confidence α_l = α) we are led to the following relations

  α (ρ_l − m_l) = − [K^T δL/δD^ρ]_l,   (11)
  ρ = m + K^T a,   a = − (1/α) δL/δD^ρ,   (12)

which can be written solely in terms of linear vector-matrix operations. Note that in this case δL/δD^ρ contains the vector ρ in a linear fashion. (12) invites us to parametrize the function ρ via its deviation from the default model and to look for the optimal set of parameters a_i. Here we may safely follow Bryan [6] and investigate the singular values of

  K^T = U Σ V^t,   (13)

with U being an N_ω × N_ω special orthogonal matrix, Σ an N_ω × N_τ matrix with N_τ non-vanishing diagonal entries, corresponding to the singular values of K^T, and V^t being an N_τ × N_τ special orthogonal matrix. We are led to the expression

  ρ = m + U Σ V^t a,   (14)

which tells us that in this case the solution of the Tikhonov inversion lies in a functional space spanned by the first N_τ columns of the matrix U (usually referred to as the SVD or singular subspace) around the default model m. I.e. we can parametrize ρ as

  ρ_l = m_l + Σ_{k=1}^{N_τ} c_k U_lk.   (15)

The point here is that if we add to this SVD-space parametrization any further column of the matrix U, it directly projects into the null space of K via K · U_{k>N_τ} = 0. In turn such a column does not contribute to computing synthetic data via (2). As was pointed out in [7], in such a linear scenario the SVD subspace is indeed all there is. If we add extra columns of U to the spectral function, these do not change the likelihood. Thus the corresponding parameter c_j of that column is not constrained by the data and will come out as zero in the optimization procedure of (7), as it encodes a deviation from the default model, which is minimized by S.
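The null-space property on which this linear argument rests can be checked directly. The kernel and grids below are illustrative assumptions; the point is only that columns of U beyond the first N_τ are annihilated by K when used linearly:

```python
import numpy as np

# Numerical check of the linear-algebra argument: every column of U
# beyond the first N_tau lies in the null space of K. The kernel is
# an illustrative assumption.
N_omega, N_tau = 200, 5
omega = np.linspace(0.1, 20.0, N_omega)
tau = np.linspace(0.0, 1.0, N_tau)
K = np.exp(-np.outer(tau, omega))

U, Sigma, Vt = np.linalg.svd(K.T)   # K^T = U Sigma V^t, U is N_omega x N_omega
col = U[:, N_tau]                   # first column outside the SVD subspace

# Used *linearly*, this column drops out of the synthetic data:
print(np.linalg.norm(K @ col))      # ~ 0 up to machine precision
```

Adding such a column to a linear parametrization of ρ therefore leaves the likelihood untouched, exactly as stated above.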

B. Maximum Entropy Method
Now that we have established that in a fully linear context the arguments based on the SVD subspace are indeed justified, let us continue to the Maximum Entropy Method, which deploys the Shannon-Jaynes entropy as regulator. S_MEM encodes as prior information e.g. the positivity of the spectral function, which manifests itself in the presence of the logarithm in (9). This logarithm however also entails that we are now dealing with an inherently non-linear optimization problem in (7). Carrying out the functional derivative with respect to ρ on L − S, we are led to the following relation

  α log(ρ_l/m_l) = − [K^T δL/δD^ρ]_l.   (16)

As suggested by Bryan [6], let us introduce the completely general (since ρ > 0) and non-linear redefinition of the parameters

  ρ_l = m_l exp(a_l).   (17)

Inserting (17) into (16), we are led to an expression that is formally quite similar to the result obtained in the Tikhonov case,

  α a = − K^T δL/δD^ρ.   (18)

While at first sight this relation is also amenable to being written as a linear relation for a, it is actually fundamentally different from (12), since due to its non-linear nature a enters δL/δD^ρ via component-wise exponentiation. It is here, when attempting to make a statement about such a non-linear relation with the tools of linear algebra, that we run into difficulties. What do I mean by that: let us push ahead and introduce the SVD decomposition of the transpose kernel as before,

  α a = − U Σ V^t δL/δD^ρ.   (19)

At first sight this relation seems to imply that the vector a, which encodes the deviation from the default model (this time multiplicatively), is restricted to the SVD subspace spanned by the first N_τ columns of the matrix U. My claim (as put forward most recently in [8]) is that this conclusion is false, since this linear-algebra argument is not applicable when working with (19). Let us continue and set up the corresponding SVD parametrization advocated e.g. in [7],

  a_l = Σ_{k=1}^{N_τ} c_k U_lk,   (20)
  ρ_l = m_l exp( Σ_{k=1}^{N_τ} c_k U_lk ).   (21)

In contrast to (15), the SVD space is not all there is to (21) (see explicit computations in Appendix A).
This we can see by taking the (N_τ+1)-th column of the matrix U, exponentiating it component-wise and applying the matrix K to it. In general we get

  Σ_l K_il exp[ (U_{N_τ+1})_l ] ≠ 0.   (22)

This means that if we add additional columns of the matrix U to the parametrization in (21), they do not automatically project into the null space of the kernel (see the explicit example in the appendix of this manuscript) and thus will contribute to the likelihood. In turn the corresponding parameter c_j related to that column will not automatically come out to be zero in the minimization procedure (7). Hence we cannot a priori disregard its contribution, and thus the contribution of this direction of the search space, which is not part of the SVD subspace. We thus conclude that limiting the solution space in the MEM to the singular subspace amounts to an ad-hoc procedure, motivated by an incorrect application of linear-algebra arguments to a fully non-linear optimization problem.
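The central point, that component-wise exponentiation destroys the null-space property, can be demonstrated in a few lines. The kernel is again an illustrative assumption:

```python
import numpy as np

# Counterexample sketch: the component-wise exponentiation of a
# null-space column of K does NOT stay in the null space.
N_omega, N_tau = 200, 5
omega = np.linspace(0.1, 20.0, N_omega)
tau = np.linspace(0.0, 1.0, N_tau)
K = np.exp(-np.outer(tau, omega))

U, _, _ = np.linalg.svd(K.T)
col = U[:, N_tau]                      # column N_tau + 1 of U

lin = np.linalg.norm(K @ col)          # linear use:       ~ 0
nonlin = np.linalg.norm(K @ np.exp(col))  # exponentiated use: O(1), not 0
print(lin, nonlin)
```

The exponentiated column produces nonzero synthetic data and therefore does contribute to the likelihood, in contrast to the linear Tikhonov case.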
If we take a look e.g. at Eq. 12 in arXiv:2001.10205, we find the same fallacy as in Bryan's original paper, i.e. the non-linear character of the parametrization of ρ (Eq. 7 in that paper) is not taken into account. We emphasize that one does not apply a column of the matrix U itself to the matrix K but the component-wise exponentiation of this column. This operation does not project into the null space. In those cases where we have only few pieces of reliable prior information and our datasets are limited, restricting to the SVD subspace may lead to significantly distorted results (as shown explicitly in [10]). On the other hand its effect may be mild if the default model already encodes most of the relevant features of the final result and the number of datapoints is large, so that the SVD subspace is large enough to (accidentally) encompass the true Bayesian solution sought after in (7). Independent of the severity of the problem, the artificial restriction to the SVD subspace in Bryan's MEM is a source of systematic error, which needs to be accounted for when the MEM is deployed as a precision tool for inverse problems.
Being liberated from the SVD subspace does not lead to any conceptual problems either. We have brought to the table N_τ points of data, as well as N_ω points of prior information in the form of the default model m (as well as its uncertainty α). This is enough information to determine the N_ω parameters ρ_l uniquely, as proven in [5].
Recognizing that linear-algebra arguments fail in the MEM setup also helps us to understand some of the otherwise perplexing results found in the literature. If the singular subspace were all there is to the parametrization of ρ_l in (21), then it would not matter whether we use the first N_τ columns of U or just use the N_τ columns of K^T directly. Both encode the same target space, the difference being only that the columns of U are orthonormal. However, as was clearly seen in Fig. 28 of [9], using the SVD parametrization or the columns of K^T leads to significantly different results in the reconstructed features of ρ. If the MEM were a truly linear problem, these two parametrizations would give exactly the same result. The finding that they do not agree emphasizes that the MEM inversion is genuinely non-linear and that the restriction to the SVD subspace is ad hoc.

C. Numerical evidence for the inadequacy of the SVD subspace
Let us construct an explicit example to illustrate the fact that the solution of the MEM reconstruction may lie outside of the SVD search space. Since Bryan's derivation of the SVD subspace proceeds independently of the particular form of the kernel K, the provided data D and the choice of the default model m, we are free to choose them at will. For our example we consider a transform often encountered among inverse problems related to the physics of strongly correlated quantum systems.
One then has K(τ, ω) = 1/(ω² + τ²) and we may set m = 1. With α entering simply as a scaling factor in (19), we do not consider it further in the following. Let me emphasize again that the arguments leading to (19) did not make any reference to the data we wish to reconstruct. Here we will consider three datapoints that encode a single delta peak at ω = ω_0, embedded in a flat background. Now let us discretize the frequency domain between ω_min = 1/2000 and ω_max = 1000 with N_ω = 2000 points. Together with the choice of τ_min = 0, τ_max = 0.2 and N_τ = 3 this fully determines the kernel matrix K_il in (2). Three different mock spectra are considered, which differ in the position ω_0 of the delta peak (see Fig. 1 left). Bryan's argument states that in the presence of three datapoints (see Fig. 1 right), irrespective of the data encoded in those datapoints, the extremum of the posterior must lie in the space spanned by the exponentiation of the first three columns of the matrix U, obtained from the SVD of the transpose kernel K^t. In Fig. 2 these first three columns are explicitly plotted. Note that while they do show some peaked behavior around the origin, they quickly flatten off above ω = 10. From this inspection by eye it already follows that it will be very difficult to linearly combine U_1(ω), U_2(ω) and U_3(ω) into a sharply peaked function, especially for a peak located at ω > 10.
Assuming that the data come with constant relative error ∆D/D = κ, let us find out how well we can reproduce them within Bryan's search space. A minimization carried out by Mathematica (see the explicit code in Appendix B) tells us that L^1_min ≈ 10^6 κ^{-2}, L^2_min ≈ 10^9 κ^{-2} and L^3_min ≈ 10^10 κ^{-2}. I.e. we clearly see that we are not able to reproduce the provided datapoints well (i.e. χ²/d.o.f. = L/3 ≫ 1) and that the deviation becomes more and more pronounced the higher the delta peak is positioned along ω. In the full search space on the other hand, we can always find a set of ρ_l's which brings the likelihood as close to zero as desired.
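A rough Python analogue of the Appendix-B minimization illustrates the same failure. The peak position ω_0 = 100, the simple relative-error likelihood and the optimizer choice are assumptions of this sketch, not the paper's actual Mathematica setup:

```python
import numpy as np
from scipy.optimize import minimize

# Fit quality inside Bryan's 3-dim SVD subspace for a delta peak on a
# flat background; omega_0 = 100 is an assumed peak position.
N_omega, N_tau = 2000, 3
omega = np.linspace(1 / 2000, 1000.0, N_omega)
tau = np.linspace(0.0, 0.2, N_tau)
K = 1.0 / (omega[None, :] ** 2 + tau[:, None] ** 2)
U, _, _ = np.linalg.svd(K.T)

rho_mock = np.full(N_omega, 1.0 / N_omega ** 2)   # flat background
rho_mock[np.argmin(np.abs(omega - 100.0))] = 1.0  # delta peak
D = K @ rho_mock

def L_bryan(c):   # rho = exp(U[:, :3] @ c), i.e. (21) with m = 1
    rho = np.exp(np.clip(U[:, :3] @ c, -300.0, 300.0))
    return np.sum((K @ rho - D) ** 2 / D ** 2)     # kappa = 1 units

res = minimize(L_bryan, np.zeros(3), method="Nelder-Mead")
print(res.fun)   # stays far from zero: data cannot be reproduced
# In the full search space, rho_mock itself reproduces D exactly.
```

Even after optimization the relative misfit stays far above χ²/d.o.f. ≈ 1, while the full-space parameter set reproduces the data exactly by construction.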
Minimizing the likelihood of course is not the whole story in a Bayesian analysis, which is why we have to take a look at the contributions from the regulator term S. Remember that by definition it is negative. We find that for all three mock spectra, Bryan's best fit of the likelihood leads to values of S ≈ −0.23. On the other hand a value of S ≈ −200 is obtained in the full search space for the parameter set ρ_l which contains only one entry that is unity and all other entries set to the background at 1/N_ω². From the above inspection of the extremum of L and the associated value of S inside and outside of Bryan's search space we arrive at the following: due to the very limited structure present in the SVD basis functions U_1(ω), U_2(ω) and U_3(ω), it is in general not possible to obtain a good reconstruction of the input data. This in turn leads to a large minimal value of L accessible within the SVD subspace. Since S cannot compensate for a large value of L, we have constructed an explicit example where at least one set of ρ_l's (the one that brings L close to zero in the full search space) leads to a smaller value of L − S and thus to a larger posterior probability than any of the ρ_l's within the SVD subspace.
I.e. we have constructed a concrete example in which the MEM solution, given by the global extremum of the posterior, is not contained in the SVD subspace.

A. Staying within MEM
Having identified the shortcoming of Bryan's MEM as the ad-hoc restriction of the search space to the SVD subspace, we must consider how to overcome it. The most naive path is to simply add additional columns of the matrix U to the non-linear parametrization (21), as proposed in 2011 in [10] and successfully applied in practice in [11]. This procedure systematically approaches the full search space, in which the unique Bayesian solution is located.
We are surprised by the statements made in arXiv:2001.10205, which claim that the numerics presented in [10], and thus also in [11], are unreliable. Both papers have been peer-reviewed and the underlying code has been freely available for many years on the author's website (ExtMEM link). Furthermore the examples presented in [10] are explicitly parametrized and thus lend themselves to straightforward reproduction.
Investigating in detail the shape of the SVD basis functions, it was found that they may not offer the same resolution for features located at different ω's. In practice this means that one often needs to add a large number of additional SVD basis functions before the structures encoded in the data D_i are sufficiently resolved. In order to remedy this situation, we can exploit the fact that we are not bound to the SVD subspace and instead, as suggested in [12], deploy e.g. Fourier basis functions in (21), i.e. we replace the columns of U by a linear combination of sin and cos terms. This proposal has been shown to lead to a resolution capacity of the MEM that is independent of the position along ω of the structures encoded in ρ, and it has been put into practice e.g. in [13].
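The resolution argument can be illustrated by asking how well a narrow structure at large ω is represented by three basis functions of either type. The Gaussian target, its width, and the choice of a pure-sine Fourier basis are illustrative assumptions of this sketch:

```python
import numpy as np

# Representation of a narrow structure at large omega: the 3 SVD
# functions of the example kernel vs. 3 low-order Fourier modes.
N_omega = 2000
omega = np.linspace(1 / 2000, 1000.0, N_omega)
tau = np.linspace(0.0, 0.2, 3)
K = 1.0 / (omega[None, :] ** 2 + tau[:, None] ** 2)
U, _, _ = np.linalg.svd(K.T)
svd_basis = U[:, :3]

x = (omega - omega[0]) / (omega[-1] - omega[0])
fourier_basis = np.column_stack([np.sin(np.pi * k * x) for k in (1, 2, 3)])

def residual(basis, omega0, width=20.0):
    """Relative least-squares misfit of a Gaussian bump at omega0."""
    target = np.exp(-0.5 * ((omega - omega0) / width) ** 2)
    coeff, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return np.linalg.norm(basis @ coeff - target) / np.linalg.norm(target)

# A bump at omega0 = 500 is essentially invisible to the SVD functions
# (they flatten off above omega ~ 10); the Fourier modes keep the same
# coarse resolution everywhere along omega.
print(residual(svd_basis, 500.0), residual(fourier_basis, 500.0))
```

Neither tiny basis resolves the bump sharply, but the SVD functions miss it almost entirely, while the Fourier modes capture at least part of it, independently of where along ω it sits.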
Thanks to the availability of highly efficient numerical optimization algorithms, such as the limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, it is nowadays easily possible to directly carry out the optimization task (7) in the full N_ω-dimensional search space, even if N_ω ∼ O(10^3). Together with the proof of uniqueness of the Bayesian solution and the related convexity of L − S, there does not exist any principal need to restrict to a low-dimensional subspace.
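A direct full-search-space MAP optimization with L-BFGS can be sketched as follows. The exponential kernel, mock data, diagonal covariance and hyperparameter values are illustrative assumptions; the Shannon-Jaynes sum stands in for the discretized regulator (9):

```python
import numpy as np
from scipy.optimize import minimize

# Full-search-space MAP estimate of L - S with the Shannon-Jaynes
# regulator; kernel, data and hyperparameters are assumptions.
N_omega, N_tau = 400, 8
omega = np.linspace(0.05, 20.0, N_omega)
tau = np.linspace(0.0, 1.0, N_tau)
K = np.exp(-np.outer(tau, omega)) * (omega[1] - omega[0])

m = np.full(N_omega, 0.1)                      # default model
alpha = 1.0
rho_true = 0.1 + np.exp(-0.5 * (omega - 4.0) ** 2)
D = K @ rho_true                               # noise-free mock data
sigma2 = (1e-3 * D) ** 2                       # diagonal covariance

def objective(a):                              # rho = m * exp(a) > 0
    rho = m * np.exp(a)
    r = K @ rho - D
    L = 0.5 * np.sum(r ** 2 / sigma2)
    S = alpha * np.sum(rho - m - rho * np.log(rho / m))
    return L - S

def gradient(a):                               # d(L - S)/da, chain rule
    rho = m * np.exp(a)
    r = K @ rho - D
    return rho * (K.T @ (r / sigma2) + alpha * a)

res = minimize(objective, np.zeros(N_omega), jac=gradient,
               method="L-BFGS-B")
rho_map = m * np.exp(res.x)                    # positive by construction
print(res.fun)
```

The parametrization ρ = m exp(a) here enforces positivity over all N_ω degrees of freedom at once; no subspace restriction is needed for the optimizer to converge.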

B. Beyond MEM
An active community is working on devising improved Bayesian approaches to inverse problems in the sciences. On the one hand there exist works that go beyond the maximum a posteriori approach and proceed towards sampling the posterior, such as the stochastic analytic continuation method [14], of which the MEM is but one special limit, as shown in [15]. Together with the SOM method presented in [16], these stochastic methods have e.g. been deployed in the study of nuclear matter at high temperatures in [17]. Recently the community has seen heightened activity in exploring the use of neural networks for the solution of inverse problems, e.g. in [18-21]. In the remainder of this manuscript let me focus on one recent Bayesian approach, the BR method, presented in [4] with regulator (10), which was designed with the particular one-dimensional inverse problem of (2) in mind. The motivation to develop this new method on the one hand arose from the observation that the specific form of the Shannon-Jaynes regulator of the MEM can pose a problem in finding the optimal Bayesian solution. Consider the (negative of the) integrands of (8), (9) and (10), plotted for an arbitrary choice of α = 0.1 and m = 1 in the left panel of Fig. 3. By construction, all of them have an extremum at ρ = m and, as expected, only the Tikhonov regulator allows ρ to take on values smaller than zero. Both the MEM and BR regulators diverge as ρ → ∞, but their behavior close to ρ = 0 differs markedly. Let us have a closer look in the right panel of Fig. 3. Plotted in log-log scale, we see that while the BR regulator diverges as ρ → 0, the Shannon-Jaynes entropy just flattens off and intercepts the y-axis at a finite value. It is this flattening off that in practice can lead to very slow convergence of the deployed optimization algorithms, as the MEM wanders about in this flat direction.
The second reason was that the MEM originally arose in the context of two-dimensional astronomical image reconstruction, and the assumptions that enter its construction make reference specifically to this two-dimensional nature of the inverse problem. Here instead we are interested in a simple one-dimensional inverse problem, to which at least one of the axioms underlying the MEM does not directly apply. As laid out in detail in [4], we started by replacing that axiom with a generic smoothness axiom and introduced a scale-invariance axiom, to arrive at the BR method with its regulator given in (10). The form of this regulator differs in important aspects from the Shannon-Jaynes entropy. Note that it contains only ratios of the functions ρ and m. Since both quantities carry the same units, this implies that the value of the integral does not depend on the units assigned to the spectral function. In contrast, the Shannon-Jaynes integrand, and thus the integral, depends on the specific choice of units in ρ. In addition, the logarithmic term in S_BR is not multiplied by the function ρ. This changes the behavior of the integrand for ρ → 0, making it diverge. I.e. the BR regulator avoids the flat direction encountered in S_MEM and thus shows much better convergence properties for functions ρ which contain extended ranges where their values are parametrically much smaller than in the dominant contributions.
A straightforward implementation of the BR method in a general Bayesian context has recently been introduced in [22]. Using the MC-STAN Monte-Carlo library, it has been shown how to sample the posterior distribution of the BR method in the full N_ω-dimensional search space and thus how to access the full information encoded in it. The maximum a posteriori solution considered in the literature so far only captures the maximum of this distribution. Not only does a full Bayesian implementation of the BR method allow for a self-consistent treatment of the hyperparameter α, it also provides the complete uncertainty budget from the spread of the posterior.

IV. SUMMARY
We have recapitulated in this paper the arguments underlying Bryan's MEM, which we show to be flawed, as they resort to methods of linear algebra when treating an inherently non-linear optimization problem. Therefore, even though the individual steps in the derivation of the SVD subspace are all correct, they do not apply to the problem at hand and their conclusion can be disproven with a direct counterexample. The counterexample utilizes the fact that the component-wise exponentiated columns of the matrix U do not project into the null space of the kernel when computing synthetic data. After establishing that the restriction to the SVD subspace is an ad-hoc procedure, we discussed possible ways to overcome it, suggesting either to systematically extend the search space within the MEM or to abandon the MEM in favor of one of the many modern Bayesian approaches developed over the past two decades.