Unification of field theory and maximum entropy methods for learning probability densities

The need to estimate smooth probability distributions (a.k.a. probability densities) from finite sampled data is ubiquitous in science. Many approaches to this problem have been described, but none is yet regarded as providing a definitive solution. Maximum entropy estimation and Bayesian field theory are two such approaches. Both have origins in statistical physics, but the relationship between them has remained unclear. Here I unify these two methods by showing that every maximum entropy density estimate can be recovered in the infinite smoothness limit of an appropriate Bayesian field theory. I also show that Bayesian field theory estimation can be performed without imposing any boundary conditions on candidate densities, and that the infinite smoothness limit of these theories recovers the most common types of maximum entropy estimates. Bayesian field theory is thus seen to provide a natural test of the validity of the maximum entropy null hypothesis. Bayesian field theory also returns a lower entropy density estimate when the maximum entropy hypothesis is falsified. The computations necessary for this approach can be performed rapidly for one-dimensional data, and software for doing this is provided. Based on these results, I argue that Bayesian field theory is poised to provide a definitive solution to the density estimation problem in one dimension.

Bayesian field theory and maximum entropy are two methods for learning smooth probability distributions (a.k.a. probability densities) from finite sampled data. Both methods were inspired by statistical physics, but the relationship between them has remained unclear. Here I show that Bayesian field theory subsumes maximum entropy density estimation. In particular, the most common maximum entropy methods are shown to be limiting cases of Bayesian inference using field theory priors that impose no boundary conditions on candidate densities. This unification provides a natural way to test the validity of the maximum entropy assumption on one's data. It also provides a closer-fitting nonparametric density estimate when the maximum entropy assumption is rejected.

Research in nearly all fields of science routinely calls for the estimation of smooth probability densities from finite sampled data [1,2]. Even in one dimension, however, basic questions remain about how best to accomplish this task. For instance, statistical physics has inspired two disparate approaches to this problem: maximum entropy density estimation [3,4] and Bayesian field theory [5][6][7][8][9][10][11][12][13]. Despite their common origin, the relationship between these approaches has remained unclear.
Maximum entropy (MaxEnt) density estimation is carried out as follows. One first uses the sampled data to estimate the values for a chosen set of moments, e.g., mean and variance. Typically, all moments up to some chosen order are selected [4,14]. The probability density that matches these moments while having the maximum possible entropy is then adopted as one's estimate. All other information in the data is discarded. One can therefore think of the MaxEnt estimate as a null hypothesis reflecting the assumption that there is no useful information in the data beyond the values of the specified moments [15].
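As a concrete illustration (hypothetical data; not the paper's software): on the real line, the MaxEnt density constrained to match a sample's first two moments is simply the Gaussian with those moments. A minimal Python sketch:

```python
import numpy as np

# Hypothetical sample. On the real line, the MaxEnt density matching the
# sample mean and variance is the Gaussian with those same moments; all
# other structure in the data is discarded.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=10_000)

mu, var = data.mean(), data.var()  # the moments retained by the estimate

def q_maxent(x):
    # Gaussian matching the estimated first and second moments
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Check normalization and moment matching on a fine grid
xs = np.linspace(mu - 12.0, mu + 12.0, 4801)
h = xs[1] - xs[0]
mass = np.sum(q_maxent(xs)) * h        # should be ~1
mean = np.sum(xs * q_maxent(xs)) * h   # should reproduce the sample mean
```

Any information beyond the retained moments (skewness, bimodality, etc.) is absent from the estimate by construction, which is what makes MaxEnt a natural null hypothesis.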
In the Bayesian field theory approach, one first defines a prior on the space of continuous probability densities. This prior is formulated using a scalar field theory that favors smooth probability densities over rugged ones. The data are then used to compute a Bayesian posterior, from which a maximum a posteriori (MAP) density estimate is obtained. Although such field theory priors invoke an explicit length scale ℓ when quantifying smoothness, the optimal value for ℓ can be learned from the data in a natural way [5][6][7].
Its scale-free nonparametric Bayesian formulation arguably makes this field theory approach the most principled way to estimate probability densities from sampled data. However, the field theory priors considered in [5][6][7] imposed periodic boundary conditions on candidate densities. These boundary conditions limit the types of data sets for which such priors are appropriate. MaxEnt, by contrast, does not impose any boundary conditions on estimated densities.
Here I describe a class of Bayesian field theory priors that have no boundary conditions. These yield density estimates that, like standard MaxEnt estimates, match the first few moments of the data. The MaxEnt estimate itself is recovered when the smoothness length scale ℓ is infinite. More generally, any MaxEnt density estimate can be recovered from Bayesian field theory in the infinite smoothness limit. One need only choose a field theory prior that defines "smoothness" appropriately. This connection provides a natural way to test the validity of any MaxEnt hypothesis. If field theory density estimation identifies ℓ = ∞ as being optimal for one's data set, the corresponding MaxEnt hypothesis is validated. If instead the optimal ℓ is finite, the MaxEnt hypothesis is rejected and a nonparametric density estimate having lower entropy is obtained.
A predictor-corrector algorithm allows the optimal value for ℓ, as well as the corresponding density estimate, to be rapidly computed. Moreover, perturbation theory in the ℓ → ∞ limit reveals an analytic criterion for rejecting the MaxEnt hypothesis. Software for carrying out this analysis in one dimension is available at https://github.com/jbkinney/14_maxent.
Bayesian field theory - These results are elaborated in the context of one-dimensional density estimation. Suppose we are given N data points x_1, x_2, …, x_N sampled from a smooth probability density Q_true(x) that is confined to an interval of length L. We wish to estimate Q_true from these data.
Following [7], we represent each density Q(x) in terms of a real field φ(x) via

Q(x) = e^{−φ(x)} / ∫ dx′ e^{−φ(x′)} .    (1)

This parametrization ensures that Q is positive and normalized [16]. We then adopt a field theory prior on φ. Specifically, we consider priors of the form

p(φ|ℓ) = e^{−S_0[φ]} / Z_0 ,   S_0[φ] = ∫ dx (ℓ^{2α} / 2L) (∂^α φ)^2 .    (2)

Here, S_0 plays the role of an action in statistical field theory, Z_0 = ∫ Dφ e^{−S_0[φ]} is the corresponding partition function, ℓ is a length scale below which fluctuations in φ are strongly damped, and α is a positive integer of our choosing. After some calculation [17], the resulting Bayesian posterior on Q is found to be consistent with a posterior on φ of the form

p(φ|data, ℓ) = e^{−S_ℓ[φ]} / Z_ℓ ,    (3)

where

S_ℓ[φ] = ∫ dx [ (ℓ^{2α} / 2L) (∂^α φ)^2 + N R φ + (N/L) e^{−φ} ]    (4)

is a nonlinear action, Z_ℓ is the corresponding partition function, and R(x) = (1/N) Σ_{i=1}^N δ(x − x_i) is the empirical data density.

Eliminating boundary conditions - The MAP field, here denoted by φ_ℓ, minimizes the action S_ℓ[φ]. To obtain a differential equation for φ_ℓ, previous work [5][6][7] imposed periodic boundary conditions on φ and used integration by parts to derive

(−1)^α ∂^{2α} φ_ℓ + (N / ℓ^{2α}) (L R − e^{−φ_ℓ}) = 0 .    (5)

With the assumed boundary conditions in place, this differential equation has a unique solution. However, imposing these boundary conditions amounts to assuming that Q_true(x) is the same at both ends of the x-interval. It is not hard to imagine data sets for which this assumption would be problematic. This imposition of boundary conditions, however, is unnecessary. From Eq. (4) alone we can derive a differential equation for φ_ℓ that has a unique solution.
To do so, we first define a differential operator Δ_α by the requirement that

∫ dx ψ (Δ_α φ) = ∫ dx (∂^α ψ)(∂^α φ)    (6)

for any two fields φ and ψ. This operator is Hermitian [18] even without boundary conditions on φ and ψ. We refer to Δ_α as the "bilateral Laplacian of order α." Setting δS_ℓ/δφ = 0 then gives the differential equation

Δ_α φ_ℓ + (N / ℓ^{2α}) (L R − e^{−φ_ℓ}) = 0 .    (7)

Any solution of Eq. (7) also satisfies Eq. (5) in the interior of the x-interval. However, Eq. (7) also specifies the behavior of φ_ℓ at the boundary in a way that Eq. (5) by itself does not. These extra constraints render the solution of Eq. (7) unique, whereas Eq. (5) is degenerate without the additional assumption of boundary conditions on φ_ℓ. The reason for this difference between the standard Laplacian and the bilateral Laplacian is perhaps clearest in the discrete representation. Let us restrict our attention to G grid points evenly spaced throughout the x-interval. The field φ becomes a G-dimensional vector, and the derivative operator ∂_G becomes a (G − 1) × G matrix having elements (∂_G)_{ij} = (G/L)(−δ_{i,j} + δ_{i+1,j}). Similarly, the standard α-order Laplacian (−1)^α ∂^{2α} is given by (−1)^α ∂_{G−2α+1} ⋯ ∂_{G−1} ∂_G, a (G − 2α) × G matrix. Eq. (5) therefore provides only G − 2α equations for the G unknown values of φ_ℓ at the grid points; 2α boundary conditions are thus needed to obtain a unique solution. By contrast, the α-order bilateral Laplacian is the G × G matrix

Δ_α = [∂_{G−α+1} ⋯ ∂_{G−1} ∂_G]^T [∂_{G−α+1} ⋯ ∂_{G−1} ∂_G] .

Eq. (7) therefore provides G equations for the G unknown elements of φ_ℓ, and has a unique solution without additional constraints.
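The discrete construction just described can be sketched in a few lines of Python (a hand-rolled illustration, not the paper's software). The bilateral Laplacian is built as the transpose-product of an α-fold chain of difference matrices; it is symmetric (Hermitian) by construction, and applying it to discretized polynomials of order α − 1 gives zero, confirming the advertised kernel with no boundary conditions imposed:

```python
import numpy as np

def deriv_matrix(G, L):
    """Discrete derivative: a (G-1) x G matrix with rows (G/L) * (-1, +1)."""
    D = np.zeros((G - 1, G))
    idx = np.arange(G - 1)
    D[idx, idx] = -G / L
    D[idx, idx + 1] = G / L
    return D

def bilateral_laplacian(G, L, alpha):
    """Order-alpha bilateral Laplacian: Delta = D_alpha^T @ D_alpha, where
    D_alpha is the alpha-fold product of successive derivative matrices."""
    D = np.eye(G)
    for k in range(alpha):
        D = deriv_matrix(G - k, L) @ D   # shapes chain down to (G-alpha) x G
    return D.T @ D                        # G x G, symmetric

G, L, alpha = 50, 1.0, 2
Delta = bilateral_laplacian(G, L, alpha)
```

Because each difference matrix annihilates the next-lower-order polynomial, the product D_α kills all polynomials of order α − 1 exactly, so Δ_α has an α-dimensional kernel while still supplying G equations per grid field.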
Connection to maximum entropy - Integrating Eq. (7) gives ∫ dx e^{−φ_ℓ} = L, and so the MAP density is Q_ℓ = e^{−φ_ℓ}/L. Multiplying Eq. (7) on the left by polynomials of order α − 1 and integrating reveals that Q_ℓ matches the first α moments of the data, i.e.,

∫ dx x^k Q_ℓ(x) = (1/N) Σ_{i=1}^N x_i^k   for k = 0, 1, …, α − 1 .    (8)

This moment-matching behavior results from the kernel of Δ_α being spanned by polynomials of order α − 1.
Eq. (8) holds exactly in the discrete representation. At ℓ = ∞, the MAP field φ_∞ is restricted to the kernel of the bilateral Laplacian. The corresponding density thus has the form

Q_∞(x) = (1/L) exp( − Σ_{k=0}^{α−1} a_k x^k ) ,    (9)

where the values of the coefficients a_k are determined by Eq. (8). Q_∞ is therefore identical to the MaxEnt density that matches the first α moments of the data [4]. This equivalence again holds in the discrete representation.
Choosing the length scale - To determine the optimal value for ℓ, we compute the evidence p(data|ℓ) = ∫ Dφ p(data|φ) p(φ|ℓ) [5,6,20,21]. This quantity is infinitesimal when α > 1, due to p(Q|ℓ) being an improper prior [22]. However, the evidence ratio E = p(data|ℓ) / p(data|∞) is finite for all ℓ > 0. Using a Laplace approximation, which is valid for large N, we find the expression for E given in Eq. (10), where η = N (L/ℓ)^{2α} [23]. By construction, the evidence ratio E approaches unity in the large ℓ limit. Whether this limiting value is approached from above or below is relevant to the question of whether the MaxEnt hypothesis is optimal. Using perturbation theory about η = 0 (ℓ = ∞), we find that

ln E = K η + O(η^2) ,    (11)

where the coefficient K is given by Eq. (12) [24]. Here, λ_i and ψ_i(x) (i = 1, 2, …) respectively denote the eigenvalues and eigenfunctions of L^{2α} Δ_α, and are indexed so that λ_i = 0 for i ≤ α. The eigenfunctions are normalized so that ∫ dx L^{−1} ψ_i ψ_j = δ_{ij}, and in the degenerate subspace (i, j ≤ α) they are oriented so that ∫ dx Q_∞ ψ_i ψ_j = δ_{ij} ζ_j for some positive real numbers ζ_j.
The other indexed quantities appearing in Eq. (12) are defined in the supplemental derivation below. Eq. (12) provides a plug-in formula that can be used to assess the validity of the MaxEnt hypothesis. If K > 0, there is guaranteed to be a finite value of ℓ that has larger evidence than ℓ = ∞. In this case the MaxEnt estimate is seen to be non-optimal. On the other hand, if K < 0, then ℓ = ∞ will be a local optimum that may or may not be a global optimum as well. Numerical computation of E over all length scales is needed to resolve this ambiguity.
Numerical examples - In practice this density estimation procedure is carried out on a grid spanning the interval of interest. First, a predictor-corrector homotopy algorithm [25] is used to compute the value of φ_ℓ over all length scales [26]. As in [7], this computation can be performed rapidly and deterministically due to the sparsity of the discretized Δ_α and the convexity of S_ℓ. The result is a one-parameter family of moment-matching densities that smoothly interpolates between the MaxEnt density at ℓ = ∞ and the data histogram at ℓ = 0 (Fig. 1b). The value of the evidence ratio is then computed over all length scales using Eq. (10). The optimal length scale, which we denote ℓ*, is thereby identified (Fig. 1c).
The validity of the MaxEnt assumption depends on the data in hand. Sometimes ℓ* = ∞, in which case the MaxEnt hypothesis is deemed optimal. When ℓ* is finite, however, the estimated density Q_ℓ* fits the data more closely and has lower entropy than Q_∞ (Figs. 1a,d). This improved fit and reduced entropy reflect the use of additional information in the data beyond the first α moments.
Bayesian field theory can therefore be used to test whether Q_true has a hypothesized functional form. For example, the value of ℓ* that results from using α = 3 provides a test of whether Q_true is Gaussian. Indeed, when using α = 3 to analyze simulated data drawn from a Gaussian density, ℓ* = ∞ was obtained most of the time (Figs. 2a,d). By contrast, when data were drawn from a mixture of two Gaussians, the fraction of data sets yielding ℓ* = ∞ decreased sharply as the separation between the two Gaussians was increased (Figs. 2b,c,e,f). Other choices for α can be used to test other functional forms for Q_true.
The "empirical Bayes" method used to select ℓ* both here and in previous work [5,6] is arguably inferior to the fully Bayesian approach of [7]. However, the fully Bayesian approach requires a length scale prior p(ℓ) that obscures the nontrivial and potentially useful large-ℓ behavior of the evidence ratio E. In particular, the K coefficient in Eq. (12) might provide a useful way to test the functional form of Q_true when the direct computation of E over all length scales is not feasible. In the simulations performed for Fig. 2, the sign of K performed well as a proxy for the finiteness of ℓ*: for Fig. 2d (2e), K < 0 was found for 100% (95%) of the ℓ* = ∞ data sets, whereas K > 0 was found for 96% (69%) of the finite-ℓ* data sets; K > 0 and finite ℓ* were found for all of the data sets analyzed in Fig. 2f.
Summary and discussion - Bialek et al. [5] showed that probability density estimation can be formulated as a Bayesian inference problem using field theory priors. Among other results, [5] derived the action in Eq. (4) and showed how to use a Laplace approximation of the evidence to select the optimal smoothness length scale [27]. However, periodic boundary conditions were imposed on candidate densities in order to transform the standard Laplacian into a Hermitian operator. The MaxEnt density typically violates these boundary conditions, and as a result the ability of Bayesian field theory to subsume MaxEnt density estimation went unrecognized in [5] and in follow-up studies [6,7].
Here we have seen that boundary conditions on candidate densities are unnecessary. The bilateral Laplacian, defined in Eq. (6), is a Hermitian operator that imposes no boundary conditions on functions in its domain, yet is equivalent to the standard Laplacian in the interior of the interval of interest. Using the bilateral Laplacian of various orders to define field theory priors, we recovered standard MaxEnt density estimates in cases where the smoothness length scale was infinite. We also obtained criteria for judging the appropriateness of the MaxEnt hypothesis on individual data sets.
More generally, Bayesian field theories can be constructed for any set of moment-matching constraints. One need only replace the bilateral Laplacian in the above equations with a differential operator that has a kernel spanned by the functions whose mean values one wishes to match to the data. The resulting field theory will subsume the corresponding MaxEnt hypothesis, and thereby allow one to judge the validity of that hypothesis. Analogous approaches for estimating discrete probability distributions can be formulated by replacing integrals over x with sums over states.
The elimination of boundary conditions removes a considerable point of concern with using Bayesian field theory for estimating probability densities. As demonstrated here and in [7], the necessary computations are readily carried out in one dimension. One issue not investigated here -the large N assumption used to compute the evidence ratio -can likely be addressed by using Feynman diagrams in the manner of [8].
How to choose an appropriate prior thus appears to be the primary issue standing in the way of a definitive practical solution to the density estimation problem in one dimension. In the author's opinion, this issue likely reflects a lingering ambiguity in what "density estimation" means. Different situations may ultimately call for the use of different priors, but understanding which situations call for which priors will require further work.
This field theory approach to density estimation readily generalizes to higher dimensions -at least in principle. Additional care is required in order to construct field theories that do not produce ultraviolet divergences [5], and the best way to do this remains unclear. The need for a very large number of grid points also presents a substantial practical challenge. Grid-free methods, such as those used by [10,28], may provide a way forward.
I thank Gurinder Atwal, Curtis Callan, William Bialek, and Vijay Kumar for helpful discussions. Support for this work was provided by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.

DERIVATION OF THE ACTION S_ℓ[φ] (EQ. 4)

Our derivation of the action S_ℓ[φ] in Eq. 4 of the main text is essentially that used in [1]. This derivation is not entirely straightforward, however, and the details of it have yet to be published.
The prior we consider is that given in Eq. 2 of the main text. As written, this prior is improper due to the α-dimensional kernel of Δ_α. To avoid unnecessary mathematical difficulties, we can render p(φ|ℓ) proper by considering a regularized form of the action, in which the kernel modes are penalized by an infinitesimal positive number ε. All quantities of interest, of course, should be evaluated in the ε → 0 limit. The prior over Q that results from Eq. (1) is then obtained by writing φ = φ_c + φ_nc(x), where φ_c is the constant Fourier component of φ and φ_nc(x) is the non-constant component.

The likelihood of Q given the data can be expressed in a somewhat non-obvious but nevertheless useful form. Using an identity that holds for any positive number a and positive integer N, we find an expression for the likelihood from which the joint probability of Q and the data follows; the action governing this joint probability is the action S_ℓ[φ] in Eq. 4 of the main text. The "evidence" for ℓ -- i.e., the probability of the data given ℓ -- then follows, with Z_ℓ the partition function corresponding to the action S_ℓ. The posterior distribution over Q is given by Bayes's rule. This motivates us to define the posterior distribution over φ as in Eq. (22). This definition of p(φ|data, ℓ) is consistent with the formula for p(Q|data, ℓ) obtained in Eq. (21). However, Eq. (22) violates Bayes's rule. This is not a problem, though, since φ itself is not directly observable; only Q is observable, and so only Eq. (21) is required for our mathematics to make sense. Put another way, Eq. (22) violates Bayes's rule only in the way that it specifies the behavior of φ_c. This constant component of φ, however, has no effect on Q.
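The integral identity invoked above is presumably the standard Gamma-function representation (stated here for convenience; the precise variant used in the original derivation may differ):

```latex
\frac{1}{a^{N}} \;=\; \frac{1}{(N-1)!}\int_{0}^{\infty} dt\; t^{\,N-1}\, e^{-a t},
\qquad a > 0,\quad N \in \mathbb{Z}_{>0},
```

which allows the N-th power of the normalization integral ∫ dx e^{−φ} appearing in the likelihood to be traded for an exponential that is linear in φ, at the cost of a single auxiliary integral.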
Note that all of the above calculations have been exact; no large N approximation was used. This contrasts with prior work [2,3], which used a large N Laplace approximation to derive the formula for S_ℓ[φ]. Also note that the regularization parameter ε has vanished from the formulas for the posterior distributions p(Q|data, ℓ) and p(φ|data, ℓ). However, this parameter still appears in the formula for the evidence p(data|ℓ), both explicitly and implicitly through the value of Z_0.

DERIVATION OF THE EVIDENCE RATIO E (EQ. 10)
In Eq. (19), we found that the evidence for a length scale ℓ is given in terms of the partition functions Z_ℓ and Z_0. We now turn to the task of evaluating these partition functions, so that we can compute this quantity. Defining Λ = L^{2α} Δ_α and η = N (L/ℓ)^{2α}, and using a Laplace approximation to compute the path integral, we obtain the expression for Z_ℓ in Eq. (30). In Eq. (30) we switched to the grid representation, in which fields become G-dimensional vectors and operators become G × G matrices. Note that the operator Λ is unitless and has eigenvalues that, if ranked in ascending order, approach limiting values as G → ∞. Also note that η is unitless. For these reasons, η will emerge as a natural perturbation parameter in the ℓ → ∞ limit.
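For reference, the Laplace (saddle-point) approximation used here has the generic finite-dimensional form below; the exact prefactors and measure conventions of Eq. (30) may differ:

```latex
Z_\ell \;=\; \int d^{G}\phi\; e^{-S_\ell[\phi]}
\;\approx\; e^{-S_\ell[\phi_\ell]}\,
\sqrt{\frac{(2\pi)^{G}}{\det\!\big(\Lambda + \eta\, e^{-\phi_\ell}\big)}}\,,
```

where Λ + η e^{−φ_ℓ} is the Hessian of the (suitably nondimensionalized) action evaluated at its minimum φ_ℓ; this Hessian is positive definite, so the Gaussian integral converges.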
Evaluating the partition function Z_0 requires more care, because the regularized form of the action S_0, given in Eq. (2), has to be used in order for the equations we derive to make sense. We obtain an expression in which, as in the main text, the subscript "row" on the determinant denotes restriction to the row space of Λ. Putting these values for Z_ℓ and Z_0 back into Eq. (25), we obtain the evidence. Although both Z_ℓ and Z_0 depend strongly on the number of grid points G, the value for the evidence does not. The evidence does, however, depend on the regularization parameter ε; specifically, it is proportional to ε^{(α−1)/2}. This is the basis for the claim in the main text that the evidence vanishes for α > 1.
In the large ℓ limit, η → 0, and the determinant reduces to its contribution from the kernel modes, where "ker" denotes restriction to the kernel of Λ. As a result, the ℓ → ∞ evidence follows, and the evidence ratio E is given by the ratio of these two expressions. Note that E, unlike the evidence itself, does not depend on the regularization parameter ε.

DERIVATION OF THE K COEFFICIENT (EQ. 12)
The goal of this section is to clarify the large length scale behavior of the evidence ratio E. To do this we expand ln E as a power series in η about η = 0. We will find that

ln E = K η + O(η^2) ,

where K is a nontrivial coefficient that can be either positive or negative, depending on the data. We carry out this expansion in three steps:

1. Expand φ_ℓ to first order in η.
2. Expand S_ℓ[φ_ℓ] to first order in η.
3. Expand ln det[Λ + η e^{−φ_ℓ}] to first order in η.
In what follows we will sometimes use the ket notation of quantum mechanics, primarily as a notational convenience. For any two functions f and g and any Hermitian operator H, we define

⟨f|g⟩ = (1/L) ∫ dx f(x) g(x) ,   ⟨f|H|g⟩ = (1/L) ∫ dx f(x) (Hg)(x) .

Note the convention of dividing by L in the inner product integral. This allows us to take inner products without altering units. The eigenvalues λ_i and corresponding eigenfunctions ψ_i(x) defined in the main text satisfy ⟨ψ_i|ψ_j⟩ = δ_{ij} for all i, j, and, in the degenerate subspace, ⟨ψ_i|L Q_∞|ψ_j⟩ = δ_{ij} ζ_j for some positive numbers ζ_j. We will also make use of several indexed quantities defined in terms of these eigenfunctions and the data.

Step 1: Expansion of φ_ℓ to first order in η.
We begin by representing φ_ℓ as a power series in η. Abusing notation somewhat, we write

φ_ℓ = φ_∞ + η φ_1 + η^2 φ_2 + ⋯ .

Plugging this expansion into the equation of motion, then collecting terms of equal order in η, we get a hierarchy of equations. At lowest order in η, we recover Λ φ_∞ = 0. This just reflects the restriction of φ_∞ to the kernel of Λ. At first order we get

Λ φ_1 = e^{−φ_∞} − L R = L (Q_∞ − R) ,

which we will proceed to investigate. To compute φ_1, we first write it in terms of the eigenfunctions of Λ: φ_1 = Σ_i A_i ψ_i.
Taking the inner product of Eq. (55) with ψ_i, we get λ_i A_i = ⟨ψ_i|L(Q_∞ − R)⟩. Since λ_i > 0 for i > α,

A_i = ⟨ψ_i|L(Q_∞ − R)⟩ / λ_i   for i > α.

As yet we have no information about the value of A_i for i ≤ α. For this we need to consider the second order terms in Eq. (53). At second order in η, we obtain Λ φ_2 = −L Q_∞ φ_1. Dotting this with ψ_j for j ≤ α, we obtain Σ_i A_i ⟨ψ_j|L Q_∞|ψ_i⟩ = 0. The coefficients left unspecified in Eq. (58) are therefore given by

A_j = −(1/ζ_j) Σ_{i>α} A_i ⟨ψ_j|L Q_∞|ψ_i⟩   for j ≤ α.

This completes our computation of φ_ℓ to first order in η.

Step 2: Expansion of S_ℓ[φ_ℓ] to first order in η.
The value of the action S_ℓ, evaluated at its minimum φ_ℓ, is given by Eq. (66). The first inner product term in Eq. (66) is evaluated using the eigenfunction expansion; here we have used the fact that λ_k = 0 for k ≤ α to restrict the sum to i, j > α in Eq. (68). The second inner product term in Eq. (66) is evaluated similarly. Here, the sum over i is restricted to i > α due to the moment-matching condition ⟨ψ_k|Q_∞ − R⟩ = 0 for k ≤ α. To first order in η, we therefore find a rather simple result for S_ℓ[φ_ℓ].

We pause to note that this moment-matching condition has interesting implications in the context of perturbation theory. For k ≤ α, the quantity ⟨ψ_k|Q_ℓ − R⟩ vanishes identically, and so this expression must vanish at each order in η. In a similar manner, considering the higher-order terms of Eq. (74) leads to nontrivial constraints on the higher-order corrections to φ_ℓ. We don't make use of this finding in what follows, but it seems worth noting.
Step 3: Expansion of ln det[Λ + η e^{−φ_ℓ}] to first order in η.

We first outline how we will go about computing ln det Γ, where Γ = Λ + η e^{−φ_ℓ}. Computing this quantity requires computing the eigenvalues of Γ. We will use γ_i and ρ_i to denote the eigenvalues and corresponding orthonormal eigenfunctions of Γ, i.e., Γ ρ_i = γ_i ρ_i. Our primary task is to compute each eigenvalue γ_i as a power series in η. This task, as we will see, also requires computing the eigenfunctions ρ_i as power series in η. As usual, it will help to expand the first-order correction ρ_i^1 in the eigenfunctions of Λ. Keeping in mind that λ_i > 0 for i > α and λ_j = 0 for j ≤ α, we see that while the larger eigenvalues of Γ need only be computed to first order in η, the α lowest eigenvalues must actually be computed to second order in η. Performing this second order calculation will require that we also compute the eigenfunctions ρ_i to first order in η.

Plugging Eq. (78) and Eq. (79) into Eq. (77) and collecting terms by order in η, we obtain a hierarchy of equations. From the zeroth order term, we recover the eigenvalue equation Λ ψ_i = λ_i ψ_i, which we already knew. From the first order term, dotting with ψ_k and choosing k = i, we recover the standard first order correction to the eigenvalues; in particular, γ_i^1 = ⟨ψ_i|e^{−φ_∞}|ψ_i⟩. Moreover, for i ≤ α, Eq. (90) gives the first-order corrections to the eigenfunctions. From the second order term in Eq. (85), focusing on i ≤ α and dotting with ψ_i, then using Eq. (93) to evaluate the resulting expression and switching indices i ↔ j so that i > α and j, k ≤ α, we find the required second order correction to γ_j. We can now compute ln det Γ. Plugging the values for γ_i^1 and γ_i^2 into Eq. (83), we get what we set out to find.

Putting it all together
Putting together our results from Eq. (73) and Eq. (102), and switching indices j ↔ k in the last term of Eq. (102) just to smooth out the notation, we obtain the expansion of ln E quoted in the main text, with the K coefficient given in Eq. 12 of the main text.

HOMOTOPY ALGORITHM
In the space of probability densities, the set of MAP densities forms a one-parameter curve extending from the MaxEnt density at ℓ = ∞ to the data histogram at ℓ = 0. The algorithm used to perform the computations shown in Figs. 1 and 2 of the main text approximates this "MAP curve" by computing the MAP densities at a finite set of length scales. This set of length scales extends from ℓ = 0 to ℓ = ∞ (inclusive), and is chosen so that, at any two neighboring length scales ℓ and ℓ′, the MAP densities Q_ℓ and Q_ℓ′ are sufficiently similar. By computing this "string of beads" approximation to the MAP curve, the homotopy algorithm allows one to compute the evidence ratio E over all length scales.
This algorithm proceeds as follows. First, the MaxEnt density Q ∞ is computed essentially as described by Ormoneit and White [4]. The details of this computation are reported below.
Next, the MAP density is computed at an intermediate length scale ℓ_0 = L/√G. This density, Q_{ℓ_0}, serves as the starting point for a predictor-corrector homotopy algorithm [5] that traces the MAP curve from ℓ_0 towards both larger and smaller length scales (Fig. S1a). The result is a finite set of length scales {0, ℓ_{−r}, ⋯, ℓ_{−2}, ℓ_{−1}, ℓ_0, ℓ_1, ℓ_2, ⋯, ℓ_m, ∞}, chosen so that the MAP densities at any two neighboring length scales ℓ and ℓ′ keep D_geo(Q_ℓ, Q_ℓ′) below a small user-specified distance, where D_geo is the geodesic distance measure [1,6]. The value 10^{−2} was used for the computations reported in the main text. Each step Q_{n′} → Q_n is accomplished in two parts (Fig. S1b). First, a "predictor step" is used to compute the new length scale ℓ_n as well as an approximation Q^{(0)}_n to the corresponding MAP density Q_n. A series of "corrector steps" is then used to compute a corresponding series of densities Q^{(1)}_n, Q^{(2)}_n, ⋯ that converges to Q_n. These predictor and corrector steps are described in more detail below.
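The geodesic distance criterion above can be sketched as follows. Here I take D_geo to be the Bhattacharyya angle between densities, 2 arccos ∫ dx √(Q Q′), which is one common convention for the geodesic distance of [1,6]; the exact normalization used in the paper's software may differ.

```python
import numpy as np

def d_geo(Q1, Q2, h):
    """Bhattacharyya-angle sketch of the geodesic distance between two
    densities sampled on a uniform grid with spacing h."""
    overlap = np.sum(np.sqrt(Q1 * Q2)) * h
    return 2.0 * np.arccos(np.clip(overlap, -1.0, 1.0))

# Two nearby (hypothetical) densities on a grid
xs = np.linspace(-5.0, 5.0, 1001)
h = xs[1] - xs[0]
Q1 = np.exp(-xs**2 / 2);         Q1 /= Q1.sum() * h
Q2 = np.exp(-(xs - 0.1)**2 / 2); Q2 /= Q2.sum() * h
```

A step along the MAP curve would be accepted only if the distance between neighboring densities stays below the user-specified threshold (10^{−2} in the main text).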

Computing the MaxEnt density
We saw in the main text that φ_∞ is a polynomial of order α − 1, i.e., φ_∞(x) = Σ_{k=0}^{α−1} a_k x^k. The problem of computing the MaxEnt density Q_∞ therefore reduces to finding the vector of coefficients a = (a_0, a_1, …, a_{α−1}) that minimizes the action

S_∞(a) = N Σ_{k=0}^{α−1} a_k μ_k + (N/L) ∫ dx exp( − Σ_{k=0}^{α−1} a_k x^k ) ,    (109)

where μ_i = ∫ dx R(x) x^i is the i'th moment of the data. As suggested by Ormoneit and White [4], we solve this optimization problem using the Newton-Raphson algorithm with backtracking. Starting from a = 0, we iterate

a_n → a_{n+1} = a_n + γ_n δa_n ,    (110)

where 0 < γ_n ≤ 1 and the vector δa_n is the solution to the Newton system Σ_j (∂²S_∞/∂a_i∂a_j) (δa_n)_j = −∂S_∞/∂a_i. From Eq. (109),

∂S_∞/∂a_i = N μ_i − (N/L) ∫ dx x^i exp( − Σ_k a_k x^k ) .

The Hessian matrix ∂²S_∞/∂a_i∂a_j is positive definite at all a, so a unique solution for δa_n can always be found [7]. The scalar γ_n is then chosen so that

S_∞(a_n + γ_n δa_n) − S_∞(a_n) < (γ_n/2) Σ_{i=0}^{α−1} (∂S_∞/∂a_i)|_{a_n} (δa_n)_i .    (113)
Specifically, γ_n is first set to 1, then is reduced by factors of 1/2 until Eq. (113) is satisfied. This "damping" of the Newton-Raphson method is sufficient to guarantee convergence [4,8]. The algorithm is terminated when the magnitude of the change in the action, |S_∞(a_{n+1}) − S_∞(a_n)|, falls below a specified tolerance.
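The damped Newton-Raphson procedure just described can be sketched as follows (a hand-rolled illustration with hypothetical data, not the paper's software; integrals are replaced by sums over a grid):

```python
import numpy as np

def maxent_density(data, alpha, xs):
    """Damped Newton-Raphson for the MaxEnt coefficients: minimize
    S(a) = N * sum_k a_k mu_k + (N/L) * integral exp(-sum_k a_k x^k)."""
    L = xs[-1] - xs[0]
    h = xs[1] - xs[0]
    N = len(data)
    mu = np.array([np.mean(data ** k) for k in range(alpha)])  # data moments
    P = np.stack([xs ** k for k in range(alpha)])              # (alpha, G)

    def action(a):
        return N * (a @ mu) + (N / L) * np.sum(np.exp(-(P.T @ a))) * h

    a = np.zeros(alpha)
    for _ in range(100):
        w = np.exp(-(P.T @ a)) * h / L        # quadrature weights
        grad = N * (mu - P @ w)               # gradient of the action
        hess = N * (P * w) @ P.T              # positive-definite Hessian
        da = np.linalg.solve(hess, -grad)     # Newton direction
        gamma, S0 = 1.0, action(a)
        while action(a + gamma * da) - S0 > 0.5 * gamma * (grad @ da):
            gamma *= 0.5                      # backtracking ("damping")
        a = a + gamma * da
        if abs(action(a) - S0) < 1e-9:
            break
    return a, np.exp(-(P.T @ a)) / L          # coefficients and Q_inf on grid

rng = np.random.default_rng(1)
data = np.clip(rng.normal(0.5, 0.2, size=5000), 0.0, 1.0)  # hypothetical
xs = np.linspace(0.0, 1.0, 2001)
a, Q = maxent_density(data, alpha=3, xs=xs)
```

At convergence the gradient vanishes, which is precisely the statement that Q matches the first α moments of the data under the same quadrature rule.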

Predictor step
The predictor step computes a change ℓ → ℓ′ in the length scale, as well as an approximation to the corresponding change φ_ℓ → φ_ℓ′ in the MAP field. Specifically, the predictor step returns a scalar δt and a function ρ(x) such that φ_ℓ′ ≈ φ_ℓ + δt ρ, where t = ln η is a numerically convenient reparametrization of the length scale ℓ. To determine the function ρ, we examine the equation of motion at ℓ′. The O(δt) term must vanish, and thus we obtain a linear equation for ρ:

(Λ + η e^{−φ_ℓ}) ρ = Λ φ_ℓ .

The scalar δt is then chosen so that the predicted change in the density approximately saturates the geodesic distance criterion, with the sign of δt chosen according to the direction we wish to traverse the MAP curve.

Corrector step
The purpose of the corrector step is to accurately solve the equation of motion,

Λ φ + η (L R − e^{−φ}) = 0 ,

at fixed length scale ℓ. This step is used initially to compute Q_{ℓ_0} at the starting length scale ℓ_0. It is also used to home in on the MAP density at each new length scale chosen by the predictor step of the homotopy algorithm.
As with the computation of the MaxEnt density, this corrector step uses the Newton-Raphson algorithm with backtracking. Starting from a hypothesized field φ^{(0)} (e.g., returned by the predictor algorithm), the iteration

φ^{(n)} → φ^{(n+1)} = φ^{(n)} + γ_n δφ^{(n)}

is repeatedly performed. The function δφ^{(n)} and scalar γ_n are chosen so that this iteration converges to the true field φ_ℓ. To derive the field perturbation δφ^{(n)}, we plug this formula for φ^{(n+1)} into the equation of motion. Keeping only terms of order δφ^{(n)} or less, we see that

Λ φ^{(n)} + Λ δφ^{(n)} + η (L R − e^{−φ^{(n)}} + e^{−φ^{(n)}} δφ^{(n)}) = 0 .
As before, γ_n is chosen so that the action decreases by a respectable amount compared to what is expected from the linear approximation. This iterative process is terminated when the magnitude of the change in the action, |S_ℓ[φ^{(n+1)}] − S_ℓ[φ^{(n)}]|, falls below a specified tolerance.
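A minimal grid-level sketch of this corrector step (a hand-rolled illustration with hypothetical data and length scale, not the paper's software):

```python
import numpy as np

def corrector(R, L, alpha, eta, max_iter=100):
    """Damped Newton-Raphson for the discretized MAP equation
    Lam @ phi + eta * (L * R - exp(-phi)) = 0, with Lam the (scaled)
    bilateral Laplacian built from successive difference matrices."""
    G = len(R)

    def deriv(n):  # (n-1) x n difference matrix
        return (G / L) * (np.eye(n)[1:] - np.eye(n)[:-1])

    D = np.eye(G)
    for k in range(alpha):
        D = deriv(G - k) @ D
    Lam = L ** (2 * alpha) * (D.T @ D)    # unitless operator Lambda

    def action(phi):
        return 0.5 * phi @ (Lam @ phi) + eta * (L * (R @ phi) + np.exp(-phi).sum())

    phi = np.zeros(G)
    for _ in range(max_iter):
        grad = Lam @ phi + eta * (L * R - np.exp(-phi))
        if np.linalg.norm(grad) < 1e-6 * eta:
            break
        J = Lam + eta * np.diag(np.exp(-phi))   # positive-definite Hessian
        dphi = np.linalg.solve(J, -grad)
        gamma, S0 = 1.0, action(phi)
        while action(phi + gamma * dphi) - S0 > 0.5 * gamma * (grad @ dphi):
            gamma *= 0.5                        # backtracking
        phi = phi + gamma * dphi
    return phi

rng = np.random.default_rng(2)
L, G, alpha, N = 1.0, 100, 2, 1000
data = np.clip(rng.normal(0.4, 0.15, N), 0.0, L)           # hypothetical
R, edges = np.histogram(data, bins=G, range=(0.0, L), density=True)
eta = N * (L / 0.1) ** (2 * alpha)    # assumes length scale ell = 0.1
phi = corrector(R, L, alpha, eta)
Q = np.exp(-phi) / L                  # MAP density on the grid
```

Because the kernel of the discrete bilateral Laplacian contains the constant and linear grid vectors (for α = 2), the converged Q automatically matches the normalization and first moment of the histogram, mirroring Eq. (8).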

DISCUSSION OF SILVERMAN (1982)
In statistics there is a class of nonparametric techniques called "maximum penalized likelihood" estimation [9]. The central idea behind these methods for estimating smooth functions is to maximize the likelihood function modified by a heuristic roughness penalty. In this context, Silverman proposed using −S_ℓ[φ] as the penalized likelihood function for probability density estimation [10]. This choice was motivated by the observation that, when ℓ = ∞, one recovers a moment-matching distribution having a familiar parametric form. This early work by Silverman is therefore relevant to the results reported here.