Relative Entropy Minimization over Hilbert Spaces via Robbins-Monro

One way of getting insight into non-Gaussian measures, posed on infinite dimensional Hilbert spaces, is to first obtain best fit Gaussian approximations, which are more amenable to numerical approximation. These Gaussians can then be used to accelerate sampling algorithms. This raises the questions of how one should measure optimality and how the optimizers can be obtained. Here, we consider the problem of minimizing the distance with respect to relative entropy. We examine this minimization problem by seeking roots of the first variation of relative entropy, taken with respect to the mean of the Gaussian, leaving the covariance fixed. Adapting a convergence analysis of Robbins-Monro to the infinite dimensional setting, we can justify the application of this algorithm and highlight necessary assumptions to ensure convergence, not only in the context of relative entropy minimization, but other infinite dimensional problems as well. Numerical examples in path space, showing the robustness of this method with respect to dimension, are provided.

Having a Gaussian approximation gives qualitative insight into $\mu$, as it provides a concrete mean and variance, which can be numerically approximated. Additionally, the optimized distribution can be used in exact sampling algorithms to improve performance. This benefits a number of applications, including path sampling for molecular dynamics and parameter estimation in statistical inverse problems. Note that $R$ is asymmetric in its arguments. The other ordering of the arguments has been explored in finite dimensions, [2,3,10,17].
To be of computational use, it is necessary to have an algorithm for computing this optimal distribution. In [14], this was accomplished by first expressing $\nu = N(m, C(p))$, where $m$ is the mean and $p$ is a parameter inducing a well defined covariance operator, and then solving
$$(m, p) \in \operatorname{argmin}\, R(N(m, C(p)) \,\|\, \mu). \tag{1.2}$$
Minimization was accomplished by using the Robbins-Monro algorithm (RM) to find a root of the first variation of (1.2). This strategy has also been used in finite dimensions, [2,3].
In this work, we emphasize the infinite dimensional problem. Indeed, following the framework of [14], we assume that µ is posed on the Borel σ-algebra of a separable Hilbert space H. For simplicity, we will leave the covariance operator C fixed, and only optimize over the mean, m.

Robbins-Monro.
Given an objective function $f : H \to H$, assume that it has a root, $x_\star$. In our application to relative entropy, $f$ will be the first variation of $R$. Further, we assume that we can only observe a noisy version of $f$, $F : H \times \chi \to H$, such that for all $x \in H$,
$$f(x) = E[F(x, \xi)]. \tag{1.3}$$
The Robbins-Monro iteration is
$$x_{n+1} = x_n - a_{n+1} F(x_n, \xi_{n+1}), \tag{1.4}$$
where $\xi_n \sim \mu_\xi$ are independent and identically distributed (i.i.d.), and $a_n > 0$ is a carefully chosen sequence. Subject to assumptions on $f$, $F$, and the distribution $\mu_\xi$, it is known that $x_n$ will converge to $x_\star$ almost surely (a.s.), in finite dimensions, [4,5,16].
To prove convergence, additional properties are needed. For instance, it is sufficient to assume $f$ grows at most linearly, $\|f(x)\| \le c_0 + c_1 \|x\|$. Alternatively, an a priori assumption is sometimes made that the entire sequence generated by (1.4) stays in a bounded set. The analysis of the finite dimensional problem has been refined tremendously over the years, including an analysis based on continuous dynamical systems. We refer the reader to the texts [1,7,11] and references therein.
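As an illustration of the untruncated iteration (1.4), the following is a minimal finite dimensional sketch. The function name `robbins_monro`, the Gaussian noise model, the step sizes $a_n = a_0 n^{-\gamma}$, and the toy objective $f(x) = x - 1$ are all assumptions made for the example and are not taken from the text.

```python
import numpy as np

def robbins_monro(F, x0, n_iter, rng, a0=1.0, gamma=1.0):
    """Iterate x_{n+1} = x_n - a_{n+1} F(x_n, xi_{n+1}) with a_n = a0 / n**gamma."""
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_iter + 1):
        xi = rng.standard_normal(x.shape)   # i.i.d. noise (assumed Gaussian here)
        x = x - (a0 / n**gamma) * F(x, xi)
    return x

# Toy problem (hypothetical): f(x) = x - 1 has root x = 1, and we only
# observe the noisy version F(x, xi) = x - 1 + xi, so that E[F(x, xi)] = f(x).
rng = np.random.default_rng(0)
x_final = robbins_monro(lambda x, xi: x - 1.0 + xi, np.zeros(3), 20000, rng)
```

Here $f$ grows linearly, so no truncation is needed; for this particular choice the error after $n$ steps is minus the sample mean of the noise.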
One way of ensuring convergence is to introduce trust regions that the sequence $\{x_n\}$ is permitted to explore, along with a "truncation" which constrains the sequence to said trust regions. The truncations distort (1.4) into
$$x_{n+1} = x_n - a_{n+1} F(x_n, \xi_{n+1}) + a_{n+1} p_{n+1}, \tag{1.5}$$
where $p_{n+1}$ is the so-called projection. Projection algorithms are also discussed in [1,7,11]. A general analysis of RM with truncations in Hilbert spaces can be found in [19]. Here, we apply the analysis of [12] (see also [8]) to the Hilbert space case for two versions of the truncated problem. This more recent analysis is quite simple, and readily adapts to our setting.
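The trust region methods can be formulated as follows. Let $\sigma_0 = 0$ and $x_0 \in U_0$, an open subset of $H$. Then the iteration is given by:
$$x^{(p)}_{n+1} = x_n - a_{n+1} F(x_n, \xi_{n+1}), \tag{1.6a}$$
$$x_{n+1} = \begin{cases} x^{(p)}_{n+1}, & x^{(p)}_{n+1} \in U_{\sigma_n}, \\ x^{(n)}_0, & \text{otherwise}, \end{cases} \tag{1.6b}$$
$$\sigma_{n+1} = \begin{cases} \sigma_n, & x^{(p)}_{n+1} \in U_{\sigma_n}, \\ \sigma_n + 1, & \text{otherwise}, \end{cases} \tag{1.6c}$$
where $\{U_n\}$ is a sequence of trust regions. We interpret $x^{(p)}_{n+1}$ as the proposed move, which is either accepted or rejected. If it is rejected, the algorithm restarts at one of $\{x^{(n)}_0\}$, the restart points. The essential property is that the algorithm restarts in the interior of a trust region. The r.v. $\sigma_n$ counts the number of times a truncation has occurred.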
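For subsequent calculations, we re-express (1.6) as
$$x_{n+1} = x_n - a_{n+1} f(x_n) - a_{n+1} \delta M_{n+1} + a_{n+1} p_{n+1}, \tag{1.7}$$
where $\delta M_{n+1}$, the noise term, is
$$\delta M_{n+1} = F(x_n, \xi_{n+1}) - f(x_n).$$
A natural filtration for this problem is $\mathcal{F}_n = \sigma(x_0, \xi_1, \ldots, \xi_n)$. Since $x_n$ is $\mathcal{F}_n$-measurable, the noise term can be expressed via the filtration as
$$\delta M_{n+1} = F(x_n, \xi_{n+1}) - E[F(x_n, \xi_{n+1}) \mid \mathcal{F}_n].$$
We now describe two implementations of (1.6).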
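In the expanding trust region version, the trust regions form a nested sequence exhausting the space,
$$U_0 \subset U_1 \subset U_2 \subset \cdots, \qquad \bigcup_{n=0}^{\infty} U_n = H. \tag{1.9}$$
As before, the restart points may be random or fixed, and $x_0$ is assumed to be in $U_0$. This would appear superior to the fixed trust region algorithm, as it does not require knowledge of the sets. However, to guarantee convergence, global (in $H$) assumptions on $f$ are needed; see Assumption 2 below.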
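The accept/reject mechanism with expanding trust regions can be sketched as follows. This is a hedged finite dimensional illustration, not the implementation used for the experiments: the name `truncated_rm`, the choice of balls $\{\|x\| < \text{radius}(k)\}$ as trust regions, the fixed restart point, the step sizes $a_n = a_0/n$, and the cubic toy objective are assumptions.

```python
import numpy as np

def truncated_rm(F, x_restart, radius, n_iter, rng, a0=1.0):
    """Robbins-Monro with expanding trust regions U_k = {x : |x| < radius(k)}.

    A proposed move is accepted if it lies in U_sigma; otherwise the iterate
    restarts at x_restart and sigma, the truncation count, increases by one.
    """
    x = np.asarray(x_restart, dtype=float)
    sigma = 0
    for n in range(1, n_iter + 1):
        xi = rng.standard_normal(x.shape)
        proposal = x - (a0 / n) * F(x, xi)
        if np.linalg.norm(proposal) < radius(sigma):
            x = proposal                              # accept the proposed move
        else:
            x = np.asarray(x_restart, dtype=float)    # reject and restart
            sigma += 1                                # move to the next trust region
    return x, sigma

# Toy problem (hypothetical): f(x) = 4x + x^3 grows faster than linearly,
# so the untruncated iteration can blow up in its first few steps.
rng = np.random.default_rng(0)
F = lambda x, xi: 4.0 * x + x**3 + 3.0 * xi
x_final, n_trunc = truncated_rm(F, np.array([0.5]), lambda k: k + 1.0, 20000, rng)
```

Truncations are expected only in the early iterations, while the steps $a_n$ are still large; afterwards the iterate settles near the root at zero.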
Outline.
In Section 2, we give sufficient conditions under which we are able to prove convergence in both the fixed and expanding trust region problems. In Section 3, we state assumptions for the relative entropy minimization problem under which we can ensure convergence. Some examples are presented in Section 4, and we conclude with remarks in Section 5.

The Robbins-Monro Algorithm
We now state our assumptions that will ensure convergence of (1.7).
Assumption 1. $f$ has a zero, $x_\star \in H$, and in the case of the fixed trust region problem, there exist $R_0 < R_1$ such that
$$U_0 \subseteq B_{R_0}(x_\star) \subset B_{R_1}(x_\star) \subseteq U_1,$$
while in the case of the expanding trust region problem, there is a nested sequence of bounded open sets $U_n$ satisfying (1.9) with $x_0 \in U_0$.
In the case of the fixed truncation, this inequality is restricted to x ∈ U 1 .
Assumption 3. $x \mapsto E[\|F(x, \xi)\|^2]$ is bounded on bounded sets, with the restriction to $U_1$ in the case of fixed trust regions.
Assumption 4. $a_n > 0$, $\sum_n a_n = +\infty$, and $\sum_n a_n^2 < \infty$.
Theorem 2.1. Under the above assumptions, for the fixed trust region problem, $x_n \to x_\star$ a.s. and $\sigma_n$ is a.s. bounded.
Theorem 2.2. Under the above assumptions, for the expanding trust region problem, $x_n \to x_\star$ a.s. and $\sigma_n$ is a.s. bounded.
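As a concrete instance of the step size condition (positive steps whose sum diverges while the sum of their squares converges), one may take polynomially decaying steps. The constants $a_0$ and $\gamma$ below are illustrative and not prescribed by the text.

```latex
a_n = a_0\, n^{-\gamma}, \qquad a_0 > 0, \quad \tfrac{1}{2} < \gamma \le 1,
\qquad\Longrightarrow\qquad
\sum_{n=1}^{\infty} a_n = a_0 \sum_{n=1}^{\infty} n^{-\gamma} = +\infty
\quad (\gamma \le 1),
\qquad
\sum_{n=1}^{\infty} a_n^2 = a_0^2 \sum_{n=1}^{\infty} n^{-2\gamma} < \infty
\quad (2\gamma > 1).
```

The classical choice $a_n = a_0/n$ corresponds to $\gamma = 1$.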
In the fixed truncation algorithm, Assumptions 2 and 3 need only hold in the set $U_1$, while in the expanding truncation algorithm, they must hold in all of $H$. While the former would seem to be a weaker condition, it requires identification of the sets $U_0$ and $U_1$ for which the assumptions hold. These sets may not be readily identifiable, as we will see in Section 4.
A careful reading of the two main theorems appearing in [12], formulated for the finite dimensional problem, reveals almost no dependence on dimensionality. Thus, they can be immediately extended to the abstract separable Hilbert space case. A minor detail is that a Martingale convergence theorem for Hilbert spaces is required. Also, though the fixed trust region algorithm is not specifically addressed in [12], it is a very minor twist on the results. We briefly comment upon the proof.
First, we introduce the sequence $M_n$,
$$M_n = \begin{cases} \sum_{i=1}^{n} a_i\, \delta M_i, & \text{for fixed trust regions}, \\ \sum_{i=1}^{n} a_i\, \delta M_i\, 1_{\|x_{i-1} - x_\star\| \le r}, & \text{for expanding trust regions and } r > 0. \end{cases} \tag{2.1}$$
By virtue of Assumptions 3 and 4, and the above construction, $M_n$ is a Martingale, and it is uniformly bounded in $L^2$ (hence also $L^1$). In [6], Chatterji proves that in reflexive Banach spaces, such as $H$, Martingales that are uniformly bounded in $L^1$ converge a.s.; see Theorems 6 and 7 of that work. With Martingale convergence in $H$ established, Theorems 1 and 2 of [12] can be applied to obtain the result. Assumptions 2, 3 and 4 correspond to the assumptions of [12].
Briefly, the proofs of the theorems from [12] are as follows. It is first shown that the number of truncations is a.s. finite, or, alternatively, $\sigma_n$ is a.s. bounded. This requires all four assumptions, along with a study of a modified sequence, $x'_n$. In the case of the expanding trust regions, $x'_n$ is defined, for appropriate $r > 0$, as
$$x'_n = x_n - \sum_{i=n+1}^{\infty} a_i\, \delta M_i\, 1_{\|x_{i-1} - x_\star\| \le r}.$$
This has the recurrence,
$$x'_{n+1} = x'_n - a_{n+1} f(x_n) - a_{n+1}\, \delta M_{n+1}\, 1_{\|x_n - x_\star\| > r} + a_{n+1} p_{n+1}.$$
In defining $x'_n$, we make use of the a.s. convergence of $M_n$. Since the number of truncations is a.s. finite, $p_n = 0$ for all sufficiently large $n$. Analyzing the scalar sequence $\|x'_n - x_\star\|^2$, it is straightforward to show $x'_n \to x_\star$, a.s. Again, as $M_n$ converges a.s., we get $x_n \to x_\star$.

Relative Entropy Minimization
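Recall from the introduction that our distribution of interest, $\mu$, is posed on the Borel subsets of a separable Hilbert space $H$. We assume that $\mu \ll \mu_0$, where $\mu_0 = N(m_0, C_0)$ is some reference Gaussian. Thus, we write
$$\frac{d\mu}{d\mu_0}(u) = \frac{1}{Z_\mu} \exp\{-\Phi_\mu(u)\}, \tag{3.1}$$
where $\Phi_\mu : H \to \mathbb{R}$ and $Z_\mu$ is the normalization. Let $\nu = N(m, C)$ be another Gaussian, equivalent to $\mu_0$, such that we can write
$$\frac{d\nu}{d\mu_0}(u) = \frac{1}{Z_\nu} \exp\{-\Phi_\nu(u)\}. \tag{3.2}$$
Assuming that $\nu \ll \mu$, implying the three measures are equivalent,
$$R(\nu \| \mu) = E^{\nu}[\Phi_\mu(u) - \Phi_\nu(u)] + \log Z_\mu - \log Z_\nu. \tag{3.3}$$
As was proven in [15], if $A$ is a set of Gaussian measures, closed under weak convergence, such that at least one element of $A$ is absolutely continuous with respect to $\mu$, then any minimizing sequence over $A$ will have a weak subsequential limit. For simplicity, in this work, we will assume $C = C_0$. Then, by the Cameron-Martin formula (see [9]),
$$\Phi_\nu(u) = -\langle u - m_0, m - m_0 \rangle_{H^1} + \tfrac{1}{2} \|m - m_0\|_{H^1}^2, \qquad Z_\nu = 1.$$
Here we have introduced the Cameron-Martin space, $H^1 = \mathrm{Im}(C_0^{1/2})$, with inner product $\langle f, g \rangle_{H^1} = \langle C_0^{-1/2} f, C_0^{-1/2} g \rangle$. Note, we work in the shifted coordinate $x = m - m_0$. We will show that $x_n \to x_\star$, and the optimal mean is then $m_\star = x_\star + m_0$.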
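Letting $\nu_0 = N(0, C_0)$ and $\xi \sim \nu_0$, we can then rewrite (3.3) as
$$R(\nu \| \mu) = J(x) + \log Z_\mu, \qquad J(x) = E^{\nu_0}[\Phi_\mu(x + m_0 + \xi)] + \tfrac{1}{2} \|x\|_{H^1}^2. \tag{3.6}$$
The first and second variations of (3.6) are then:
$$J'(x) = E^{\nu_0}[\Phi'_\mu(x + m_0 + \xi)] + C_0^{-1} x, \tag{3.7}$$
$$J''(x) = E^{\nu_0}[\Phi''_\mu(x + m_0 + \xi)] + C_0^{-1}. \tag{3.8}$$

Application of Robbins-Monro.
In [14], it was suggested that rather than try to find a root of (3.7), the equation first be preconditioned by multiplying by $C_0$. Then roots of
$$f(x) = C_0 J'(x) = x + C_0 E^{\nu_0}[\Phi'_\mu(x + m_0 + \xi)]$$
are sought, using the noisy observations $F(x, \xi) = x + C_0 \Phi'_\mu(x + m_0 + \xi)$.

Theorem 3.1. Assume that $f$ has a root $x_\star \in H^1$, that $\Phi_\mu$ is Fréchet differentiable, and that:
• The map $x \mapsto E^{\nu_0}[\|C_0^{1/2} \Phi'_\mu(x + m_0 + \xi)\|^2]$ is bounded on bounded subsets of $H^1$.
• There exists a convex neighborhood $U$ of $x_\star$ and a constant $\alpha > 0$, such that for all $x \in U$, for all $u \in H^1$,
$$\langle J''(x) u, u \rangle \ge \alpha \|u\|_{H^1}^2. \tag{3.12}$$
Then, choosing $a_n$ according to Assumption 4,
• If the subset $U$ can be taken to be all of $H^1$, for the expanding truncation algorithm, $x_n \to x_\star$ a.s. in $H^1$.
• If the subset $U$ is not all of $H^1$, then, taking $U_1$ to be a bounded convex subset of $U$, with $x_\star \in U_1$, and $U_0$ any subset of $U_1$ such that there exist $R_0 < R_1$ with $U_0 \subseteq B_{R_0}(x_\star) \subset B_{R_1}(x_\star) \subseteq U_1$, for the fixed truncation algorithm $x_n \to x_\star$ a.s. in $H^1$.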
Observe that the convergence is in H 1 , and not the underlying space H. In the sense of Theorems 2.1 and 2.2, H corresponds to H 1 .
Proof. By the assumptions of the theorem, we clearly satisfy Assumptions 1 and 4. To satisfy Assumption 3, we observe that
$$E[\|F(x, \xi)\|_{H^1}^2] \le 2 \|x\|_{H^1}^2 + 2 E[\|C_0^{1/2} \Phi'_\mu(x + m_0 + \xi)\|^2],$$
and this is bounded on bounded subsets of $H^1$.
Finally, the Fréchet differentiability and our convexity assumption, (3.12), imply Assumption 2, since, by the mean value theorem in function spaces, where λ 1 is the principal eigenvalue of C 0 . This also implies Assumption 2, since x

Examples
To apply the Robbins-Monro algorithm, the Φ µ functional of interest must be examined. In this section we present a few examples, based on those presented in [14], and show when the assumptions hold. The one outstanding assumption that we must make is that, a priori, µ 0 is an equivalent measure to µ.

Scalar Problem.
Taking $\mu_0 = N(0, 1)$, the standard unit Gaussian, let $V : \mathbb{R} \to \mathbb{R}$ be a smooth function such that is a probability measure on $\mathbb{R}$. In the above framework, Since $E[\Phi''_\mu(x + \xi)] \ge 4\epsilon^{-1}$, all of our assumptions are satisfied and the expanding truncation algorithm will converge to the unique root at $x_\star = 0$ a.s. See Figure 1 for an example of the convergence at $\epsilon = 0.1$, $U_n = (-n - 1, n + 1)$, and always restarting at 0.5. We refer to this as a "globally convex" problem since $R$ is globally convex about the minimizer.
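A scalar experiment of this type can be sketched as follows, using the parameter value $\epsilon = 0.1$, trust regions $U_n = (-n-1, n+1)$, and restart point 0.5 quoted above. Everything else is an assumption made for illustration: the potential $V(y) = y^2/2 + y^4/4$, the form $\Phi_\mu(y) = \epsilon^{-1} V(y) - y^2/2$, the preconditioned noisy map $F(x, \xi) = x + \Phi'_\mu(x + \xi)$ with $C_0 = 1$ and $m_0 = 0$, and the steps $a_n = 1/n$.

```python
import numpy as np

EPS = 0.1  # small parameter, value quoted in the text

def dPhi(y):
    # ASSUMPTION: Phi(y) = V(y)/EPS - y**2/2 with hypothetical V(y) = y**2/2 + y**4/4
    return (y + y**3) / EPS - y

def F(x, xi):
    # ASSUMPTION: noisy preconditioned first variation with C_0 = 1, m_0 = 0
    return x + dPhi(x + xi)

def run(n_iter, rng, restart=0.5):
    x, sigma = restart, 0
    for n in range(1, n_iter + 1):
        proposal = x - F(x, rng.standard_normal()) / n   # a_n = 1/n
        if abs(proposal) < sigma + 1:                    # U_sigma = (-sigma-1, sigma+1)
            x = proposal
        else:
            x, sigma = restart, sigma + 1                # truncate and restart
    return x, sigma

x_hat, n_trunc = run(100_000, np.random.default_rng(1))
```

Under these assumptions $E[F(x, \xi)] = \epsilon^{-1}(4x + x^3)$, whose only real root is $x = 0$, so `x_hat` should end close to zero after a number of early truncations.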

Locally Convex Case.
In contrast to the above problem, some minimizers are only "locally" convex. Consider the case of the double well potential. In this case, $f(x)$ vanishes at $0$ and $\pm\sqrt{1 - \epsilon}$, and $J''$ changes sign from positive to negative when $x$ enters $(-\sqrt{(1 - \epsilon)/3}, \sqrt{(1 - \epsilon)/3})$. We must therefore restrict to a fixed trust region if we want to ensure convergence to either of $\pm\sqrt{1 - \epsilon}$. We ran the problem at $\epsilon = 0.1$ in two cases. In the first case, $U_1 = (0.6, 3.0)$ and the process always restarts at 2. This guarantees convergence since the second variation will be strictly positive. In the second case, $U_1 = (-0.5, 1.5)$, and the process always restarts at $-0.1$. Now, the second variation can change sign. The results of these two experiments appear in Figure 2. For some random number sequences the algorithm still converged to $\sqrt{1 - \epsilon}$, even with the poor choice of trust region.
Consider the path space distribution on $L^2(0, 1)$, induced by where $V : \mathbb{R}^d \to \mathbb{R}$ is a smooth function. We assume that $V$ is such that this probability distribution exists and that $\mu \sim \mu_0$, our reference measure. We thus seek an $\mathbb{R}^d$ valued function $m(t) \in H^1(0, 1)$ for our Gaussian approximation of $\mu$, satisfying the boundary conditions $m(0) = m_-$, $m(1) = m_+$. For simplicity, take $m_0 = (1 - t) m_- + t m_+$, the linear interpolant between $m_\pm$. As above, we work in the shifted coordinate $x(t) = m(t) - m_0(t) \in H^1_0(0, 1)$. Given a path $x(t) \in H^1_0$, by the Sobolev embedding, $x$ is continuous with its $L^\infty$ norm controlled by its $H^1$ norm. Also recall that for $\xi \sim N(0, C_0)$, in the case of $\xi(t) \in \mathbb{R}$, Letting $\lambda_1 = 1/\pi^2$ be the ground state eigenvalue of $C_0$, The terms involving $x + m_0$ in the integrand can be controlled by the $L^\infty$ norm, which in turn is controlled by the $H^1$ norm, while the terms involving $\xi$ can be integrated according to (4.8). As a mapping applied to $x$, this expression is bounded on bounded subsets of $H^1$. Minimizers will satisfy the ODE
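To run the algorithm in path space one needs draws $\xi \sim N(0, C_0)$. The sketch below assumes that $C_0$ is the inverse of $-d^2/dt^2$ on $(0, 1)$ with homogeneous Dirichlet conditions, which is consistent with the ground state eigenvalue $\lambda_1 = 1/\pi^2$ quoted above but is otherwise an assumption; its eigenpairs are then $\lambda_k = (k\pi)^{-2}$ and $\sqrt{2}\sin(k\pi t)$. The function name `sample_bridge` and the truncation level are illustrative.

```python
import numpy as np

def sample_bridge(n_samples, n_grid, n_modes, rng):
    """Draw xi ~ N(0, C_0) on a uniform grid via a truncated Karhunen-Loeve sum.

    ASSUMPTION: C_0 = (-d^2/dt^2)^{-1} with Dirichlet conditions on (0, 1).
    """
    t = np.linspace(0.0, 1.0, n_grid)
    k = np.arange(1, n_modes + 1)
    sqrt_lam = 1.0 / (np.pi * k)                            # square roots of the eigenvalues
    basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(k, t))   # shape (n_modes, n_grid)
    z = rng.standard_normal((n_samples, n_modes))           # i.i.d. N(0, 1) coefficients
    return t, (z * sqrt_lam) @ basis                        # shape (n_samples, n_grid)

t, xi = sample_bridge(2000, 101, 100, np.random.default_rng(0))
```

Each row of `xi` is a discretized path vanishing at both endpoints. Because the expansion is written in terms of the eigenfunctions rather than the grid, the same sampler can be reused as the grid is refined.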

Globally Convex Example.
With regard to convexity about a minimizer, $m_\star$, if, for instance, $V''$ were pointwise positive definite, then the problem would satisfy (3.13), ensuring convergence. Consider the quartic potential $V$ given by (4.3). In this case, Since $\Phi''(x + m_0 + \xi) \ge \epsilon^{-1}$, we are guaranteed convergence using expanding trust regions. Taking $\epsilon = 0.01$, $m_- = 0$ and $m_+ = 2$, this is illustrated in Figure 3, where we have also solved (4.9) by ODE methods for comparison. As trust regions, we take
$$U_n = \{ x \in H^1_0(0, 1) \mid \|x\|_{H^1} \le 10 + n \}, \tag{4.11}$$
and we always restart at the zero solution. Figure 3 also shows robustness to discretization; the number of truncations is relatively insensitive to $\Delta t$.
Then, Here, we take $m_- = 0$, $m_+ = 2$, and $\epsilon = 0.01$. We have plotted the numerically solved ODE in Figure 4. Also plotted is $E[\Phi''(x_\star + m_0 + \xi)]$.
Discretizing the Schrödinger operator, we numerically compute the eigenvalues. As plotted in Figure 5, the minimal eigenvalue of $J''(m_\star)$ is approximately $\mu_1 \approx 550$. Therefore,
$$\langle J''(x_\star) u, u \rangle \ge \mu_1 \|u\|_{L^2}^2 \;\Rightarrow\; \langle J''(x) u, u \rangle \ge \alpha \|u\|_{H^1}^2, \tag{4.14}$$
for all $x$ in some neighborhood of $x_\star$. For an appropriately selected fixed trust region, the algorithm will converge.
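A spectrum computation of this kind can be sketched with a standard second order finite difference discretization. The example below treats a generic operator $-d^2/dt^2 + q(t)$ on $(0, 1)$ with Dirichlet conditions; the potential $q$ and any scaling present in the operator actually used above are not reproduced here, and the name `min_eigenvalue` is illustrative.

```python
import numpy as np

def min_eigenvalue(q, n_interior):
    """Smallest eigenvalue of -d^2/dt^2 + q(t) on (0, 1), Dirichlet conditions."""
    h = 1.0 / (n_interior + 1)
    t = np.linspace(h, 1.0 - h, n_interior)        # interior grid points
    main = 2.0 / h**2 + q(t)                       # diagonal: Laplacian plus potential
    off = -np.ones(n_interior - 1) / h**2          # off-diagonals of the Laplacian
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(A)[0]                # eigvalsh returns ascending order

# Sanity checks with hypothetical potentials: q = 0 and a constant shift q = 5.
lam_free = min_eigenvalue(lambda t: np.zeros_like(t), 199)
lam_shifted = min_eigenvalue(lambda t: 5.0 * np.ones_like(t), 199)
```

With $q = 0$ the smallest eigenvalue approximates $\pi^2$, and a constant potential shifts the whole spectrum by that constant. A negative smallest eigenvalue for a given $q$ signals loss of convexity along the corresponding path.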
However, we can show that the convexity condition is not global. Consider the path $m(t) = 2t^2$, which satisfies the boundary conditions. As shown in Figure 5, this path induces negative eigenvalues.
Figure 4. Also shown is the numerically computed spectrum for the path $m(t) = 2t^2$, which introduces negative eigenvalues.
Despite this, we still observe convergence. Using the fixed trust region, we obtain the results in Figure 6. Again, the convergence is robust to discretization.

Discussion
We have shown that the Robbins-Monro algorithm, with both fixed and expanding trust regions, can be applied to Hilbert space valued problems, adapting the finite dimensional proof of [12]. We have also constructed sufficient conditions for which the relative entropy minimization problem fits within this framework.
One problem we did not address here was how to identify fixed trust regions. Indeed, that requires a tremendous amount of a priori information that is almost certainly not available. We interpret that result as a local convergence result that gives a theoretical basis for applying the algorithm. In practice, since the root is likely unknown, one might run some numerical experiments to identify a reasonable trust region, or just use expanding trust regions. The practitioner will find that the algorithm converges to a solution, though perhaps not the one originally envisioned. A more sophisticated analysis may address the convergence to a set of roots, while being agnostic as to which zero is found.
Another problem we did not address was how to optimize not just the mean, but also the covariance in the Gaussian. As discussed in [14], it is necessary to parameterize the covariance in some way, which will be application specific. Thus, while the form of the first variation of relative entropy with respect to the mean, (3.7), is quite generic, the corresponding expression for the covariance will be specific to the covariance parameterization. Additional constraints are also necessary to guarantee that the parameters always induce a covariance operator. We leave such specialization as future work.