A Dirichlet Form approach to MCMC Optimal Scaling

This paper develops the use of Dirichlet forms to deliver proofs of optimal scaling results for Markov chain Monte Carlo algorithms (specifically, Metropolis-Hastings random walk samplers) under regularity conditions which are substantially weaker than those required by the original approach (based on the use of infinitesimal generators). The Dirichlet form methods have the added advantage of providing an explicit construction of the underlying infinite-dimensional context. In particular, this enables us directly to establish weak convergence to the relevant infinite-dimensional distributions.


Introduction
Markov Chain Monte Carlo (MCMC) algorithms form a general and widespread computational methodology addressing the problem of drawing samples from complex and intractable probability distributions (Robert and Casella, 2001; Brooks, Gelman, Jones, and Meng, 2011). Because of their simplicity and their scalability to high-dimensional settings, MCMC algorithms are now routinely used in many fields to obtain approximations of integrals that could not be tackled by common numerical methods. One of the simplest and most popular MCMC schemes, the 'Metropolis-Hastings Random Walk' (MHRW) Algorithm, generates a Markov chain as follows. Let Ω and π denote the state space and the density of the distribution of interest. Given a current state x, the chain samples a proposed value y from some symmetric transition kernel Q(x, ·) and moves to the proposal y with probability $a(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)}$ (otherwise staying at x). The resulting Markov chain is reversible with respect to π. It can be used to obtain approximate samples and to perform Monte Carlo integration using ergodic averages. Note that there are many variant algorithms, for example the Metropolis-Adjusted Langevin Algorithm (MALA: Roberts and Rosenthal, 1998).
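The accept/reject recipe just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Gaussian choice for the symmetric kernel Q, the step size `step`, the standard normal target used in `run_chain`, and the function names `mhrw_step` and `run_chain` are all assumptions introduced here.

```python
import math
import random


def mhrw_step(x, log_density, step, rng):
    """One Metropolis-Hastings random walk step for a product target.

    x           : current state (list of floats)
    log_density : log of the one-dimensional target density (up to a constant)
    step        : standard deviation of the symmetric Gaussian proposal (assumed)
    """
    y = [xi + step * rng.gauss(0.0, 1.0) for xi in x]
    # log of pi(y) / pi(x) for a product-form target
    log_ratio = sum(log_density(yi) - log_density(xi) for xi, yi in zip(x, y))
    # accept with probability 1 ∧ pi(y)/pi(x), otherwise stay at x
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return y, True
    return x, False


def run_chain(n, step, steps, seed=0):
    """Run the sampler on R^n for a standard normal product target (assumed)."""
    rng = random.Random(seed)

    def log_density(u):
        return -0.5 * u * u

    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    accepted = 0
    for _ in range(steps):
        x, ok = mhrw_step(x, log_density, step, rng)
        accepted += ok
    return x, accepted / steps
```

Calling, say, `run_chain(n=5, step=0.5, steps=1000)` returns the final state together with the empirical acceptance rate; working with the log-ratio rather than the ratio itself avoids underflow when n is large.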

MCMC Optimal Scaling
Because of the popularity of MCMC algorithms, quantitative and mathematically rigorous understanding of their behaviour is of considerable interest. The framework of Optimal Scaling (Roberts, Gelman, and Gilks, 1997) provides an effective and powerful approach. The idea is to consider a sequence of target distributions π (n) defined on state spaces Ω (1) , Ω (2) , . . . of increasing dimensionality (typically Ω (n) = R n ), and to study the behaviour of the resulting sequence of MCMC algorithms as n → ∞. One obtains a sequence of Markov chains X (1) , X (2) , . . . , where each X (n) = X (n) (t) : t = 0, 1, 2, . . . is obtained from the chosen MCMC algorithm with target π (n) . Appropriate sequences of algorithms lead to non-trivial limiting behaviour of X (n) , namely that a time-rescaled version of X (n) converges to a tractable and informative limiting process X ∞ .
The resulting asymptotic analysis provides valuable insight in two practically relevant ways. Firstly, inspection of the time-rescaled version of $X^{(n)}$ leads to rigorous proofs of useful results about the computational complexity of the sequence of MCMC algorithms, viewed as depending on the dimensionality of the integration space $\Omega^{(n)}$. The now-classical example is that of Roberts et al. (1997) (see also Roberts and Rosenthal, 1998). Their results show that, for simple targets on $\Omega^{(n)} = \mathbb{R}^n$, MHRW needs $O(n)$ steps to explore the state space entirely. By way of contrast, the more sophisticated MALA will take $O(n^{1/3})$ steps to explore the state space entirely (Roberts and Rosenthal, 2016). Secondly, optimal scaling results facilitate optimization of MCMC performance by providing clear and mathematically-based guidance on how to tune the parameters defining the proposal distribution $Q^{(n)}$. In fact optimizing such parameters for fixed dimensional chains $X^{(n)}$ is a difficult problem, typically not admitting analytic solution, whereas the limiting object $X^{\infty}$ is often simple enough to allow a neat analytical optimization. This yields guidance (e.g. optimal values for average acceptance rates) which is widely used by practitioners, especially via self-tuning or Adaptive MCMC methodologies (Andrieu and Thoms, 2008; Rosenthal, 2011).
Originally Roberts et al. (1997) dealt with MHRW and independent, identically distributed (i.i.d.) targets, namely $\Omega^{(n)} = \mathbb{R}^n$ and $\pi_n(x^{(n)}) = \prod_{i=1}^{n} \pi(x^{(n)}_i)$ where π is a suitably smooth univariate density function. The i.i.d. assumption is restrictive; however there are many extensions showing that the relevant results (order of complexity and optimal average acceptance rate) hold with significantly greater generality. These extensions include: independent targets with different scales (Bédard, 2007), Gibbs random fields (Breyer and Roberts, 2000), exchangeable normals (Neal and Roberts, 2006), elliptical densities (Sherlock and Roberts, 2009), densities with bounded support (Neal, Roberts, and Kong Yuen, 2012) and infinite-dimensional distributions with interaction terms (Mattingly, Pillai, and Stuart, 2012).
The Optimal Scaling framework is one of the most successful and practically useful ways of performing asymptotic analysis of MCMC methods in high-dimensions. Indeed, optimal scaling results are not limited to the analysis of MHRW and MALA, but have been used to analyze and compare a wide variety of MCMC schemes: Hamiltonian Monte Carlo (Beskos, Roberts, Sanz-Serna, and Stuart, 2010), Pseudo-Marginal MCMC (Sherlock, Thiery, Roberts, and Rosenthal, 2015), multiple-try MCMC (Bédard, Douc, and Moulines, 2012) and many others.

Contribution of this paper
The key mathematical result underpinning optimal scaling results, regardless of the classes of targets and algorithms considered, concerns the convergence of time-rescalings of the sequence of resulting Markov chains X (n) . Such convergence is usually expressed in the form of weak convergence of the first coordinate X (n) 1 of the vector process X (n) , with the weak limit being a one-dimensional limiting diffusion process X ∞ 1 (typically a Langevin diffusion). The main interest of Optimal Scaling results lies exactly in the high-dimensionality of the target distribution. So it is arguable that focusing on the first component only is somewhat restrictive and undesirable, insofar as it deflects attention from the genuine multivariate problem of interest. Rather than focusing on one-dimensional marginals, it would be more satisfying to study the full joint distribution of X (n) . To do so one has to embed the process X (n) , originally living in Ω (n) = R n , into the limiting space Ω ∞ = R ∞ (for example by allowing moves of only the first n coordinates, while viewing the remaining coordinates as being static and drawn from equilibrium). One then needs to prove the convergence of the whole stochastic process X (n) to the infinite-dimensional limiting stochastic process X ∞ . Roberts et al. (1997) observe that it is not hard to extend classic optimal scaling results to the study of convergence of a finite and fixed number of components (i.e. X (n) 1:k converging to X ∞ 1:k for fixed k and n going to infinity), but this confines attention to the joint distribution of X (n) for fixed n. The approach using Ethier and Kurtz (1986) results, based on uniform convergence of generators, does not easily apply to the study of processes living on infinite-dimensional state spaces (e.g. it can be necessary to assume that the state space is locally compact). 
Moreover such techniques typically require rather substantial regularity conditions (in terms of target density derivatives and their moments).
In this paper we propose a different probabilistic approach to MCMC Optimal Scaling, relying on infinite-dimensional Dirichlet Form theory (Ma and Röckner, 1992) to prove the crucial convergence result. The abstract and powerful theory of Dirichlet forms, and specifically the notion of Mosco (1994) convergence, allows us to work directly and naturally on the infinite-dimensional space $\mathbb{R}^{\infty}$ while requiring only modest regularity assumptions. In the following we will focus on the classic MHRW framework of Roberts et al. (1997), proving convergence for the whole infinite-dimensional stochastic process under mild regularity assumptions (finite Fisher information and local Hölder and controlled growth of first derivative of log-density). In MCMC scenarios the smoothness and tail-behaviour of the target can impact massively on the performance of the algorithm (Neal et al., 2012; Roberts and Tweedie, 1996); therefore it is important to establish general conditions under which the Optimal Scaling asymptotic analysis is still valid. The following results are relevant to the Computational Statistics community interested in a theoretical understanding of MCMC methods, and also to the Stochastic Processes community interested in convergence of stochastic processes and applications of Dirichlet Form theory. To the best of our knowledge, this is the first application of Mosco convergence to the analysis of MCMC methods, and we expect that the proof strategies developed in this paper will be useful to people seeking to prove convergence of infinite-dimensional stochastic processes arising in MCMC and other applications.

Organization of the paper
Section 2 defines the class of MCMC algorithms being considered, and briefly reviews relevant theoretical notions, including the notion of Mosco convergence of forms (Mosco, 1994) and weak convergence through Dirichlet forms (Sun, 1998). It also presents the main results of the paper, namely Mosco and weak convergence of the relevant infinite-dimensional processes. Section 3 establishes Mosco convergence, while Section 4 deals with weak convergence (under somewhat stronger regularity conditions): the existence of the limiting process is established in Appendix A. Finally Section 5 discusses possibilities for future work and compares our work to some recent results involving Optimal Scaling for infinite-dimensional distributions (Mattingly et al., 2012) and Optimal Scaling under weak regularity of the target (Durmus, Le Corff, Moulines, and Roberts, 2016).

Overview and main results
This paper focuses on Metropolis-Hastings random walk samplers based on a simple target, namely the joint distribution of a large independent sample taken from a fixed distribution satisfying modest regularity conditions. Suppose the fixed distribution is given by π and assume that the potential φ is continuous and everywhere differentiable, with derivative φ ′ = (log f) ′ satisfying the following combination of a local Hölder condition and a growth condition: for some k > 0, 0 < γ < 1 and α > 1, This combined growth / local Hölder condition is much less restrictive than a global Hölder regularity with exponent γ. We do not believe that condition (2) is necessary for our results to hold: however it combines the merit of reasonable generality with the advantage of simplicity of expression. Note that condition (2) suffices for establishing optimal scaling in an L 2 sense; however the Dirichlet form approach presently needs to use a stronger Lipschitz condition in order to establish weak convergence (for more details see Section 2.5).
The following notational conventions are used. Upper case letters denote random variables and corresponding lower case letters denote possible realizations, e.g. X 1 and x 1 . By L(X 1 ) we mean the distribution (or law) of the random variable X 1 , for example L(W 1 ) = N(0, 1). Subscripts denote vector components, e.g. X 1:N = (X 1 , . . . , X N ) or w (N+1):n = (w N+1 , . . . , w n ). Finally, we interpret the evaluation of probability density functions on vectors multiplicatively: if f is a one-dimensional probability density then its evaluation at a vector X 1:N is interpreted as the product of the density evaluated at each component. Thus for example f(X 1:N ) = f(X 1 ) · · · f(X N ), while f(w (N+1):n ) = f(w N+1 ) · · · f(w n ).

Metropolis-Hastings Random Walk Sampler
For each n = 1, 2, . . ., let X (n) (t) : t = 0, 1, 2, . . . be a Metropolis-Hastings Random Walk (MHRW) sampler on R n , with target measure π ⊗n ( d x 1 , . . . , d x n ) and with proposal measure defined by using independent and identically distributed Gaussian proposals on each component. The component proposals are taken to be N(0, τ 2 n ), for fixed τ > 0. We seek to understand the limiting behaviour of a time-rescaled version of X (n) as n → ∞.
For the sake of convenience we interpret $\{X^{(n)}(t) : t = 0, 1, 2, \ldots\}$ as an infinite-dimensional stochastic process on $\mathbb{R}^{\infty}$ updating only the first n components, with the remaining components drawn independently from the target distribution π and held fixed in time. The state space $\mathbb{R}^{\infty}$ is equipped with the product topology and corresponding Borel σ-algebra, and we choose the infinite product measure $\pi^{\otimes\infty}$ as invariant measure. It will be useful to note that $\mathbb{R}^{\infty}$ is a Polish space (i.e. separable and completely metrizable topological space). For example it can be equipped with the metric $d(x, y) = \sum_{j=1}^{\infty} 2^{-j} \frac{|x_j - y_j|}{1 + |x_j - y_j|}$, which induces the product topology. However $\mathbb{R}^{\infty}$ is not a Banach space, because its topology cannot be derived from any norm (for discussion of the broader context here see Conway, 1994, Chapter IV; details about $(\mathbb{R}^{\infty}, \pi^{\otimes\infty})$ are discussed in Eldredge, 2012, Section 3).
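As a small numerical illustration of the metric just mentioned, the following Python sketch evaluates a truncated version of the sum. The truncation level `terms` and the function name `product_metric` are assumptions made here for illustration; since each summand is at most $2^{-j}$, dropping all terms beyond index J changes the value by at most $2^{-J}$.

```python
def product_metric(x, y, terms=50):
    """Truncated version of d(x, y) = sum_j 2^{-j} |x_j - y_j| / (1 + |x_j - y_j|).

    x, y  : sequences holding the leading coordinates of two points of R^infinity
    terms : number of coordinates retained (illustrative truncation level)
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x[:terms], y[:terms]), start=1):
        gap = abs(a - b)
        total += 2.0 ** (-j) * gap / (1.0 + gap)
    return total
```

Each coordinate contributes less than $2^{-j}$ however large the gap, so the distance is always below 1 and a huge discrepancy in a late coordinate barely registers; this is the sense in which the metric encodes coordinate-wise convergence.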
Our attention is focussed on the following explicit construction of the first step of the MHRW, hence defining X (n) (t) : t = 0, 1 (extension of this explicit construction to all of the time-homogeneous Markov process X (n) (t) : t = 0, 1, 2 . . . follows immediately from the Markov property of X (n) , but will not be the focus of attention in the sequel). Let X = (X 1 , X 2 , . . .) be a sequence of independent and identically distributed random variables on R with P X 1 ( d x) = π( d x), let W = (W 1 , W 2 , W 3 ...) be a sequence of independent and identically distributed standard normal random variables on R with standard Gaussian density g, and let U be a Uniform(0, 1) random variable. We require X, W and U to be independent of each other. The first step of the n th MHRW X (n) (t) : t = 0, 1 is defined on (R ∞ , π ⊗∞ ) by where A n equals 1 if U < a(X 1:n , W 1:n ) and 0 otherwise, with a(X 1:n , W 1: being the Metropolis-Hastings acceptance function designed to induce reversibility. Thus, as n increases, X (n) proposes smaller jumps extending over a larger number of dimensions. In due course we will re-scale time so that the smaller jumps are proposed more frequently in compensation for their reduced size. The key result of Roberts et al. (1997) then runs as follows.
Theorem 1 (Roberts et al., 1997, Theorem 1.1). Suppose that the probability density f of π is positive and $C^2$, that $f'/f$ is Lipschitz continuous and that Let $U^{(n)}_t = X^{(n)}_1(\lfloor nt \rfloor)$, the first component of $X^{(n)}$ at the re-scaled time $\lfloor nt \rfloor$. Then $U^{(n)} \Rightarrow U$ as $n \to \infty$, where $U_0$ is distributed as π, and U solves the stochastic differential equation where F is the standard normal distribution function.
We shall show that the Dirichlet form approach allows us to replace the restrictive regularity and moment conditions of Theorem 1 by (1) and (2), thus avoiding second-order conditions on f and requiring only weak growth and local Hölder conditions on $\varphi' = f'/f$. The approach is also naturally adapted to the underlying infinite-dimensional framework.

Dirichlet forms
Consider a Polish space F furnished with a probability measure µ. In the following we will be interested in F = R ∞ and µ = π ⊗∞ (for π as given at the beginning of Section 2).
We now recall some notions from the literature of Dirichlet forms (for more details see Ma and Röckner, 1992). Note that the general theory of Dirichlet forms applies even if µ is merely a σ-additive measure, rather than a probability measure. However we will describe results only in the case of a probability measure, which reduces the complexity required in the following definitions.
Let H be the Hilbert space H = L 2 (F, µ). For any h and v in H, denote the usual L 2 inner product by . A form Φ on H is a non-negative definite and symmetric bilinear form Φ(h 1 , h 2 ), defined for h 1 , h 2 belonging to a dense linear subspace D(Φ) of H, the domain of Φ (Mosco, 1994, Section 1). We will commit a mild abuse of notation by using Φ(h) = Φ(h, h) to denote the associated quadratic functional, and we will also refer to Φ(h) as a form (the polarization identity yields a 1:1 correspondence between forms and quadratic functionals). A form Φ can be extended to the whole space H by setting Φ(h) = ∞ for any h ∈ H \D(Φ). A Dirichlet form is a closed, Markovian form (Mosco, 1994, Section 1 Given a Markov process on F, a Dirichlet form can be associated with it as follows. In the discrete-time case, let {X(t) : t = 0, 1, . . .} be a discrete-time Markov chain on the Polish space F, assumed reversible with respect to the probability measure µ. The corresponding Dirichlet form with starting state X(0) distributed according to µ. Note that the second equality in (7) holds because of the reversibility assumption. Now consider the continuous-time case. Let {X x (t) : 0 t < ∞} be a continuous-time Markov process on F, also reversible with respect to the measure µ. Here time is denoted by t, while x is the starting point of the process. Let {T t : t 0} denote the Markov semigroup of operators T t : H → H given by with D(Φ) being the subset of H for which the limit in (8) is finite. Note that (7) can be obtained as a special case of (8), by reformulating the discrete-time Markov chain as a continuous-time process with jumps happening according to an exponential clock of unit rate. 
Ma and Röckner (1992) show that, under some mild regularity conditions (for example regularity or quasiregularity of the Dirichlet form in question; see Definition 8 in Section 2.4 below), for each Dirichlet form Φ there exists a Markov process {X x (t) : t 0} (x ∈ F) such that Φ is its associated Dirichlet form.
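To make the discrete-time construction concrete, here is a hedged Python sketch on a finite state space. It assumes that the discrete-time Dirichlet form takes the standard shape $\Phi(h) = \langle h, h - Ph \rangle = \frac{1}{2} E[(h(X(1)) - h(X(0)))^2]$ with X(0) distributed according to µ, where P is the one-step transition operator. The three-state target, the Metropolis kernel and the function names are illustrative assumptions, not objects taken from the paper.

```python
def metropolis_kernel(mu):
    """Transition matrix of a Metropolis chain on {0, ..., m-1} with target mu.

    Proposals are uniform over the other states (an illustrative choice),
    which makes the chain reversible with respect to mu.
    """
    m = len(mu)
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                P[i][j] = min(1.0, mu[j] / mu[i]) / (m - 1)
        # remaining mass stays put (diagonal entry is still 0 when summed)
        P[i][i] = 1.0 - sum(P[i])
    return P


def dirichlet_form_two_ways(P, mu, h):
    """Return (<h, h - Ph>_mu, 0.5 * E[(h(X1) - h(X0))^2]) with X0 ~ mu."""
    m = len(mu)
    Ph = [sum(P[i][j] * h[j] for j in range(m)) for i in range(m)]
    inner = sum(mu[i] * h[i] * (h[i] - Ph[i]) for i in range(m))
    half_sq = 0.5 * sum(
        mu[i] * P[i][j] * (h[j] - h[i]) ** 2
        for i in range(m)
        for j in range(m)
    )
    return inner, half_sq


mu = [0.2, 0.3, 0.5]
P = metropolis_kernel(mu)
inner, half_sq = dirichlet_form_two_ways(P, mu, [1.0, -2.0, 0.5])
```

The two returned numbers agree up to rounding, and both are non-negative, which is the finite-state counterpart of the form being non-negative definite.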

Mosco convergence of forms
Mosco (1994, Definition 2.1.1) introduced the following notion of convergence of forms. In the case of Dirichlet forms, this entails uniform convergence of the semigroups of the associated processes: see Theorem 4 below.
Definition 2. A sequence of forms {Φ n : n = 1, 2, . . .} in H converges to a form Φ in H (using the notation Φ n M → Φ) if the following conditions hold: Remark 3. There is a potential terminological confusion between weak convergence of elements of a Hilbert space (h n w → h if h n , g → h, g for all g ∈ H) and weak convergence of distributions of random variables for all bounded continuous f). In the language of functional analysis, the second kind of convergence is more properly thought of as weak * convergence of (probability) measures. In this second case we will refer to (probabilistic) weak convergence.
The following result plays a key enabling rôle in the application of Dirichlet forms to MCMC theory.

Nests, capacity and quasi-regularity
We first introduce the notion of capacity (see Albeverio and Röckner, 1989, (2.2) and Ma and Röckner, 1992, Def.III.2.1 and Ex.III.2.10) Definition 5 (Dirichlet form capacity). Given an open set U ⊆ F, we define the capacity of U as and, for general subsets A ⊆ F, Let Φ, Φ 1 , Φ 2 . . . be Dirichlet forms on H = L 2 (F, µ). The notion of Φ-nests (Ma and Röckner, 1992, Def.III.2.1 and Thm.III.2.11) is crucial when articulating the extent to which the Dirichlet forms are confined to suitable regions of F.
Remark 6. In the following we denote by C 0 (F) the space of continuous functions of compact support on F, which is typically too small to be much use if F is infinite-dimensional.
and individually Φ-quasi-continuous, in the sense that (an µ-version of) any h in this subset is continuous in each closed set in a Φ-nest (perhaps depending on h); 3. there is a countable subset of members of D(Φ) with Φ-quasi-continuous µ-versionsũ 1 ,ũ 2 , . . . , such that F \N is separated byũ 1 ,ũ 2 , . . . , for a set N which can be expressed as a subset of i F c i for some Φ-nest F 1 ⊆ F 2 ⊆ . . ..
Remark 9. We assume that 1 ∈ D(Φ) and 1 ∈ D(Φ n ) for every n. Such an assumption implies that the notions of quasi-regularity and nests are equivalent to their strict versions, namely strictly quasi-regular and strict nests (Ma and Röckner, 1992, Thm.V.2.15). This simplifies the exposition as it is then possible to ignore the strict versions of the above definitions. This brief summary concludes by introducing the notion of an increasing family of closed sets which is uniformly a Φ n -nest for a sequence of Dirichlet forms Φ 1 , Φ 2 , . . . .

Results of the paper
This paper applies the above notions of Dirichlet forms in the context of the MHRW framework described in Section 2.1, based on F = R ∞ and µ = π ⊗∞ . For each n = 1, 2, . . ., consider the MHRW {X (n) (t) : t = 1, 2, . . . subject to a time-rescaling by a factor of n. Via (7), this motivates consideration of the following Dirichlet form: (This is the Dirichlet form corresponding to the continuous-time Markov process resulting from the MHRW reformulated as a discrete-time Markov chain jumping at instants of an exponential clock of rate n.) The natural candidate for a limiting Dirichlet form (as n → ∞) is given by where . Here the domain S of Φ is precisely the region where the first expression in (13) can be viewed as finite. Here is the set of infinitely differentiable functions with compact support depending only on the first N components.
The gradient ∇h in (13) is then defined as the continuous extension to S of the natural definition of ∇ on Albeverio and Röckner (1989, Equation (1.12) and Remark 1.12) show that such a function exists and is π ⊗∞almost everywhere unique.
The Dirichlet form in (13) corresponds to an infinite-dimensional continuous-time Markov process {X ∞ (t) : t 0} with state-space (R ∞ , π ⊗∞ ), for which each component evolves according to an independent copy of a specific diffusion on R with invariant measure π and speed given by a specified function of τ. Some care is needed to establish a rigorous proof that such a process has associated Dirichlet form given in the form of (13). Albeverio and Röckner (1989, Equations (2.8)-(2.11)) give sufficient conditions on Φ for the corresponding Markov process to be well defined. In Appendix A we prove that these conditions hold for Φ as specified in (13). A simple computation with Gaussian densities shows that where F is the standard normal distribution function: the limiting Dirichlet form (13) therefore agrees with the Dirichlet form for the limiting diffusion given by Roberts et al. (1997) as described in Theorem 1.
The key result of this paper is that Mosco convergence of Φ n to Φ holds under the relatively weak conditions on the potential φ given at and above (2) (finite Fisher information, and combined local Hölder and growth condition for the derivative of the potential φ).
Theorem 11. For Φ n and Φ defined by (12) and (13), using a potential φ satisfying (2) together with finite Proof. It suffices to establish both (M1) and (M2) of Definition 2 above. Dealing with these in reverse order (so as to dispose of the easiest case first), Property (M1) is established in Section 3.3 below, and Property (M2) is established in Section 3.2.
Mosco convergence of forms immediately implies the uniform convergence of the associated semigroups such as {T t : t 0}, and hence (probabilistic) vague convergence of the finite-dimensional distributions of the corresponding process {X  : t 0} be their associated semigroups. Then Φ n M → Φ implies the uniform convergence of semigroups in the strong operator topology: for any t 0 > 0 and h ∈ H sup Remark 13. Kolesnikov (2006) notes that vague convergence holds for finite-dimensional distributions of the corresponding Markov processes. Note however that the above Corollary establishes L 2 convergence of marginal distributions, which in some respects is much stronger (e.g. it controls some unbounded test functions).
These results lead to optimal scaling arguments for finite-dimensional distributions of the Metropolis-Hastings random walk sampler, directly following the final argument of Roberts et al. (1997). Fastest asymptotic exploration of the state space is obtained exactly by optimizing the limiting process (governed by the Dirichlet form given in (13)). This limiting Dirichlet form depends on τ only through a multiplicative factor $\tau^2 c(\tau)$ which measures the speed at which the limiting process evolves; therefore exploration occurs as fast as possible exactly when $\tau^2 c(\tau) = \tau^2 \, E\left[1 \wedge \exp\left(N(-\tfrac{\tau^2}{2} I, \tau^2 I)\right)\right]$ is maximized, and at this maximum the acceptance probability for jumps is given by the famous "Goldilocks constant" 0.234 obtained by Roberts et al. (1997). See Roberts and Rosenthal (2016) for more details on the connection between asymptotic analysis through scaling limits and the algorithmic complexity of MCMC algorithms.
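The maximization can be checked numerically. The sketch below relies on the standard Gaussian identity $E[1 \wedge \exp(N(-s^2/2, s^2))] = 2F(-s/2)$, with F the standard normal distribution function and $s = \tau\sqrt{I}$; the grid search, its bounds and the function names are illustrative assumptions rather than part of the paper.

```python
import math


def limiting_acceptance(s):
    """c = E[1 ∧ exp(N(-s^2/2, s^2))] = 2 F(-s/2), where s = tau * sqrt(I).

    Uses F(x) = 0.5 * erfc(-x / sqrt(2)), so 2 F(-s/2) = erfc(s / (2 sqrt(2))).
    """
    return math.erfc(s / (2.0 * math.sqrt(2.0)))


def speed(s):
    """Speed factor tau^2 c(tau), expressed in the variable s (up to the factor 1/I)."""
    return s * s * limiting_acceptance(s)


def optimise(lo=0.01, hi=10.0, grid=100000):
    """Crude grid search for the maximiser of the speed factor (illustrative)."""
    candidates = (lo + (hi - lo) * k / grid for k in range(grid + 1))
    best = max(candidates, key=speed)
    return best, limiting_acceptance(best)


s_opt, acc_opt = optimise()
```

Under these assumptions the search returns $s \approx 2.38$, that is $\tau \approx 2.38/\sqrt{I}$, with limiting acceptance probability close to 0.234, in line with the constant quoted above.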
Good practice in Markov-chain Monte Carlo involves estimators which make use of entire sample paths (deleting the initial "burn-in" periods), and so it is relevant to consider (probabilistic) weak convergence of the distribution of the entire sample path of {X (n) (t) : t 0} to that of {X ∞ (t) : t 0}. Sun (1998) provides sufficient conditions to prove this using Dirichlet form theory.
Then X (n) converges to X ∞ in the sense of (probabilistic) weak convergence.
Note that the topology of F only plays a role in formulating closedness and compactness of the sets F 1 , F 2 , . . . . The previous result, together with results from Section 3, can then be used to prove weak convergence of the process of interest, so long as we strengthen the regularity required of the density f (and thence of the potential φ).
Theorem 15. Let {X (n) (t) : t 0} and {X ∞ (t) : t 0} be the Markov processes associated with Φ n and Φ defined by (12) and (13) (see Sections 2.1 and 2.5). Suppose that the potential φ has Lipschitz-continuous first derivative, meaning that |φ ′ (x+v)−φ ′ (x)| < k|v| for a fixed k and for all x, v ∈ R, and finite Fisher information, Then X (n) converges to X ∞ in the sense of (probabilistic) weak convergence.
Remark 16. Lipschitz continuity of φ ′ is required in order to allow use of Lemma 20 from Section 4 below.
Proof. The result follows by proving conditions (S1) and (S2) of Theorem 14. Both conditions can be deduced from Theorem 11 and from Lemma 20 of Section 4, as follows.
First consider (S1). Theorem 11 guarantees $\Phi_n \xrightarrow{M} \Phi$ and therefore it suffices to prove $\limsup_{n\to\infty} \Phi_n(u) \le \Phi(u)$. This holds trivially if $\Phi(u) = \infty$, so suppose $\Phi(u) < \infty$. Since $\Phi_n \xrightarrow{M} \Phi$, there exists a sequence $\{u_n\} \subset H$ such that $u_n \to u$ in H and $\limsup_{n\to\infty} \Phi_n(u_n) \le \Phi(u)$. Moreover, using the construction described in Section 3.2, such a sequence can be chosen such that $\Phi(u - u_n) \to 0$. Then Lemma 20 of Section 4 implies that $\Phi_n(u - u_n) \to 0$, because Bilinearity of for any $u, v \in H$ permits the deduction that As a consequence of (15), it follows that $\limsup \Phi_n(u - u_n) = 0$. Moreover an application of the Cauchy-Schwarz inequality and the fact that $\limsup \Phi_n(u_n) \le \Phi(u)$ shows When combined with (16) and (M2) of Mosco convergence, the latter results in the deduction that $\limsup \Phi_n(u) \le \Phi(u)$, as desired.
Now consider condition (S2). Suppose $F_1 \subseteq F_2 \subseteq \ldots$ is a Φ-nest of compact sets. Therefore there exist $u_k \in D(\Phi)$ with $u_k \ge 1$ on $F \setminus F_k$ such that $\|u_k\|_H + \Phi(u_k) \to 0$. By definition of $\mathrm{Cap}_{\Phi_n}$ and by Lemma 20 below, it is the case that Therefore $\{F_k\}_{k \in \mathbb{N}}$ is a uniform $\{\Phi_n\}$-nest and so (S2) holds.

Mosco convergence for Metropolis-Hastings Random Walks
In this section we establish Mosco convergence in three steps. We begin with a lemma and a corollary which describe central limit behaviour for a conditioned instance of the Metropolis-Hastings ratio, making heavy use of the regularity conditions at and above (2). This is then applied to establish the two conditions for Mosco convergence (Definition 2) in Sections 3.2 and 3.3.

Convergence of the acceptance function
Consider the Metropolis-Hastings ratio for the Metropolis-Hastings random walk algorithm, conditioned on the chain state. Under mild conditions (finite Fisher information, local Hölder and controlled growth of derivative of log-density), we now show that the conditioned ratio $a(X_{1:n}, W_{1:n}) \mid X_{1:n} = x_{1:n}$ converges in distribution to $1 \wedge \exp\left(N(-\tfrac{\tau^2}{2} I, \tau^2 I)\right)$ as $n \to \infty$, for almost every sequence $(x_1, x_2, \ldots)$.
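This convergence can be illustrated by simulation in a simple special case. The Python sketch below assumes a standard normal target, so that the log-density is $-u^2/2$ up to a constant and the Fisher information is I = 1, and it uses the Gaussian identity $E[1 \wedge \exp(N(-s^2/2, s^2))] = 2F(-s/2)$ for the limiting mean acceptance. The target, the sample sizes and the function names are assumptions made purely for illustration.

```python
import math
import random


def limiting_acceptance(tau, fisher_info=1.0):
    """E[1 ∧ exp(N(-tau^2 I / 2, tau^2 I))] = 2 F(-tau sqrt(I) / 2) = erfc(tau sqrt(I) / (2 sqrt 2))."""
    return math.erfc(tau * math.sqrt(fisher_info) / (2.0 * math.sqrt(2.0)))


def conditional_acceptance(x, tau, draws, rng):
    """Monte Carlo estimate of E[a(X_{1:n}, W_{1:n}) | X_{1:n} = x].

    Assumes a standard normal target, for which the log-density increment is
    log f(x_i + v) - log f(x_i) = -x_i * v - v^2 / 2.
    """
    n = len(x)
    scale = tau / math.sqrt(n)  # proposals are N(0, tau^2 / n) per component
    total = 0.0
    for _ in range(draws):
        log_ratio = 0.0
        for xi in x:
            v = scale * rng.gauss(0.0, 1.0)
            log_ratio += -xi * v - 0.5 * v * v
        # 1 ∧ exp(log_ratio), written so as to avoid overflow
        total += math.exp(min(0.0, log_ratio))
    return total / draws


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]  # one fixed draw from the target
estimate = conditional_acceptance(x, tau=2.0, draws=2000, rng=rng)
target = limiting_acceptance(2.0)
```

For moderate n the estimate and the limiting value should already be close, with the remaining gap reflecting both Monte Carlo error and the fluctuation of the empirical second moment of the fixed state x around its mean.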
Lemma 17. Let φ : R → R and W = (W 1 , W 2 , . . . ) be as described above in Section 2. Given finite Fisher information, and local Hölder and controlled growth for the derivative of the log-density φ, for π ⊗∞ -almost every sequence (x 1 , x 2 , . . . ), Proof. Throughout the proof we condition implicitly on X 1 = x 1 , X 2 = x 2 , . . . . We begin by separating the left-hand side of (17) into two summands, the first of which is of mean zero and carries all the asymptotic random variation.
The second summand of the right-hand side of (18) requires more detailed attention, and its treatment requires some regularity of φ ′ , for example as expressed in (2) above. We seek to show that this summand converges in distribution to − τ 2 2 I. The strategy is to show that its expectation converges to − τ 2 2 I, while its variance vanishes asymptotically. Recall that variances are bounded by second moments. Applying this to each of the n conditionally independent terms involved in the finite sum (conditioning implicitly on X 1 = x 1 , X 2 = x 2 , . . . as noted above), we find: Employing the regularity of φ ′ as given in the combined growth / local Hölder condition (2), and noting that u α u γ for u ∈ (0, 1) and n γ n α for n 1 (with α and γ as given in (2)), where k is the constant appearing in (2). Combining (19) and (20), we deduce that the second summand has variance bounded above by So the variance of the second summand vanishes asymptotically. We turn to the expectation of the second summand. Once again we condition implicitly on X 1 = x 1 , X 2 = x 2 , . . . . We obtain . We now integrate out the implicit conditioning. The random variables Z (n) (X 1 ),. . . ,Z (n) (X n ) are i.i.d., with values lying in the range [−c n 1−γ 2 ,c n 1−γ 2 ]. Hence Hoeffding's inequality applies: for any positive ε, The right-hand side of (21) is summable over n, since γ > 0, and therefore the first Borel-Cantelli lemma applies: 1 n n i=1 Z (n) (X i ) converges almost surely to lim n→∞ E Z (n) (X 1 ) , if such a limit exists. To complete the proof it suffices to show that lim n→∞ E Z (n) (X 1 ) = − τ 2 2 I. Shifting an x-variable of integration, we achieve the following, (The exchange of integrals and expectations is justified by a Fubini argument involving the finiteness of I = ) But now we undo the shift of the x-variable of integration and use the regularity condition (2) for φ ′ . 
For n 1, this leads to: Here the finiteness of R |φ ′ (x)| e φ(x) d x = E [|φ ′ (X 1 )|] follows from E |φ ′ (X 1 )| 2 = I < ∞.
The above result will actually be used in the following form.
Proof. Given $a, b > 0$, we have $|(1 \wedge ab) - (1 \wedge b)| \le |1 - a|$. This follows because if $b < 1$ then $x \mapsto 1 \wedge bx$ is 1-Lipschitz, while if $b \ge 1$ and $a \ge \frac{1}{b}$ the left-hand side is 0, and finally if $b \ge 1$ and $a < \frac{1}{b}$ then $a \le ab < 1$ and $|ab - 1| \le |1 - a|$. Therefore which converges to 0 almost surely for $n \to \infty$. Moreover, by Lemma 17 and the dominated convergence theorem, as $n \to \infty$ so

Proving the second Mosco condition (M2)
Suppose that the conditions of Section 2.1 are satisfied. We establish the validity of Definition 2 (M2) before that of (M1), because (M2) follows by a more straightforward argument. If h ∈ H \ S then Φ(h) = ∞ and thus (M2) holds trivially, for example choosing a sequence {h n } ∞ n=1 identically equal to h.
Consequently we need only consider the case h ∈ S.
Choosing a subsequence and re-labelling, we may suppose that For fixed k, noting that h k ∈ C ∞ 0,N (R ∞ ) for some N and that by virtue of this h k is induced by a smooth function of compact support on R N , we see that The expression inside the outer expectation is bounded by τ 2 2 (|W 1:N | h ′ k ∞ ) 2 , which is an integrable random variable. Because of the regularity of h k and Corollary 18, this expression converges pointwise to τ 2 2 (∇h k (X 1:N ) T W 1:N ) 2 c(τ) as n → ∞. Therefore it follows from the dominated convergence theorem that as k for sufficiently large n depending on k, and so we can choose an increasing sequence j 1 = 1 < j 2 < . . . such that for any k = 1, 2, . . .
Note that we can in addition stipulate that j_k ≥ k. For n ≥ j_1 we define σ_n = sup{k : j_k ≤ n}. Note that 1 ≤ σ_n ≤ n and moreover σ_n → ∞ as n → ∞, because σ_n ≥ k for n ≥ j_k. Finally, by definition of σ_n it is the case that j_{σ_n} ≤ n. Therefore, as n → ∞, Relabelling h_{σ_n} as h_n produces the sequence required to establish the validity of the second Mosco condition.
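The index σ_n is pure bookkeeping, and its stated properties can be checked mechanically. The sketch below assumes that a finite prefix of the sequence j_1 < j_2 < … is available as a sorted Python list; the function name `sigma` and the example j_k = k² (which satisfies j_1 = 1 and j_k ≥ k) are hypothetical choices made for illustration.

```python
from bisect import bisect_right


def sigma(n, j):
    """Return sigma_n = max{k : j_k <= n} for a strictly increasing list j.

    j[0] plays the role of j_1, so the returned k is 1-based.
    Only a finite prefix of the sequence is stored, so the value is
    capped at len(j) once n exceeds the last stored entry.
    """
    if n < j[0]:
        raise ValueError("sigma_n is only defined for n >= j_1")
    # bisect_right counts the entries of j that are <= n.
    return bisect_right(j, n)


# Illustrative choice j_k = k^2 for k = 1, ..., 10.
example_j = [k * k for k in range(1, 11)]
```

With this choice, `sigma(3, example_j)` is 1 and `sigma(10, example_j)` is 3, and in each case 1 ≤ σ_n ≤ n and j_{σ_n} ≤ n, as used in the diagonal argument.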

Proving the first Mosco condition (M1)
We now turn to the more substantial question of the validity of Definition 2 (M1) under the conditions described in Section 2.1. Consider h_n, h ∈ H such that h_n → h weakly in H as n → ∞. It is convenient to write Fixing N > 0 and taking a non-zero test function ξ in C^∞_0(R^{2N}) (so ξ is infinitely differentiable with compact support, and in particular is bounded), the function ξ(X_{1:N}, W_{1:N}) I(U < a(X_{1:n}, W_{1:n})) belongs to L²(X,W,U) and is also non-zero. We can therefore apply the Cauchy-Schwarz inequality and obtain:
‖Ψ_n(h_n)‖_{L²(X,W,U)} ≥ ⟨Ψ_n(h_n), ξ(X_{1:N}, W_{1:N}) I(U < a(X_{1:n}, W_{1:n}))⟩_{L²(X,W,U)} / ‖ξ(X_{1:N}, W_{1:N}) I(U < a(X_{1:n}, W_{1:n}))‖_{L²(X,W,U)}. (23)
Here U is the Uniform(0, 1) random variable introduced in Section 2.1, which is independent of X and W. Consider the denominator of (23). Integrating out first U and then (X_{(N+1):n}, W_{(N+1):n}) leads to
‖ξ(X_{1:N}, W_{1:N}) I(U < a(X_{1:n}, W_{1:n}))‖²_{L²(X,W,U)} = E[ξ(X_{1:N}, W_{1:N})² E[a(X_{1:n}, W_{1:n}) | X_{1:N}, W_{1:N}]] → c(τ) E[ξ(X_{1:N}, W_{1:N})²]. (24)
Convergence as n → ∞ follows from Corollary 18 (hence E[a(X_{1:n}, W_{1:n}) | X_{1:N}, W_{1:N}] converges almost surely to c(τ) = E[1 ∧ exp(N(−(τ²/2) I, τ² I))]) and the fact that ξ(X_{1:N}, W_{1:N})² E[a(X_{1:n}, W_{1:n}) | X_{1:N}, W_{1:N}] is bounded by ‖ξ‖²_∞ < ∞ (note that the acceptance probability a(X_{1:n}, W_{1:n}) lies in [0, 1]). In order to deal with the numerator of (23), it is necessary to argue in more detail, as described by the following lemma.
Lemma 19. Suppose as above that h_n → h weakly in H. Define a twisted gradient ∇^(f)_{x_{1:N}} ξ(X_{1:N}, W_{1:N}) (twisted by the density f) by requiring that it satisfy f(X_{1:N}) ∇^(f)_{x_{1:N}} ξ(X_{1:N}, W_{1:N}) = ∇_{x_{1:N}} (ξ(X_{1:N}, W_{1:N}) f(X_{1:N})). Then, as n → ∞, ⟨Ψ_n(h_n), ξ(X_{1:N}, W_{1:N}) I(U < a(X_{1:n}, W_{1:n}))⟩_{L²}
Proof. We use the following concise notation Weak convergence of h_n to h in H implies that ‖h_n‖_H ≤ M_1 for some M_1 < ∞ by the Banach-Steinhaus theorem (the "uniform boundedness principle"). On the other hand, for b ∈ H, if b̃_n(x) = E[b(X) | X_{1:n} = x] then ‖b − b̃_n‖_H → 0 as a consequence of the L² martingale convergence theorem. Accordingly Thus h̃_n → h weakly in H. These arguments show that effectively we may suppose that h_n depends only on the first n components, leading to h_n(x) = h̃_n(x) for every n and x ∈ R^n.
The following equality is obtained by translating x to x − (τ/√n) w, then multiplying and dividing through by f(x_B − (τ/√n) w_B)/f(x_B), finally using reflection to replace w_B by −w_B (noting that g is symmetric). (27) From (27) it follows that (26) equals Adding and subtracting appropriate terms in (28), and multiplying and dividing the resulting second summand by −τ/√n, we obtain Note that the density f is positive and C¹ everywhere, and hence is strictly positive and bounded with bounded first derivative on the compact projection of the support of ξ. Using Corollary 18 and smoothness and compact support of the test function ξ, the expression . Therefore this expression is bounded by and therefore converges also in L²(X,W_{1:N}). Consequently, since h_n converges weakly to h in L²(X,W_{1:N}) and the inner product of a strongly and a weakly converging sequence is a convergent sequence of real numbers (using again the uniform boundedness principle), the second term of (29) converges to the limit The proof of the lemma will be completed by showing that the first term of (29) converges to 0 as n → ∞. This term can be rewritten as We shall show that ‖b_n(x, w_A)‖_{L²(X,W_{1:N})} is bounded and ‖c_n(x, w_A)‖_{L²(X,W_{1:N})} → 0, which implies that (30) converges to 0.
Boundedness of ‖b_n(x, w_A)‖_{L²(X,W_{1:N})} is almost immediate. Since ‖h_n‖_{L²_X} ≤ M_1 (using the uniform boundedness principle) and ξ( M 2 (since both ξ and f are continuous and the set {(x_A, w_A) : ξ(x_A − (τ/√n) w_A, w_A) > 0} is contained in the compact set K defined at the start of this proof), it follows that ‖b_n(x, w_A)‖_{L²(X,W_{1:N})} ≤ (τ/√2) M_1 M_2 for some positive M_1 and M_2 not depending on n. Using f(x) = e^φ(x), we bound the integral factor of c_n(x, w_A) as a sum of two integrals: . We deal with these two integrals separately. Since |(a ∧ c) − (b ∧ c)| ≤ |a − b| for any a, b, c > 0, the modulus in the first integral on the right-hand side of (31) is smaller than e ∆ A − e ∆ A . Since e^x is locally Lipschitz, there exists a constant c > 0 such that, for (x_A, w_A) ∈ K, we can use (2) to deduce that which converges to 0 uniformly over (x_A, w_A) ∈ K.
The second integral of the right-hand side of (31) can be dealt with as follows. Suppose Δ_A > 0 for simplicity (if Δ_A < 0 the argument needs only trivial modification). Then To complete the proof of the lemma, we show that (34) is bounded for (x_A, w_A) ∈ K and converges almost surely to 0 as n → ∞. The integral terms of (34) are bounded either by 1 or by the (finite) supremum of e^(−Δ_A) over (x_A, w_A) ∈ K. Moreover, since the function x → e^x is locally Lipschitz, there exists c > 0 such that for which is bounded over (x_A, w_A) ∈ K. Therefore (34) is bounded. Finally, for almost every w_A and x_1, x_2, … it is the case that Δ_A converges to 0 and Δ_B converges in distribution to N(−(τ²/2) I, τ² I) (see Lemma 17). Therefore the integral ∫_{−Δ_A < Δ_B < Δ_A} g(w_B) dw_B converges almost surely to 0 and Thus the second integral of the right-hand side of (31) converges to 0 as n → ∞. Accordingly we have shown that the first term of (29) converges to 0 as n → ∞, and so this completes the proof of the lemma.
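The constant c(τ) = E[1 ∧ exp(N(−(τ²/2) I, τ² I))] that enters through Corollary 18 can be evaluated directly. A standard Gaussian computation gives the closed form c(τ) = 2 Φ(−τ √I / 2), where Φ is the standard normal distribution function. The sketch below compares this closed form with a plain Monte Carlo estimate; the function names, the sample size and the seed are illustrative assumptions, and the default I = 1 corresponds to a standard normal marginal target.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def limiting_acceptance(tau, info=1.0):
    """Closed form c(tau) = 2 * Phi(-tau * sqrt(I) / 2)."""
    return 2.0 * normal_cdf(-0.5 * tau * math.sqrt(info))


def monte_carlo_acceptance(tau, info=1.0, samples=200_000, seed=1):
    """Monte Carlo estimate of E[1 ∧ exp(Z)] with Z ~ N(-tau^2 I / 2, tau^2 I)."""
    rng = random.Random(seed)
    mean = -0.5 * tau * tau * info
    sd = tau * math.sqrt(info)
    total = 0.0
    for _ in range(samples):
        z = rng.gauss(mean, sd)
        # 1 ∧ exp(z), written so that exp is never called on a positive argument.
        total += 1.0 if z >= 0.0 else math.exp(z)
    return total / samples
```

With I = 1 and τ = 2.38 the closed form gives approximately 0.234, the familiar limiting acceptance rate of Roberts, Gelman, and Gilks (1997).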

From (23), (24) and Lemma 19 it follows that for any
Given (35), we can prove (M1) of Definition 2 using Hilbert space duality. We consider h ∈ S and then h ∈ H \ S. If h ∈ S, then an integration-by-parts argument using the compact support of ξ shows that Using Hilbert space duality and taking the supremum over N and ξ we obtain the desired inequality This establishes (M1) of Definition 2 for the case of h ∈ S.
On the other hand, (M1) follows for the case of h ∈ H \ S if it can be shown that the supremum over ξ of the right-hand side of (35) is equal to infinity. Since h ∉ S, we can use Hilbert space duality, together with the definition of S, and also the definition of the twisted gradient in Lemma 19, to show that (For otherwise the numerator, viewed as a function of ξ, extends to a continuous linear function on S, and the Riesz representation theorem for Hilbert space would then imply that h ∈ S.) Since h ∈ H and therefore To apply (37) to (35), we consider test functions ξ of the form ξ(X 1:N , W 1: Moreover, since E[ξ_2(W_i)] = 0 for all indices i, we have Combining (38) and (39), and using the specific form of the test function ξ, the supremum of the right-hand side of (35) is controlled by a fixed positive finite multiple of sup Now W_1 can be approximated arbitrarily well in L²_{W_1} by mollifications ξ_2(W_1) such that ξ_2 ∈ C^∞_0(R) and E[ξ_2(W_1)] = 0. Consequently the supremum over ξ_2 in (40) is equal to E[W_1²] = 1. Therefore (40) equals where the infinite value of the second supremum follows from (37). This establishes (M1) of Definition 2 for the case of h ∈ H \ S, and thus (M1) holds for all h ∈ H. The results of this section and of Section 3.2 therefore together establish Mosco convergence of Φ_n to Φ.

Weak convergence
In this section we show that a strengthening of (2) to deliver a global Lipschitz property for φ ′ permits control of the Φ n by the Sobolev norm associated with Φ. This suffices to allow the application of the results of Sun (1998) to establish (probabilistic) weak convergence.
Lemma 20. Suppose that φ′ is Lipschitz-continuous, meaning that |φ′(x + v) − φ′(x)| ≤ k|v| for a fixed k and for all x, v ∈ R. Then there exists C depending on τ but not depending on n such that, for any h ∈ H, Proof. If Φ(h) = ∞, then (41) holds trivially (note that Φ_n(h) < ∞ whenever h ∈ H). We may therefore suppose that Φ(h) < ∞.
Viewing Φ_n(h) as an expectation as in Equation (12), we divide the expectation according to whether or not Σ_{i=1}^n |W_i|² is greater than c_n for a suitable constant c_n.
The desired result now follows because n P[Σ_{i=1}^n |W_i|² > c_n] converges to 0 as n → ∞.
We may now apply Sun (1998, Theorem 1) to deduce weak convergence of {X^(n)(t) : t ≥ 0} to {X^∞(t) : t ≥ 0} as described in Theorem 15 in Section 2.5 above.

Discussion
The above work demonstrates that Dirichlet forms provide an effective methodology for treating the Optimal Scaling framework in its natural infinite-dimensional context, and also for reducing the framework's dependence on severe regularity conditions. It is interesting to compare the Dirichlet form approach with that of the recent paper by Durmus et al. (2016), which does manage to reduce the regularity conditions required by the classical Roberts et al. (1997) approach (though not to the same extent as above), and also substantially relaxes smoothness requirements. It would be interesting to see whether the smoothness requirements of the Dirichlet form approach could be similarly reduced.
In this paper we have focussed on establishing the utility of the Dirichlet form approach for the special case of i.i.d. targets and for the Metropolis-Hastings random walk sampler; we expect this approach will prove useful in studying optimal scaling for MALA, for non-identically distributed targets (Bédard, 2007), and for the non-independent case (Breyer and Roberts, 2000; Mattingly et al., 2012). Tied as it is to equilibrium calculations, it is less clear how to extend the approach of this paper to deal with the transient behaviour of MCMC algorithms before they reach equilibrium (see for example the results of Christensen, Roberts, and Rosenthal, 2005; Jourdain, Lelièvre, and Miasojedow, 2014; Ottobre and Stuart, 2014), and this is a clear challenge for future work. Finally, there is evidently scope for adapting the Dirichlet form approach to deal with Optimal Scaling frameworks in which there is a natural Banach-space structure, and in this case we expect that the genuinely infinite-dimensional nature of the Dirichlet form approach will be highly beneficial. The techniques discussed here (especially that of Mosco convergence) also seem to have considerable potential for other high- or infinite-dimensional problems in applied probability.

A Existence of the limiting infinite-dimensional stochastic process
This appendix is devoted to proving the existence of an infinite-dimensional Markov process associated to the limiting Dirichlet form Φ defined by Equation (13). Albeverio and Röckner (1989) consider Dirichlet forms of this kind (sometimes called classical Dirichlet forms) in the framework of topological vector spaces (which includes our case). They provide and discuss a sufficient set of four conditions (Albeverio and Röckner, 1989, (2.8)-(2.11)), which we refer to below as conditions AR1-4 respectively, for the existence of a diffusion process associated to Φ (Albeverio and Röckner, 1989, Theorem 2.7). In summary, the conditions AR1-4 imply that Φ is a (local) quasi-regular Dirichlet form (Ma and Röckner, 1992, Definition 3.3.1), and this in turn implies the existence of an associated Markov process (Ma and Röckner, 1992, Theorem 3.5).
Note that 0 < k^(n)_ℓ < ∞ for any positive integers n and ℓ. Since cartesian products of compact sets are compact in the product topology (Tychonoff's theorem), it follows that the set K^(n) is a compact subset of R^∞.
The following lemma completes the proof of (46).