Weighted batch means estimators in Markov chain Monte Carlo

This paper proposes a family of weighted batch means variance estimators, which are computationally efficient and can be conveniently applied in practice. The focus is on Markov chain Monte Carlo simulations and estimation of the asymptotic covariance matrix in the Markov chain central limit theorem, where conditions ensuring strong consistency are provided. Finite sample performance is evaluated through auto-regressive, Bayesian spatial-temporal, and Bayesian logistic regression examples, where the new estimators show significant computational gains with a minor sacrifice in variance compared with existing methods.


Introduction
Markov chain Monte Carlo (MCMC) methods are widely used to approximate expectations with respect to a target distribution; see, e.g., Liu (2001) and Robert and Casella (2004).
In short, an MCMC simulation generates a dependent sample from the target distribution and then uses ergodic averages to estimate a vector of expectations. Variability of the ergodic averages is of interest because it reflects the quality of estimation and can be used to construct confidence intervals or confidence regions (see e.g. Flegal et al., 2008; Flegal and Jones, 2011; Geyer, 1992; Jones and Hobert, 2001; Vats et al., 2015). Estimating variability is akin to estimating the asymptotic covariance matrix in a multivariate Markov chain central limit theorem (CLT).
Let F be a probability distribution with support X ⊆ R^d and let g : X → R^p be an F-integrable function. We are interested in estimating the p-dimensional vector θ = ∫_X g(x) dF(x).
Let X = {X_t, t ≥ 1} be a Harris ergodic Markov chain with invariant distribution F. If Y_t = g(X_t) for t ≥ 1, then Ȳ_n = n^{-1} ∑_{t=1}^{n} Y_t → θ with probability 1 as n → ∞. The sampling distribution of Ȳ_n − θ is available via a Markov chain CLT if there exists a p × p positive definite symmetric matrix Σ such that, as n → ∞,

√n (Ȳ_n − θ) →d N_p(0, Σ). (1)

Provided an estimator of Σ is available, say Σ̂_n, one can assess variability of the estimator Ȳ_n by constructing a p-dimensional ellipsoid. Further, Vats et al. (2015) propose terminating the simulation when the ellipsoid volume is sufficiently small, which is asymptotically equivalent to stopping when a multivariate effective sample size is large enough. One of their necessary conditions is that Σ̂_n is a strongly consistent estimator of Σ.
Outside of recent work of Chan and Yau (2017), Dai and Jones (2017), Vats et al. (2015), and Vats et al. (2018), estimating the covariance matrix is rarely done in MCMC. Instead most practitioners focus on univariate techniques to estimate only the diagonal components.
Within the MCMC literature, the most common approach is univariate batch means (BM) since it is fast and simple to calculate. Speedy calculations are especially helpful in conjunction with sequential stopping rules, where multiple variances or a covariance matrix would be calculated each time a stopping criterion is checked (see e.g. Flegal et al., 2008; Gong and Flegal, 2016). Unfortunately, Flegal and Jones (2010) and Vats et al. (2015) illustrate that BM methods tend to underestimate confidence region volumes unless the number of Markov chain iterations is extremely large. Practitioners familiar with the time-series literature may argue for more complex spectral variance (SV) estimators using Tukey-Hanning or flat top lag windows.
Flat top windows are especially appealing since they tend to reduce bias, leading to more accurate confidence region volumes. Despite their popularity in fields where sample sizes are moderate, multivariate SV methods are challenging to use in MCMC since they require substantial computational effort for large sample sizes (see Section 3.3).
This paper introduces weighted BM variance estimators that are especially convenient in MCMC but are applicable in other fields such as time-series and nonparametric analysis.
The proposed estimators incorporate the same flexible lag windows as SV estimators while reducing computation time. For example, we later show a weighted BM estimator is approximately 60 times faster for a 30 × 30 covariance matrix with 5e5 iterations. Moreover, the speed-up increases as the dimension or the number of iterations grows.
The cost one pays for this computational efficiency is an increase in variance. Specifically, we show the variance is 1.875 times higher for a flat top lag window using weighted BM versus a traditional SV estimator. Our result is similar to that of Flegal and Jones (2010), who show the variance of the BM estimator is 1.5 times higher than that of the overlapping BM estimator.
In addition to calculating relative efficiency, we prove strong consistency of weighted BM estimators. Strong consistency is important since it is required for asymptotic validity of sequential stopping rules, see e.g. Flegal and Gong (2015), Glynn and Whitt (1992), Jones et al. (2006), and Vats et al. (2015). In short, asymptotic validity implies the simulation terminates with probability one and ensures the final confidence regions have the right coverage probability.
The performance of weighted BM estimators is illustrated in univariate and multivariate auto-regressive models. These finite sample simulations show weighted BM estimators converge to the true known value and that flat top lag windows enjoy significant bias reduction. As dimension or chain length increases, calculating weighted BM estimators saves significant time compared with SV estimators. Our simulations also illustrate an increase in variance relative to SV estimators, which depends on the lag window choice.
We also consider a Bayesian spatial-temporal model applied to temperature data collected from ten nearby weather stations in the year 2010. In this example, we estimate the covariance matrix associated with a vector of 185 parameters and again illustrate the improved computational efficiency of weighted BM estimators. Our final example considers a Bayesian logistic regression model and illustrates that weighted BM estimators with a flat top window provide more accurate coverage probabilities of multivariate confidence regions.
The rest of the paper is organized as follows. Section 2 summarizes current multivariate estimators of Σ. Section 3 proposes weighted BM estimators, establishes conditions that ensure strong consistency, and calculates the variance when using a Bartlett flat top lag window. Section 3 also investigates how chain length n and dimension p impact computation times for weighted BM, SV, and recursive estimators. Section 4 demonstrates the finite sample properties of weighted BM estimators via four examples. We conclude with a discussion in Section 5. All proofs are relegated to the Appendix.

Covariance matrix estimation
Estimating Σ is rarely done in MCMC output analysis. Instead, most researchers ignore the cross-correlation and only estimate the diagonal entries of Σ. Computationally efficient BM methods are usually preferred, but such methods can lead to lower than expected coverage probabilities. In this section, we provide formal definitions for existing estimators of Σ and provide some motivation for our proposed weighted BM estimators. When p = 1 these estimators reduce to the usual univariate estimators.
First consider BM estimators where a = a_n is the number of batches, b = b_n is the batch size, and n = ab. (Note a and b can depend on n, but we suppress this dependence to simplify notation.) For l = 0, 1, ..., a − 1, denote the mean vector for batch l as Ȳ_l(b) = b^{-1} ∑_{t=1}^{b} Y_{lb+t}. Then the sample variance of the batch means, scaled appropriately, yields the BM estimator

Σ̂_bm = (b / (a − 1)) ∑_{l=0}^{a−1} (Ȳ_l(b) − Ȳ_n)(Ȳ_l(b) − Ȳ_n)^T. (2)

Alternatively, overlapping BM uses n − b + 1 overlapping batches of length b denoted Ẏ_l(b) = b^{-1} ∑_{t=1}^{b} Y_{l+t} for l = 0, ..., n − b. Then the overlapping BM estimator is given by

Σ̂_obm = (nb / ((n − b)(n − b + 1))) ∑_{l=0}^{n−b} (Ẏ_l(b) − Ȳ_n)(Ẏ_l(b) − Ȳ_n)^T. (3)

Computing the overlapping BM estimator is significantly slower than BM given the increased number of batches.
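For concreteness, the two estimators above can be sketched in a few lines of NumPy. (The paper's experiments are in R; this Python sketch and its function names are ours.)

```python
import numpy as np

def bm_estimator(Y, b):
    """Nonoverlapping batch means (BM) estimate of Sigma, as at (2).

    Y is an (n, p) array of MCMC output and b is the batch size; any
    trailing partial batch is discarded.
    """
    n, p = Y.shape
    a = n // b                                    # number of full batches
    Yn = Y[: a * b].mean(axis=0)                  # grand mean
    means = Y[: a * b].reshape(a, b, p).mean(axis=1) - Yn
    return (b / (a - 1)) * means.T @ means

def obm_estimator(Y, b):
    """Overlapping batch means (OBM) estimate of Sigma, as at (3)."""
    n, p = Y.shape
    Yn = Y.mean(axis=0)
    # cumulative sums give all n - b + 1 overlapping batch means in O(n)
    S = np.vstack([np.zeros((1, p)), np.cumsum(Y, axis=0)])
    means = (S[b:] - S[:-b]) / b - Yn
    return (n * b / ((n - b) * (n - b + 1))) * means.T @ means
```

On independent draws both estimators should be close to the marginal covariance; on correlated MCMC output they inflate the naive sample covariance to account for autocorrelation.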
SV methods can also be used to estimate Σ. First consider estimating the lag k autocovariance matrix with the sample autocovariance

γ̂_n(k) = n^{-1} ∑_{t=1}^{n−k} (Y_t − Ȳ_n)(Y_{t+k} − Ȳ_n)^T for 0 ≤ k ≤ n − 1, with γ̂_n(−k) = γ̂_n(k)^T.

Then the SV estimator of Σ truncates and downweights the summed lag k autocovariances.
That is,

Σ̂_sv = ∑_{k=−(b−1)}^{b−1} w_n(k) γ̂_n(k),

where b is the truncation point and w_n(·) is the lag window.
We assume the lag window w_n(·) is an even function defined on Z such that (i) |w_n(k)| ≤ 1 for all n and k, (ii) w_n(0) = 1 for all n, and (iii) w_n(k) = 0 for all |k| ≥ b. Most commonly used lag windows satisfy this assumption, which is necessary for our proof of strong consistency. Our discussion and simulations focus on the Bartlett, Tukey-Hanning, and Bartlett flat top lag windows defined for |k| < b as

w_n(k) = 1 − |k|/b, (4)

w_n(k) = (1 + cos(πk/b))/2, (5)

and

w_n(k) = 1 for |k| ≤ b/2 and w_n(k) = 2(1 − |k|/b) for b/2 < |k| < b, (6)

respectively (see Figure 1). An interested reader is directed to Anderson (1994) for more on lag windows. It is well known the overlapping BM estimator at (3) is asymptotically equal to the SV estimator with a Bartlett lag window apart from some end effects (see e.g. Meketon and Schmeiser, 1984; Welch, 1987). Notice in Figure 1 that the Tukey-Hanning lag window slightly reduces downweighting of small lag terms compared to the Bartlett lag window in an effort to reduce bias. Politis and Romano (1995, 1996) expanded on this idea when introducing flat top lag windows that modify existing windows by letting w_n(k) = 1 for k near 0. Their work demonstrates SV estimators with flat top lag windows enjoy significant bias reduction while maintaining comparable variance. Politis and Romano (1999) later illustrate the superiority of flat top lag windows in nonparametric estimation of multivariate density functions.
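The three windows, and the second difference Δ_2 w_n used later, can be transcribed directly. (A Python sketch with our function names; the paper's code is in R.)

```python
import numpy as np

def bartlett(k, b):
    """Bartlett lag window at (4)."""
    return max(0.0, 1.0 - abs(k) / b)

def tukey_hanning(k, b):
    """Tukey-Hanning lag window at (5)."""
    return 0.5 * (1.0 + np.cos(np.pi * k / b)) if abs(k) < b else 0.0

def bartlett_flat_top(k, b):
    """Bartlett flat top lag window at (6): flat for |k| <= b/2, then linear."""
    k = abs(k)
    if k <= b / 2:
        return 1.0
    return max(0.0, 2.0 * (1.0 - k / b))

def delta2(w, k, b):
    """Second difference Delta_2 w(k) = w(k - 1) - 2 w(k) + w(k + 1)."""
    return w(k - 1, b) - 2.0 * w(k, b) + w(k + 1, b)
```

With b = 10, for example, the flat top window has Δ_2 w(5) = −2/10 and Δ_2 w(10) = 2/10, and Δ_2 w vanishes at every other positive lag; Section 3 exploits exactly this sparsity.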
We only consider the flat top window function constructed from the Bartlett window with w_n(k) = 1 for |k| ≤ b/2, as at (6). For this setting, Politis and Romano (1995, 1996) show the resulting SV estimator is equivalent to the difference of two Bartlett SV estimators.
Specifically, if w_n(k) is the flat top window at (6), then

Σ̂_sv = 2Σ̂^(1) − Σ̂^(2), (7)

where Σ̂^(1) and Σ̂^(2) denote Bartlett SV estimators with bandwidths b and b/2, respectively.
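The identity is easy to verify numerically: the flat top SV estimator equals twice the Bartlett SV estimator at truncation b minus the Bartlett SV estimator at truncation b/2. (A univariate Python sketch under our naming, not the paper's code.)

```python
import numpy as np

def sv_estimate(y, b, window):
    """Univariate SV estimate: truncated, downweighted sum of sample
    autocovariances, with w(0) = 1 and w(k) = 0 for |k| >= b."""
    yc = y - y.mean()
    n = len(y)
    est = yc @ yc / n                      # lag-0 autocovariance
    for k in range(1, b):
        est += 2.0 * window(k, b) * (yc[:-k] @ yc[k:] / n)
    return est

bartlett = lambda k, b: max(0.0, 1.0 - abs(k) / b)
flat_top = lambda k, b: 1.0 if abs(k) <= b / 2 else max(0.0, 2.0 * (1.0 - abs(k) / b))

rng = np.random.default_rng(1)
y = rng.standard_normal(5000)
b = 40
direct = sv_estimate(y, b, flat_top)
combo = 2.0 * sv_estimate(y, b, bartlett) - sv_estimate(y, b // 2, bartlett)
# direct and combo agree up to floating point error
```

The agreement is exact lag by lag: for |k| ≤ b/2 the two Bartlett weights combine to 1, and for b/2 < |k| < b only the first Bartlett term contributes.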
In the next section, we construct weighted BM estimators that inherit desired properties from lag window functions but are computationally efficient due to a nonoverlapping structure.

Weighted BM estimators
Consider first an alternative representation of the SV estimator that is akin to the overlapping BM estimator. Similar estimators have been previously studied by Damerdji (1987, 1991) and Flegal and Jones (2010). To this end, define Δ_1 w_n(k) = w_n(k − 1) − w_n(k) and Δ_2 w_n(k) = w_n(k − 1) − 2w_n(k) + w_n(k + 1). Then recall Ẏ_l(k) = k^{-1} ∑_{t=1}^{k} Y_{l+t} for l = 0, ..., n − k and consider the estimator

Σ̇ = n^{-1} ∑_{k=1}^{b} k² Δ_2 w_n(k) ∑_{l=0}^{n−k} (Ẏ_l(k) − Ȳ_n)(Ẏ_l(k) − Ȳ_n)^T.

If d = Σ̂_sv − Σ̇, Liu and Flegal (2018) show that d → 0 with probability 1 as n → ∞; hence the estimators are asymptotically equivalent. Starting with Σ̇, it is possible to reduce the number of batches, and hence the computing time, by including only nonoverlapping batches. First define the more general batch mean vector as Ȳ_l(k) = k^{-1} ∑_{t=1}^{k} Y_{lk+t} for l = 0, 1, ..., a_k − 1 and k = 1, 2, ..., b, where a_k = ⌊n/k⌋. Then the weighted BM estimator is

Σ̂_w = ∑_{k=1}^{b} Δ_2 w_n(k) (k²/a_k) ∑_{l=0}^{a_k−1} (Ȳ_l(k) − Ȳ_n)(Ȳ_l(k) − Ȳ_n)^T. (8)

The estimator Σ̂_w is not necessarily computationally efficient. However, if the lag window is such that Δ_2 w_n(k) = 0 for all but a few k values, then the first summation can be simplified.
We suggest using the Bartlett flat top lag window at (6) in an effort to reduce bias. In this case, it is easy to show Δ_2 w_n(b/2) = −2/b, Δ_2 w_n(b) = 2/b, and Δ_2 w_n(k) = 0 for all other k values. Hence, the first summation in (8) contains only two terms, which is extremely computationally friendly. For this lag window, Sections 3.3 and 4 illustrate the computational and bias advantages, respectively. Since Δ_2 w_n(·) is a discrete analogue of a second derivative of w_n(·), other piecewise linear lag windows would also be computationally efficient.
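In code, the flat top weighted BM estimator therefore reduces to two ordinary BM estimators, with the factor-of-two combination mirroring (7). (A Python sketch with our naming; the paper's implementation is in R.)

```python
import numpy as np

def bm(Y, b):
    """Nonoverlapping BM estimator at (2) with batch size b."""
    n, p = Y.shape
    a = n // b
    Yn = Y[: a * b].mean(axis=0)
    means = Y[: a * b].reshape(a, b, p).mean(axis=1) - Yn
    return (b / (a - 1)) * means.T @ means

def weighted_bm_flat_top(Y, b):
    """Weighted BM with the Bartlett flat top lag window at (6).

    Only Delta_2 w_n(b/2) = -2/b and Delta_2 w_n(b) = 2/b are nonzero,
    so the outer sum in (8) collapses to two BM estimators:
    2 * BM(batch size b) - BM(batch size b/2).
    """
    return 2.0 * bm(Y, b) - bm(Y, b // 2)
```

Both BM passes are O(n), so the estimator costs roughly two passes over the chain regardless of the truncation point b.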

Strong consistency
This section establishes conditions ensuring strong consistency of Σ̂_w for estimating Σ.
Denote the Euclidean norm by ‖·‖ and let {B(t), t ≥ 0} be a p-dimensional multivariate Brownian motion. Then the primary assumption is that of a strong invariance principle.
Assumption 1. There exist a p × p lower triangular matrix L, a nonnegative increasing function ψ on the positive integers, a finite random variable D, and a sufficiently rich probability space Ω such that for almost all ω ∈ Ω and for all n > n_0,

‖ ∑_{t=1}^{n} Y_t − nθ − LB(n) ‖ < D(ω) ψ(n). (9)

Our results hold as long as Assumption 1 holds. This includes independent processes, Markov chains, martingale sequences, renewal processes, and strongly mixing processes. An interested reader is directed to Vats et al. (2015) and the references therein.
Remark 1. Kuelbs and Philipp (1980) show λ only depends on p, the moment bound on Y_1, and δ, but quantifying this relationship is an open problem. Damerdji (1991) notes that λ is closer to 0 for slowly mixing (heavily correlated) processes and closer to 1/2 for fast mixing processes.
Remark 2. Under stronger assumptions of geometric ergodicity, a one step minorization condition, and p = 1, Jones et al. (2006) and Bednorz and Latuszyński (2007) provide an exact relationship between λ and the convergence rate of the chain (see Lemma 3 of Flegal and Jones, 2010). Establishing a similar result for p > 1 is a direction of ongoing research.
The weighted BM estimator can only be consistent if the batch size increases with n, leading to the following additional assumption.
Assumption 2. The batch size b is an integer sequence such that b → ∞ and n/b → ∞ as n → ∞, where b and n/b are both monotonically nondecreasing.
In Theorem 1 we show strong consistency ofΣ w . The proof is given in Appendix B.
Theorem 1. Suppose the conditions of Corollary 1 hold, Assumption 2 holds, and the lag window satisfies conditions (10)-(12). Then, with probability 1, Σ̂_w → Σ as n → ∞.
Lemma 1 of Vats et al. (2018) is especially useful in checking these conditions.
A method of differences calculation shows (10) holds. Vats et al. (2018) again show (11) and (12) hold if b^{-1} n^{1−2λ} log n → 0 as n → ∞. When q = 1, this is the Bartlett window at (4) and Σ̂_w equals Σ̂_bm defined at (2). Hence Theorem 1 provides an alternative proof of strong consistency under the same conditions as Vats et al. (2015).
Theorem 2. (Theorem 2, Vats et al., 2015) Suppose the conditions of Corollary 1 hold, Assumption 2 holds, and there exists a constant c ≥ 1 such that ∑_n (b/n)^c < ∞. If b^{-1} n^{1−2λ} log n → 0 as n → ∞, then with probability 1, Σ̂_bm → Σ as n → ∞.

For the flat top lag window, however, (12) does not hold since Δ_2 w_n(b/2) = −2/b and Δ_2 w_n(b) = 2/b. We can still ensure strong consistency since the estimator can be expressed as the difference between two BM estimators, similar to (7). Specifically,

Σ̂_w = 2Σ̂_bm − Σ̂^(2)_bm, (13)

where Σ̂_bm is defined at (2) (batch size b) and Σ̂^(2)_bm denotes a BM estimator with batch size b/2. With (13), strong consistency follows from Theorem 2.
Corollary 2. Suppose the conditions of Theorem 2 hold and w n (k) is the flat top lag window at (6), then with probability 1,Σ w → Σ as n → ∞.
A common choice is setting b = ⌊n^ν⌋ for 0 < ν < 1. In this case, ν > 1 − 2λ ensures b^{-1} n^{1−2λ} log n → 0 as n → ∞. Finite sample performance naturally depends on the choice of ν. Flegal and Jones (2010) and Liu and Flegal (2018) minimize the asymptotic mean-squared error and conclude the optimal truncation point is proportional to n^{1/3}.

Increase in variance
Since weighted BM variance estimators are based only on the nonoverlapping batches, a variance inflation is expected relative to SV estimators. Here we focus on estimating the diagonal entries of Σ but the off-diagonal entries behave in a similar manner.
Suppose i ∈ {1, ..., p}, and denote the estimators of the ith diagonal element of Σ based on BM and weighted BM with a Bartlett flat top lag window by σ̂²_bm and σ̂²_w, respectively. Further, denote SV estimators with Bartlett and Bartlett flat top lag windows by σ̂²_b and σ̂²_f, respectively. Results in Flegal and Jones (2010) imply

Var(σ̂²_bm) / Var(σ̂²_b) → 3/2 as n → ∞,

since the overlapping BM estimator is asymptotically equivalent to σ̂²_b. The variance of σ̂²_b has also been studied by Lahiri (1999) and Politis and White (2004).
The following result establishes the variance ratio between weighted BM and SV estimators with a Bartlett flat top lag window at (6). The proof is given in Appendix D.
Theorem 3. Suppose the conditions of Corollary 1 hold and Assumption 2 holds. Then

Var(σ̂²_w) / Var(σ̂²_f) → 1.875 as n → ∞.

Politis and Romano (1995, 1996) mention such a variance increase for flat top estimators, but go on to argue it is offset by lower bias.

Computational time
This section investigates how chain length n and dimension p affect computation time for weighted BM, SV, and recursive estimators. Calculations were completed on a 2016 MacBook (1.2 GHz Intel Core m5) and coded exclusively in R to ensure fairness. Chan and Yau (2017) provide the R-package rTACM (version 3.1), which we utilize to compute recursive estimators.
As an example, consider the p-dimensional vector autoregressive process of order 1,

X_t = Φ X_{t−1} + ε_t

for t = 1, 2, ..., where X_t ∈ R^p, the ε_t are i.i.d. N_p(0, I_p), and Φ is a p × p matrix. When the largest eigenvalue of Φ in absolute value is less than 1, the Markov chain is geometrically ergodic (Tjøstheim, 1990). The invariant distribution is N_p(0, V), where vec(V) = (I_{p²} − Φ ⊗ Φ)^{-1} vec(I_p) and ⊗ denotes the Kronecker product. Consider approximating θ = E[X_1] = 0 by Ȳ_n = X̄_n. In the CLT at (1), we have Σ = (I_p − Φ)^{-1} V + V (I_p − Φ^T)^{-1} − V.

A geometrically ergodic Markov chain is generated with Φ chosen as follows. Consider a p × p matrix A with each entry generated from a standard normal distribution, and let B = AA^T, a symmetric matrix with largest eigenvalue m. Then Φ = B/(m + 1) ensures geometric ergodicity. The corresponding Σ is estimated by weighted BM and SV estimators with the lag windows at (4), (5), and (6).

[Table: computation times by dimension and chain length]
         n = 5e4    n = 1e5    n = 2e5
p = 10   42 s       1.6 min    6.1 min
p = 20   1.7 min    3.5 min    14.9 min
p = 30   3.2 min    6.3 min    28.5 min
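The ground-truth Σ for this process follows from a standard VAR(1) identity: the invariant covariance V solves vec(V) = (I − Φ ⊗ Φ)^{-1} vec(I_p), and then Σ = (I_p − Φ)^{-1} V + V (I_p − Φ^T)^{-1} − V. A Python sketch of this construction (our code, not the paper's R implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))
B = A @ A.T                          # symmetric with largest eigenvalue m
m = np.linalg.eigvalsh(B).max()
Phi = B / (m + 1.0)                  # spectral radius < 1: geometric ergodicity

# invariant covariance V solves the fixed point V = Phi V Phi^T + I_p
vecV = np.linalg.solve(np.eye(p * p) - np.kron(Phi, Phi), np.eye(p).ravel())
V = vecV.reshape(p, p)

# asymptotic covariance in the CLT at (1)
Ip = np.eye(p)
Sigma = np.linalg.solve(Ip - Phi, V) + V @ np.linalg.inv(Ip - Phi.T) - V
```

Sigma can then serve as the known truth when benchmarking the weighted BM and SV estimators.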

AR(1) model
Consider the autoregressive process of order 1 (AR(1)), X_t = φX_{t−1} + ε_t for t = 1, 2, ..., where the ε_t are i.i.d. N(0, 1). For |φ| < 1, the Markov chain is geometrically ergodic. Consider estimating θ = E[X_1] = 0 by Ȳ_n = X̄_n; then in the CLT at (1) we have Σ = (1 − φ)^{-2}. A range of φ from 0.6 to 0.9 is evaluated, and the true value of Σ is used to compare weighted BM and SV estimators.
For each φ, we generate an AR(1) Markov chain of length 1e5 and compute weighted BM and SV estimators using the three lag window functions at (4), (5), and (6). The truncation point is b = ⌊n^{1/3}⌋ = 46 for all six estimators. The procedure is repeated independently 500 times, and the averages over the 500 replications are shown in Figure 2. When the autocorrelation is low, all the estimators perform well. As φ increases, the flat top lag window outperforms the Bartlett and Tukey-Hanning windows for both SV and weighted BM estimators.
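This experiment is easy to reproduce in miniature. The sketch below (Python, our code; the paper used R) checks a single BM estimate against the known value Σ = (1 − φ)^{-2}:

```python
import numpy as np

def ar1_chain(phi, n, rng):
    """Simulate X_t = phi * X_{t-1} + eps_t with eps_t i.i.d. N(0, 1),
    started from the stationary distribution N(0, 1 / (1 - phi^2))."""
    x = np.empty(n)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - phi ** 2)
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(3)
phi, n = 0.7, 100_000
x = ar1_chain(phi, n, rng)

b = int(n ** (1 / 3))                 # truncation point, here 46
a = n // b
mean = x[: a * b].mean()
bmeans = x[: a * b].reshape(a, b).mean(axis=1)
sigma2_bm = b / (a - 1) * np.sum((bmeans - mean) ** 2)

truth = 1.0 / (1.0 - phi) ** 2        # asymptotic variance, about 11.1
```

For moderate φ the estimate lands near the truth; as φ grows, the downward bias of BM at a fixed truncation point becomes visible, which is the motivation for the flat top window.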
It is easy to see larger k implies stronger autocovariance and cross-correlation in the chain.
These Φ are used to generate geometrically ergodic Markov chains from which Σ is estimated. For an estimator Σ̂, define E = Σ̂ − Σ and consider the mean squared error (MSE) over 500 replications.

Bayesian dynamic space-time model
Consider monthly temperature data collected at ten nearby stations in the northeastern United States in 2000, a subset of the NETemp data described in the R package spBayes (Finley et al., 2007).
A Bayesian dynamic model proposed by Gelfand et al. (2005) is fitted to the data; the model treats time as discrete and space as continuous.
Suppose y_t(s) denotes the temperature observed at location s and time t for s = 1, 2, ..., N_s and t = 1, 2, ..., N_t. Let x_t(s) be a k × 1 vector of predictors and β_t be a k × 1 coefficient vector, which is a purely temporal component, and let u_t(s) denote a space-time component.

[Figure 4: Variance ratios over 500 replications with n = 1e5.]

The model is

y_t(s) = x_t(s)^T β_t + u_t(s) + ε_t(s), ε_t(s) ~ N(0, τ²_t),
β_t = β_{t−1} + η_t, η_t ~ N(0, Σ_η),
u_t(s) = u_{t−1}(s) + w_t(s), w_t(·) ~ GP(0, C_t(·, σ²_t, φ_t)),

where GP(0, C_t(·, σ²_t, φ_t)) is a spatial Gaussian process with C_t(s_1, s_2; σ²_t, φ_t) = σ²_t ρ(s_1, s_2; φ_t), ρ(·; φ_t) is an exponential correlation function with φ_t controlling the correlation decay, and σ²_t represents the spatial variance component. The Gaussian spatial process allows closer locations to have higher correlations. Time effects for both β_t and u_t(s) are characterized by transition equations, delivering a reasonable dependence structure. Priors follow the defaults in the spDynLM function of the spBayes package. We are interested in estimating posterior expectations for the 185 parameters denoted θ = (β_t, u_t(s), σ²_t, Σ_η, τ²_t, φ_t). Again we consider Markov chains of length 5e4, 1e5, and 2e5 and report computation time ratios in Table 2. For this high-dimensional Bayesian analysis, weighted BM estimators are much cheaper to compute for Bartlett and flat top lag windows. Figure 6 plots estimates of the diagonal elements of Σ obtained with weighted BM and SV methods on the log scale.
Since the points are close to the identity line it is clear both methods produce similar estimates.

Bayesian logistic regression model
Consider a Bayesian logistic regression model where, for i = 1, ..., n, the binary response Y_i given covariates X_i ∈ R^k and coefficient vector β satisfies Pr(Y_i = 1 | X_i, β) = (1 + exp(−X_i^T β))^{-1}.
Priors for β are chosen as β ∼ N(0, 100 I_k), as in Boone et al. (2014), and the MCMClogit function in the MCMCpack package is used to sample the Markov chain.
Using methods from Vats et al. (2015), we construct a 90% confidence region for the 10-dimensional parameter β based on the BM estimator, the SV estimator with a flat top lag window, and the weighted BM estimator with a flat top lag window. Coverage probabilities of these confidence regions over 1000 repeated simulations are used to evaluate the performance of each method. Since the true value of β is unknown, the average of 500 chains, each of length 1e6, is used as the "truth". Table 3 shows the coverage probabilities for chains of length 1e4, 5e4, 1e5, and 5e5 and two different batch sizes. When b = n^{1/3}, both estimators based on a flat top window are far superior to the BM estimator. When b = n^{1/2}, the improvement remains but is less substantial.

Memory complexity will also be O(n) if b is allowed to increase by ones, since the entire chain must be stored. Gong and Flegal (2016) propose a low-cost alternative sampling plan that satisfies the conditions necessary for strong consistency. Specifically, they set b = inf{2^k : 2^k ≥ n^ν, k ∈ Z_+} so that b increases by doubling the batch size. It is then possible to store only the batch means and merge every two batches when the batch size doubles. Such a sampling plan reduces memory complexity to O(a). Moreover, the estimator can be updated in O(1) computational steps using a recursive variance calculation.
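The doubling rule for the batch size can be sketched in a few lines (Python, our naming):

```python
import math

def doubled_batch_size(n, nu=1/3):
    """Batch size b = inf{2^k : 2^k >= n^nu, k in Z+}, following Gong and
    Flegal (2016): b grows only by doubling, so stored batch means can be
    merged pairwise and memory drops from O(n) to O(a)."""
    k = max(0, math.ceil(math.log2(n ** nu)))
    return 2 ** k
```

For example, with n = 1e5 and ν = 1/3, n^ν ≈ 46.4, so the rule returns b = 64, the smallest power of two at least that large.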
The weighted BM estimator with a Tukey-Hanning window does not reduce computational time significantly. As mentioned earlier, the proposed estimators are most beneficial for lag windows with Δ_2 w_n(k) = 0 for all but a few k. Nevertheless, as dimension and chain length increase, weighted BM still saves some computing time. More efficient coding or a linear approximation could provide additional efficiency if one prefers the Tukey-Hanning window.
All estimators in this article use the same truncation point for a fair comparison, which is not necessarily optimal. In fact, finite sample coverage probabilities should improve by choosing better truncation points, as suggested by Liu and Flegal (2018); these optimal truncation points are those that minimize the asymptotic mean-squared error. If one wants to omit further exploration, we suggest using the same truncation point for weighted BM estimators as for the corresponding SV estimators. Flegal and Jones (2010) provide an illustrative example regarding optimal batch sizes for BM and overlapping BM. For flat top windows, Politis (2003) suggests an empirical rule for optimal truncation point selection.
However, optimal batch size is a direction of future research for weighted BM estimators.

Acknowledgments
We are grateful to the anonymous referees, anonymous associate editor, Galin Jones, and Dootika Vats for their constructive comments and helpful discussions, which resulted in many improvements to this paper.

A Preliminaries for Theorem 1
We first introduce some notation and propositions. Recall the definition of Σ̂_w at (8), whose individual entries are denoted Σ̂_w,ij.
Recall L is the lower triangular matrix satisfying Σ = LL^T and denote by Σ_ij the individual entries of Σ. Finally, define C(t) = LB(t) and let C^(i)(t) be the ith component of C(t). Then, for any ε > 0 and almost all sample paths, there exists n_0(ε) such that the propositions below hold for all n ≥ n_0 and all i = 1, ..., p.

B Proof of Theorem 1
Theorem 1 follows from Lemmas 1 and 2 presented below.
Lemma 1. Suppose Assumption 2 holds. If there exists a constant c ≥ 1 such that ∑_n (b/n)^c < ∞ and (10) holds, then Σ̂_w → I_p as n → ∞ w.p. 1, where I_p is the p × p identity matrix.
Proof. First consider the diagonal elements of Σ̂_w, which go to 1 as n → ∞. For i = j, the proof of Proposition 3.1 in Damerdji (1994) bounds the relevant terms by a multiple of ∑_k k Δ_2 w_n(k), and since (10) holds, Σ̂_w,ii → 1 as n → ∞ w.p. 1. Now consider the off-diagonal elements of Σ̂_w, which go to 0 as n → ∞. Here i ≠ j, and by Lemma 3 in Vats et al. (2015) the cross terms vanish as n → ∞ w.p. 1. Then since (10) holds, Σ̂_w,ij → 0 as n → ∞ w.p. 1. Hence Σ̂_w → I_p as n → ∞ w.p. 1.
Proof. We show the result componentwise, i.e., that Σ̂_w,ij → Λ_ij, where Λ_ij denotes the ij entry of the matrix LΣ_wL^T. For ease of exposition, let Y_i = g(X_i) − θ and define the corresponding batch mean vectors. Using the definition of Σ̂_w at (8) with (16) and (17), we obtain (18); taking absolute values in (18) yields the bound at (19). We will show each of the seven terms in (19) goes to 0 as n → ∞ w.p. 1. First we establish a useful inequality: from (9) in Assumption 1, for any component i and sufficiently large n, (20) holds.

1. For any component i, (20) applies since lk < lk + k ≤ n; then using (15) and (21), the term tends to 0 as n → ∞.

2. For any component i, (20) applies; using (15), (22), and Assumption 2, the term tends to 0 as n → ∞.

3. For any component i, using Proposition 2 together with (21) and (23), the term tends to 0 as n → ∞ w.p. 1 by (14).
Since each of the seven terms in (19) goes to 0 as n → ∞ w.p. 1, Σ̂_w → LΣ_wL^T as n → ∞ w.p. 1.

C Preliminaries for Theorem 3
The proof of Theorem 3 uses the results in Corollary 3, Lemma 4 and Lemma 5.
Consider the general family of Bartlett flat top SV estimators introduced by Politis and Romano (1995, 1996), defined by the lag window

w_n(k) = 1 for |k| ≤ cb, and w_n(k) = (1 − |k|/b)/(1 − c) for cb < |k| < b,

where 0 ≤ c ≤ 1. When c = 1/2, the resulting estimator is the SV estimator based on the lag window at (6), denoted previously σ̂²_f. Further define the Brownian motion analogue of σ̂²_f, whose variance is given by Lemma 3. Denote lim_{n→∞} f(n)/g(n) = 0 by f(n) = o(g(n)).
The joint distribution of Z, the conditional distribution of Z_1 | Z_2, and the marginal distribution of Z_2 are given at (60).
From (7), σ̂²_f can be expressed as a linear combination of two Bartlett SV estimators, which are asymptotically equivalent to σ̂²_obm, and σ̂²_w can be expressed as a linear combination of two BM estimators, given at (68) and (69).