Abstract
In this article, we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger et al. (in: Proceedings of the thirtieth conference on uncertainty in artificial intelligence, AUAI Press, pp 623–632, 2014). We derive the asymptotic expansion of the mean integrated squared error for the full data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
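For readers who want a concrete picture of the objects analyzed here, the following short Python sketch forms Gaussian-kernel density estimates of three subposteriors and combines them by a normalized pointwise product on a grid, in the spirit of the parallel estimator of Neiswanger et al. (2014). It is illustrative only: the toy Gaussian model, the flat prior, the exact subposterior draws standing in for MCMC output, and all sample sizes and bandwidths are choices made for this example, not part of the paper.

```python
import numpy as np

def kde(samples, grid, h):
    """Gaussian-kernel density estimate (the analogue of Eq. (33)) evaluated on a grid."""
    u = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)

# Toy model: y_i ~ N(theta, 1) with a flat prior, so each subposterior is Gaussian and the
# full-data posterior is proportional to the product of the M subposteriors.
N, M = 3000, 3
y = rng.normal(0.5, 1.0, size=N)
chunks = np.array_split(y, M)                 # data split across M "machines"

grid = np.linspace(0.3, 0.7, 801)
prod = np.ones_like(grid)
for c in chunks:
    # Exact draws from the m-th subposterior N(mean(c), 1/len(c)) stand in for MCMC output.
    theta_m = rng.normal(c.mean(), 1.0 / np.sqrt(len(c)), size=2000)
    # Silverman-type per-subset bandwidth; the paper's point is that the bandwidths of the
    # combined (product) estimator require their own optimization.
    h_m = 1.06 * theta_m.std() * len(theta_m) ** (-0.2)
    prod *= kde(theta_m, grid, h_m)
prod /= np.trapz(prod, grid)                  # normalized product estimate of the full posterior

exact = np.sqrt(N / (2.0 * np.pi)) * np.exp(-0.5 * N * (grid - y.mean()) ** 2)
print(np.trapz(np.abs(prod - exact), grid))   # L1 error of the combined estimator
```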
References
Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
De Valpine, P. (2004). Monte Carlo state-space likelihoods by weighted posterior kernel density estimation. Journal of the American Statistical Association, 99(466), 523–536.
Duong, T., Hazelton, M. (2005). Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3), 485–506.
Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1), 153–158.
Ge, R., Huang, F., Jin, C., Yuan, Y. (2015). Escaping from saddle points—Online stochastic gradient for tensor decomposition. In Proceedings of conference on learning theory (pp. 797–842), 3–6 July.
Geyser Observation and Study Association. (2017). Old faithful. http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL. Accessed 21 Oct 2017.
Langford, J., Smola, A. J., Zinkevich, M. (2009). Slow learners are fast. Advances in Neural Information Processing Systems, 22, 2331–2339.
Miroshnikov, A., Wei, Z., Conlon, E. M. (2015). Parallel Markov chain Monte Carlo for non-Gaussian posterior distributions. Stat, 4(1), 304–319.
Neiswanger, W., Wang, C., Xing, E. P. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the thirtieth conference on uncertainty in artificial intelligence (pp. 623–632). AUAI Press, 23–27 July.
Nelder, J. A., Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4), 308–313.
Newman, D., Asuncion, A., Smyth, P., Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10, 1801–1828.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065–1076.
Propp, J. G., Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures & Algorithms, 9(1–2), 223–252.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3), 832–837.
Scott, S. L. (2017). Comparing consensus Monte Carlo strategies for distributed Bayesian computation. Brazilian Journal of Probability and Statistics, 31(4), 668–685.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2), 78–88.
Silverman, B. W. (1986). Density estimation for statistics and data analysis (Vol. 26). New York: Chapman & Hall/CRC.
Simonoff, J. (1996). Smoothing methods in statistics. Springer series in statistics. New York: Springer.
Sköld, M., Roberts, G. O. (2003). Density estimation for the Metropolis-Hastings algorithm. Scandinavian Journal of Statistics, 30(4), 699–718.
Smola, A., Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1–2), 703–710.
van Eeden, C. (1985). Mean integrated squared error of kernel estimators when the density and its derivative are not necessarily continuous. Annals of the Institute of Statistical Mathematics, 37(1), 461–472.
Wand, M. P., Jones, M. C. (1994). Multivariate plug-in bandwidth selection. Computational Statistics, 9(2), 97–116.
Wang, X., Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
West, M. (1993). Approximating posterior distributions by mixture. Journal of the Royal Statistical Society. Series B (Methodological), 55(2), 409–422.
Appendix
1.1 Kernel density estimators and asymptotic error analysis
In this section, we will use the following notation. The function f denotes a probability density, and its kernel density estimator is given by
$$\begin{aligned} \widehat{f}_{n,h}(x) = \frac{1}{nh}\sum _{i=1}^{n} K\!\left( \frac{x-X_i}{h}\right) , \end{aligned}$$(33)
where \(X_1,X_2,\ldots ,X_n \sim f\) are i.i.d. samples, K is a kernel function, and \(h>0\) is the bandwidth.
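For concreteness, suppose K is the standard Gaussian kernel \(K(t)=(2\pi )^{-1/2}e^{-t^2/2}\) (assumed here to be admissible under (H3) and (H4), which are not restated in this appendix). Then (33) reads
$$\begin{aligned} \widehat{f}_{n,h}(x) = \frac{1}{nh\sqrt{2\pi }}\sum _{i=1}^{n}\exp \left( -\frac{(x-X_i)^2}{2h^2}\right) , \end{aligned}$$
and, reading \(k_2\) and \(k_3\) as the second and third absolute moments of K, the constants used below evaluate to \(k_2=\int t^2K(t)\,\mathrm{d}t=1\), \(k_3=\int |t|^3K(t)\,\mathrm{d}t=2\sqrt{2/\pi }\), and \(\int K^2(t)\,\mathrm{d}t=1/(2\sqrt{\pi })\).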
Lemma 6
(Bias expansion) Let K satisfy (H3) and (H4). Let f be a probability density function satisfying (H5) and (H6). Let \(\widehat{f}_{n,h}(x)\) be the estimator of f given by (33). Then,
(i) \(\text {bias}(\widehat{f}_{n,h})\) is given by
$$\begin{aligned} \big [\text {bias}(\widehat{f}_{n,h})\big ] (x)&= \mathbb {E}\big [\widehat{f}_{n,h}(x)\big ]-f(x)=\frac{h^2k_2f''(x)}{2}+ E_b(x ; h), \end{aligned}$$(34)
where
$$\begin{aligned} E_b(x;h) :=\int _{\mathbb {R}}K(t)\left( \int _{x}^{x-ht}\frac{f'''(z)(x-ht -z)^2}{2}\,\mathrm{d}z\right) \,\mathrm{d}t. \end{aligned}$$
(ii) For all \(h>0\), the term \(E_b(\cdot \, ; h)\) satisfies the bounds
$$\begin{aligned} |E_b(x ; h)| \,&\le \, \frac{C k_3}{6}h^3, \quad x\in \mathbb {R},\\ \int _{\mathbb {R}} |E_b(x ; h)|\, \mathrm{d}x \,&\le \, \frac{C k_3}{6} h^3 ,\\ \int _{\mathbb {R}} |E_b(x ; h)|^2\, \mathrm{d}x \,&\le \,\frac{C^2 k_3^2}{36} h^6, \end{aligned}$$(35)
for some constant C.
(iii) The integrated squared bias of \(\widehat{f}_{n,h}\) satisfies
$$\begin{aligned} \int _{\mathbb {R}} \text {bias}^2(\widehat{f}_{n,h}) \, \mathrm{d}x \, = \, \frac{h^4 k_2^2}{4}\int _{\mathbb {R}} (f''(x))^2 \, \mathrm{d}x + \mathcal {E}_b(n,h) \, < \, \infty \end{aligned}$$
with
$$\begin{aligned} |\mathcal {E}_b(n,h)| \le C_b \left( k_2 + \frac{k_3}{6}h\right) \frac{k_3 h^5}{6}, \end{aligned}$$(36)
for some constant \(C_b\), and all \(n \ge 1\), \(h>0\).
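To see the size of the terms in (iii) on a concrete example, take the Gaussian-kernel constants \(k_2=1\) and \(k_3=2\sqrt{2/\pi }\) from the illustration above and let f be the standard normal density, so that \(f''(x)=(x^2-1)f(x)\) (an illustrative choice, not an assumption of the lemma). The leading bias in (34) is then \(\frac{h^2}{2}(x^2-1)f(x)\), and
$$\begin{aligned} \int _{\mathbb {R}} (f''(x))^2 \, \mathrm{d}x = \frac{3}{8\sqrt{\pi }}, \qquad \int _{\mathbb {R}} \text {bias}^2(\widehat{f}_{n,h}) \, \mathrm{d}x = \frac{3}{32\sqrt{\pi }}\,h^4 + \mathcal {E}_b(n,h), \end{aligned}$$
with \(|\mathcal {E}_b(n,h)| = O(h^5)\) by (36).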
Proof
Using (33) and the fact that \(X_i\), \(i=1,\ldots ,n\) are i.i.d., we obtain
where we used the substitution \(t=(x-y)/h\). Employing Taylor’s theorem with an error term in integral form and using (H3), we get
which proves (i).
By (H4), we have
and by (H6), using the substitution \(\alpha =x-ht-z\) and employing Tonelli’s theorem, we obtain
Thus, combining the two bounds above, we conclude
Observe that
By (H5) and (H6), we have \(\int _{\mathbb {R}} (f''(x))^2 \, \hbox {d}x < \infty \). Hence, by setting \(C_b=C^2\), using (39) and (40), we obtain (36). \(\square \)
Lemma 7
(Variance expansion) Let K satisfy (H3) and (H4), with \(r=2\). Let f satisfy (H5) and (H6), and let \(\widehat{f}_{n,h}(x)\) be the estimator of f given by (33). Then,
(i) \(\mathbb {V}(\widehat{f}_{n,h})\) is given by
$$\begin{aligned} \big [\mathbb {V}(\widehat{f}_{n,h})\big ](x) =\frac{f(x)}{nh}\int _{\mathbb {R}}K^2(t)\,\mathrm{d}t+ E_{V}(x ; n,h), \quad x \in \mathbb {R}, \end{aligned}$$(41)
with
$$\begin{aligned} E_V(x;n,h)&= -\frac{1}{n}\left( \int _{\mathbb {R}} t K^2(t) \int _{0}^{1} f'(x-htu)\, \mathrm{d}u \,\mathrm{d}t \right. \\&\left. \quad +\Big (f(x)+\big [\text {bias}(\widehat{f}_{n,h})\big ](x)\Big )^2 \right) . \end{aligned}$$(42)
(ii) The term \(E_V(x ; n,h)\) satisfies
$$\begin{aligned} \mathcal {E}_V(n,h)&=\left| \int _{\mathbb {R}} E_V(x ; n,h)\,\mathrm{d}x\right| \\&\le \frac{C_V}{n} \left( 2 + h^2 k_2 + \Big (k_2 + \frac{k_3}{3}h\Big ) \frac{h^5}{6}k_3\right) . \end{aligned}$$(43)
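Continuing the Gaussian-kernel illustration, \(\int K^2(t)\,\mathrm{d}t = 1/(2\sqrt{\pi })\), so (41) and (43) give
$$\begin{aligned} \big [\mathbb {V}(\widehat{f}_{n,h})\big ](x) = \frac{f(x)}{2\sqrt{\pi }\,nh} + E_V(x;n,h), \qquad \int _{\mathbb {R}} \big [\mathbb {V}(\widehat{f}_{n,h})\big ](x)\,\mathrm{d}x = \frac{1}{2\sqrt{\pi }\,nh} + O\Big (\frac{1}{n}\Big ), \end{aligned}$$
for h in a bounded range; since \(\int f=1\), the integrated variance is dominated by the \(1/(nh)\) term as \(h\rightarrow 0\).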
Proof
Using (34) and the fact that \(X_i\), \(i=1,\dots ,n\) are i.i.d., we obtain
We next estimate the terms
Observe that (H5)–(H6) imply
for any \(\alpha \in \mathbb {R}\). Then, using Tonelli’s Theorem and (H4), we obtain
Since \(E_1\) is integrable, we can use Fubini’s theorem, and this yields
where we used the fact that \(\lim _{x\rightarrow \pm \infty }f(x) = 0\). Next, by (H5) and (35), we get
Combining the above estimates, we obtain (43). \(\square \)
Lemma 8
(Kernel autocorrelation) Let K satisfy (H3) and (H4). Then the function
satisfies
Moreover, for any sufficiently smooth f(x)
Proof
Since \(K \ge 0\), we have \(K_2 \ge 0\). Moreover, we have
and this proves the first property. Similarly, recalling that \(\int z K(z) \,\hbox {d}z =0\), we obtain
Next, we take any smooth function f and compute
Finally, we estimate the last term in the above formula as follows:
\(\square \)
Lemma 9
(Product expectation) Let K satisfy (H3) and (H4), with \(r=2\). Let f be a probability density function that satisfies (H5) and (H6), and let \(\widehat{f}_{n,h}(x)\) be the estimator of f given by (33). Then,
where the error term
satisfies
for some constant \(C_\varPi \), where C is the constant given in (H6) and \(K_2\) is defined in Lemma 8.
Proof
By the definition of the estimator \(\widehat{f}\), we have
Since the samples \(\{X_i\}_{i=1}^{N}\) are i.i.d., we split the calculation into two parts: one over the terms whose indices coincide and one over the terms whose indices differ. We then use the independence of the samples to simplify the calculation
where \(X=X_1\). The first expectation term in (45) can be expanded as
Let us denote
Then, we obtain
and this establishes (44).
Observe that (H3), (H4), and (H5) imply
Next, according to (34) and (35)
where C is the maximum of the constants from (H5), and hence
Combining the above estimates, we conclude that
To obtain bounds on the integral of the error term, let us consider each component of the error separately. The term \(E_{\varPi ,1}\) is integrable
Next using Fubini’s theorem, we obtain
Therefore, using Lemma 6, (34), (35), and the Hypothesis (H6), we obtain
Finally, directly from (46), (34), and (35), we obtain
\(\square \)
Theorem 3
(MISE expansion) Let K satisfy (H3) and (H4), with \(r=2\). Let f be a probability density function that satisfies (H5) and (H6), and let \(\widehat{f}_{n,h}(x)\) be the estimator of f given by (33). Then,
with \(\mathcal {E}_b\) and \(\mathcal {E}_V\) defined in (40) and (43), respectively. Moreover, for every \(H>0\), there exists \(C_{f,K,H}\) such that
for all \(n \ge 1\) and \(H \ge h > 0\).
Proof
It is easy to show (see Silverman 1986) that
$$\begin{aligned} \text {MISE}(\widehat{f}_{n,h}) = \int _{\mathbb {R}} \mathbb {E}\big [(\widehat{f}_{n,h}(x)-f(x))^2\big ]\,\mathrm{d}x = \int _{\mathbb {R}} \text {bias}^2(\widehat{f}_{n,h}) \, \mathrm{d}x + \int _{\mathbb {R}} \big [\mathbb {V}(\widehat{f}_{n,h})\big ](x)\,\mathrm{d}x, \end{aligned}$$
and hence the result follows from Lemma 6 and Lemma 7. \(\square \)
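A standard consequence of the theorem is worth recording (cf. Silverman 1986): minimizing over h only the leading terms \(\frac{1}{nh}\int K^2(t)\,\mathrm{d}t+\frac{h^4k_2^2}{4}\int (f''(x))^2\,\mathrm{d}x\), the remainders being of lower order by (36) and (43), recovers the classical single-sample bandwidth and error rates
$$\begin{aligned} h_{\mathrm{opt}} = \left( \frac{\int _{\mathbb {R}}K^2(t)\,\mathrm{d}t}{k_2^2\int _{\mathbb {R}}(f''(x))^2\,\mathrm{d}x}\right) ^{1/5} n^{-1/5}, \qquad \text {MISE}(\widehat{f}_{n,h_{\mathrm{opt}}}) = \frac{5}{4}\left( \int _{\mathbb {R}}K^2(t)\,\mathrm{d}t\right) ^{4/5}\left( k_2^2\int _{\mathbb {R}}(f''(x))^2\,\mathrm{d}x\right) ^{1/5} n^{-4/5} + o\big (n^{-4/5}\big ), \end{aligned}$$
so the optimal full-data estimator attains the classical \(n^{-4/5}\) rate.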