Model selection of stochastic simulation algorithms based on generalized divergence measures

MCMC (Markov Chain Monte Carlo) methods are a class of methods used to draw samples from a probability distribution $P$. They are typically used when it is difficult to sample directly from $P$. This distribution is then taken as a target, and one generates a Markov chain $(X_n)_{n\in\mathbb{N}}$ such that $X_n\sim P$ approximately when $n$ is large. These MCMC methods comprise several simulation strategies, including the \emph{Independence Sampler (IS)}, the \emph{Random Walk Metropolis-Hastings \small{(RWMH)}}, the \emph{Gibbs sampler}, the \emph{Adaptive Metropolis (AM)} and the \emph{Metropolis Within Gibbs (MWG)} strategy. Each of these strategies generates a Markov chain and is associated with a convergence speed. For a given target law, it is therefore of interest to compare several simulation strategies in order to determine the best one. Chauveau and Vandekerkhove \cite{Chauv2007} compared the IS and RWMH strategies using the Kullback-Leibler divergence. In this article we compare the five simulation methods mentioned above using generalized divergence measures taken from the family of $\alpha$-divergence measures \cite{Cichocki2010}, indexed by a parameter $\alpha$: the R\'enyi divergence, the Tsallis divergence and the $D_\alpha$ divergence.


Introduction
In many areas of science the computer is essential: computers are used for testing, calculations, simulations, and so on. Beyond performance, computation time is very important in the choice of calculation methods. In statistics, the simulation of random variables is required for tasks such as integral calculation. One can resort to numerical analysis methods or to probabilistic methods; it is the latter that concern us here. The integral is rewritten as an expectation $E(X)$, and one uses the law of large numbers for Monte Carlo integration or the ergodic theorem for MCMC methods. We focus here on MCMC methods, with which one samples from a density $f$ via Markov chains. Among MCMC methods we have the Metropolis-Hastings algorithms, the Gibbs sampler, and other adaptive and hybrid algorithms.
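As a simple illustration (not taken from the paper), the Monte Carlo side of this idea, rewriting an integral as an expectation and averaging i.i.d. draws, can be sketched as follows:

```python
import numpy as np

# Illustrative only: estimate I = integral_0^1 exp(-x^2) dx by writing it as
# E[g(X)] with X ~ Uniform(0, 1) and averaging i.i.d. draws (law of large numbers).
rng = np.random.default_rng(0)

def g(x):
    return np.exp(-x ** 2)

N = 100_000
samples = rng.uniform(0.0, 1.0, size=N)
values = g(samples)
estimate = values.mean()                       # converges to E[g(X)] = I as N grows
std_error = values.std(ddof=1) / np.sqrt(N)    # Monte Carlo standard error
print(f"Monte Carlo estimate: {estimate:.5f} +/- {std_error:.5f}")
```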
Thus the major challenge for these algorithms, as for many computational techniques, is saving time, i.e. converging quickly to the stationary distribution with density $f$, since the convergence time is a mark of efficiency. This paper investigates the problem of selecting a good simulation strategy. We have several simulation strategies: the Independence Sampler, the Random Walk Metropolis-Hastings [10], the Gibbs sampler, Metropolis Within Gibbs [8], Adaptive Metropolis [7], and so on. The problem for all these strategies is the time they take to generate a Markov chain which converges to the stationary distribution with probability density function $f$. The objective of this paper is to compare different simulation strategies in order to select the best one. Chauveau and Vandekerkhove [4] gave in their paper a methodology for selecting the best of $k$ candidate strategies with the Kullback-Leibler divergence. In this article we use generalized divergence measures, namely α-divergences [11], to compare simulation strategies. The divergence measures we use are the α-divergence $D_\alpha$, the Rényi α-divergence $R_\alpha$ and the Tsallis α-divergence $T_\alpha$. These divergences are all generalizations of the Kullback-Leibler divergence: we recover the Kullback-Leibler divergence when their parameter α tends to 1. If we have two simulation strategies $s_1$ and $s_2$ with respective probability densities $p_1^n$ and $p_2^n$ at time $n$ (where $p_i^n$ denotes the density of $X_n$ for the Markov chain generated by strategy $s_i$), and a target density $f$ from which we want samples, we can use these divergence measures to compare $s_1$ and $s_2$. Indeed, considering for example the Rényi divergence, we can compare $R_\alpha(p_1^n, f)$ and $R_\alpha(p_2^n, f)$ at each iteration $n$ to see which of $s_1$ and $s_2$ is the best.
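The α → 1 limit can be checked directly; for instance, for the Tsallis divergence in its standard form (assumed here, see Section 2), L'Hôpital's rule gives
\[
\lim_{\alpha\to 1} T_\alpha(p,f)
=\lim_{\alpha\to 1}\frac{\int p^\alpha f^{1-\alpha}\,d\lambda-1}{\alpha-1}
=\left.\frac{d}{d\alpha}\int p^\alpha f^{1-\alpha}\,d\lambda\right|_{\alpha=1}
=\int p\,\log\frac{p}{f}\,d\lambda=K(p,f).
\]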
We present these three divergence measures in Section 2; they belong to the family of Csiszár φ-divergences. In Section 3 we show two important results: the convergence to 0 of these divergences, and the construction of an estimator for each divergence. In Section 4 we give the asymptotic distributions of our estimators. We then present some application examples to illustrate our methodology (Section 5). Finally, we discuss our various examples in Section 6 and draw the lessons that follow therefrom.

Description of divergence measures
Let $(\mathcal X, \mathcal A, \lambda)$ be an arbitrary measure space with $\lambda$ a finite or σ-finite measure. Let also $\mu_1, \mu_2$ be probability measures on $\mathcal X$ such that $\mu_1, \mu_2 \ll \lambda$ (absolute continuity). Denote by $p_i(x)$ the Radon-Nikodym derivatives (densities) of $\mu_i$ with respect to $\lambda$.

Definition 2.1 (Kullback-Leibler divergence). The Kullback-Leibler relative divergence (also called relative entropy) between the two probability measures $\mu_1, \mu_2$ is defined by
\[
K(\mu_1, \mu_2) = \int_{\mathcal X} p_1(x) \log \frac{p_1(x)}{p_2(x)}\, \lambda(dx).
\]
We also write $K(p_1, p_2)$.
In this article we focus on a particular family of Csiszár φ-divergences, the family of α-divergence measures. We thus have the α-divergence [1], the Rényi α-divergence and the Tsallis α-divergence, with which we will work.
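For reference, in the standard parameterization of this family (the one used by Póczos and Schneider [11], which we assume throughout), the three measures can be written in terms of the common integral $M_\alpha(p_1, p_2) = \int p_1^\alpha\, p_2^{1-\alpha}\, d\lambda$:
\[
D_\alpha(p_1,p_2)=\frac{M_\alpha(p_1,p_2)-1}{\alpha(\alpha-1)},\qquad
T_\alpha(p_1,p_2)=\frac{M_\alpha(p_1,p_2)-1}{\alpha-1},\qquad
R_\alpha(p_1,p_2)=\frac{\log M_\alpha(p_1,p_2)}{\alpha-1},
\]
each of which tends to the Kullback-Leibler divergence $K(p_1, p_2)$ as $\alpha \to 1$.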
Definition 2.4 (Tsallis α-divergence). The Tsallis α-divergence $T_\alpha$ is the third member of this family, as recalled above.

Chauveau and Vandekerkhove [4] studied simulation strategies with the Kullback-Leibler divergence: their study compared simulation strategies by means of the Kullback-Leibler divergence, which allowed them to choose the best among candidate strategies. They used the Metropolis-Hastings (MH) algorithm to explain their method. Thus, with a target density $f$, a proposal density $q$ and an initial density $p_i^0$ corresponding to strategy $i$ of the MH algorithm, they assumed several conditions on these densities in order to obtain certain regularity properties, such as the Lipschitz property, for the successive densities $p_i^n$ of strategy $i$. This allowed them to obtain a consistent estimate of the entropy. In the present paper we propose to study how to find the optimal stochastic simulation algorithm, which may come from the MH methods (IS and RWMH), the Gibbs sampler, or recent adaptive (AM) and hybrid (MWG) methods. For this we will use our three divergence measures; we recover the Kullback-Leibler divergence when, for each divergence, the parameter α tends to 1. The interest of these divergences is that they spare the target density $f$, the proposal density $q$ and the initial density $p_i^0$ many of the assumptions needed in Chauveau and Vandekerkhove [5].

Convergence of α-divergence measures
We will use a result of Holden (1998) to show the geometric convergence of the MH algorithm under a minorization condition: if there exists $a \in (0, 1)$ such that $q(y|x) \ge a f(y)$ for all $x, y \in \mathcal X$, then the chain converges geometrically to $f$ (inequality (3.5)).

Theorem 3.1. If the proposal density of the Metropolis-Hastings algorithm satisfies $q(y|x) \ge \delta f(y)$ for all $x, y \in \mathcal X$, with $\delta \in (0, 1)$, then for $\alpha \in (1, +\infty)$ the divergences $D_\alpha(p^n, f)$, $T_\alpha(p^n, f)$ and $R_\alpha(p^n, f)$ tend to 0 as $n \to \infty$.

Proof.
Using (3.5), and applying the same procedure to each of the three measures in turn, we obtain the stated bounds. Since the three divergence measures are nonnegative, they converge to zero under the conditions of Theorem 3.1. This shows that the densities $p^n$ of the random variables $X_n$ of the Markov chain converge to the stationary density $f$ as $n$ goes to infinity.
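To fix ideas, here is a sketch of the mechanism for $T_\alpha$ with $\alpha \in (1, +\infty)$, assuming Holden's bound is available in the relative form $\sup_x |p^n(x)/f(x) - 1| \le (1-\delta)^n$ (this precise form is our assumption):
\[
M_\alpha(p^n,f)=\int f\left(\frac{p^n}{f}\right)^{\!\alpha} d\lambda
\le\big(1+(1-\delta)^n\big)^{\alpha},
\qquad\text{hence}\qquad
0\le T_\alpha(p^n,f)=\frac{M_\alpha(p^n,f)-1}{\alpha-1}
\le\frac{\big(1+(1-\delta)^n\big)^{\alpha}-1}{\alpha-1}\xrightarrow[n\to\infty]{}0,
\]
and the same argument applies to $D_\alpha$ and $R_\alpha$, since both are functions of $M_\alpha(p^n, f) \to 1$.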

Estimators for the three divergence measures
We first construct an estimator of the α-divergence $D_\alpha$; the estimators of the Tsallis divergence $T_\alpha$ and the Rényi divergence $R_\alpha$ are obtained in the same way. Suppose $\lambda$ is the Lebesgue measure on $\mathcal X = \mathbb R^d$. One can first think of writing the integral as a mathematical expectation and applying Monte Carlo integration; by the strong law of large numbers, the resulting estimator converges almost surely to $D_\alpha(p^n, f)$.
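Concretely, the importance-ratio identity behind this naive estimator (a standard rewriting, not taken verbatim from the paper) is
\[
M_\alpha(p^n,f)=\int (p^n(x))^{\alpha}(f(x))^{1-\alpha}\,dx
=\mathbb{E}_{X\sim p^n}\!\left[\Big(\frac{f(X)}{p^n(X)}\Big)^{1-\alpha}\right]
\approx\frac{1}{N}\sum_{i=1}^{N}\Big(\frac{f(X_i)}{p^n(X_i)}\Big)^{1-\alpha},
\qquad X_i\overset{\text{i.i.d.}}{\sim}p^n,
\]
which converges almost surely by the strong law of large numbers, but requires $p^n$ and $f$ in closed form.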
In practice, however, one very often encounters cases where the density $f$ and the densities $p^n$ are not known analytically. There are also cases where $f$ is known only up to a multiplicative constant: the Bayesian context.
It is often difficult to obtain the analytical form of $p^n$. If $f$ is known analytically, it is not necessary to estimate it; but let us consider the more general case where neither $p^n$ nor $f$ is known analytically. We then use nonparametric density estimation methods. Póczos and Schneider [3] proposed an estimator of these densities based on the k-NN (k nearest neighbors) method.

k-NN based density estimator
This density estimation method was proposed by Loftsgaarden and Quesenberry [9]. It uses a training set $E = \{x_1, x_2, \ldots, x_N\}$ where the $x_i$ are independent observations with common probability density $g$. For a point $z$, the problem is to estimate the density $g$ at $z$. The principle is to find, among the $x_i$, the kth nearest neighbor of $z$ for the Euclidean distance, and to consider the distance between the two points $x_{i,k}$ and $z$. If $x_{i,k}$ and $z$ are points of $\mathbb R^d$, this distance $d(z, x_{i,k})$ gives the volume of the hypersphere with center $z$ and radius $r = d(z, x_{i,k})$.
The proposed estimator $\hat g$ is then built from this volume. Loftsgaarden and Quesenberry (1965) suggested, after several experiments, taking $k$ equal to $\sqrt N$, rounded if necessary to the nearest integer.
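A minimal sketch of this estimator in code (illustrative; the function name and the use of the usual $\hat g(z) = k/(N\,\bar V)$ form, with $\bar V$ the hypersphere volume described above, are our choices):

```python
import numpy as np
from scipy.special import gamma

def knn_density(z, train, k=None):
    """k-NN density estimate at point z from an (N, d) training sample.

    Illustrative sketch: g_hat(z) = k / (N * volume of the d-ball whose radius is
    the distance from z to its k-th nearest neighbour), as described in the text.
    """
    train = np.atleast_2d(train)
    N, d = train.shape
    if k is None:
        k = max(1, int(round(np.sqrt(N))))            # Loftsgaarden-Quesenberry choice
    dists = np.sort(np.linalg.norm(train - z, axis=1))
    r = dists[k - 1]                                  # distance to the k-th nearest neighbour
    unit_ball = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit d-ball
    return k / (N * unit_ball * r ** d)

# Example: estimate a standard normal density at 0 from 2000 draws
rng = np.random.default_rng(1)
sample = rng.normal(size=(2000, 1))
print(knn_density(np.array([0.0]), sample))           # close to 1/sqrt(2*pi) ~ 0.399
```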
Convergence theorems for these k-NN density estimators, showing their consistency, can be found in [11].
Theorem 3.2 (k-NN density estimators, convergence in probability). Let $k(n)$ be an integer denoting the rank of the nearest neighbor used in a sample of size $n$. If $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$, the estimator converges in probability to the true density at every continuity point.

a) Application to $p^n$

Let $X_{1:N} = (X_1, \ldots, X_N)$ be a sample simulated from the law with density $p^n$, the $X_i$ being i.i.d. We choose one $X_i$ in the sample and let $\rho_{k,i}$ be the Euclidean distance between $X_i$ and its kth nearest neighbor in the sample.
We also need i.i.d. simulations from the density $f$. But the premise of this article is that we do not know how to simulate i.i.d. directly from $f$; we therefore apply the MCMC simulation algorithms. After several iterations (large $n$) we can begin to collect samples, taking the precaution that two successive retained samples are separated by $n_0$ iterations. We thus obtain samples $Y_{1:M} = (Y_1, \ldots, Y_M)$ of random variables which are 'independent' with density $f$. Let $\gamma_{k,i}$ be the Euclidean distance between $X_i$ and its kth nearest neighbor in the sample $Y_{1:M}$.
The estimator of $f$ at $X_i$ is obtained in the same way. Póczos and Schneider (2011) proposed an estimator of the integral part that is common to the three divergence measures presented above. Their first estimator is asymptotically biased; they showed that, multiplied by a suitable constant, it becomes asymptotically unbiased. Starting from it, we construct our estimators of the α-divergence $D_\alpha$, the Rényi α-divergence $R_\alpha$ and the Tsallis α-divergence $T_\alpha$.
Once the densities $p^n$ and $f$ are estimated at the points $X_i$, we obtain the estimator $\hat D_{\alpha,N}(p^n, f)$, built on the quantity $\hat M_{\alpha,N}(p^n, f)$,
which is an asymptotically biased estimator of $M_\alpha(p^n, f)$ (Póczos and Schneider (2011)). Multiplying it by a constant $B_{k,\alpha}$, independent of $p^n$ and $f$ and expressed with the Gamma function $\Gamma(\cdot)$, we obtain an asymptotically unbiased estimator.
We thus obtain a new, asymptotically unbiased estimator of $D_\alpha(p^n, f)$; applying the same process, we obtain an asymptotically unbiased estimator of the Tsallis divergence $T_\alpha(p^n, f)$. The Rényi divergence has a particular feature due to the presence of the logarithm, which takes $M_\alpha(p^n, f)$ as its argument: we have not shown in this paper that its estimator is unbiased or asymptotically unbiased, but we nevertheless propose an estimator $\hat R_{\alpha,N,k}(p^n, f)$. We now give the theorems of Póczos and Schneider [11] showing that $\hat M_{\alpha,N}(p^n, f)\,B_{k,\alpha}$ is asymptotically unbiased.
Theorem 3.5 ($L^2$ consistency). Assume that $\int\!\!\int \|x-y\|^{\gamma}\, p(y)\,p(x)\,dy\,dx < \infty$ and that $q$ is bounded above. Then the estimator converges in $L^2$; that is, the estimator is $L^2$ consistent.
The last theorem shows the consistency of $\hat M_{\alpha,N}(p^n, f)$, the estimator of the integral part common to the three divergences. Based on this, the estimators $\hat R_{\alpha,N,k}(p^n, f)$, $\hat D_{\alpha,N,k}(p^n, f)$ and $\hat T_{\alpha,N,k}(p^n, f)$ are also consistent.
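The following sketch puts these pieces together in code (illustrative; we assume the Póczos-Schneider form of the estimator and of the correction constant $B_{k,\alpha} = \Gamma(k)^2 / (\Gamma(k-\alpha+1)\Gamma(k+\alpha-1))$, and the plug-in Tsallis form recalled in Section 2):

```python
import numpy as np
from scipy.special import gamma

def knn_distances(points, query, k):
    """Distance from each query point to its k-th nearest neighbour among `points`.
    If `points` and `query` are the same array, the zero self-distance is skipped."""
    same = points is query
    dists = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)
    dists.sort(axis=1)
    return dists[:, k] if same else dists[:, k - 1]

def M_alpha_hat(x_sample, y_sample, alpha, k=5):
    """Sketch of the k-NN estimator of M_alpha(p^n, f) = int (p^n)^alpha f^(1-alpha).

    Assumed form (Poczos and Schneider, 2011): plug the k-NN density estimates of p^n
    (from x_sample ~ p^n) and f (from y_sample ~ f) into the importance-ratio average,
    then multiply by B_{k,alpha} = Gamma(k)^2 / (Gamma(k-alpha+1) * Gamma(k+alpha-1))
    to remove the asymptotic bias.
    """
    x_sample = np.atleast_2d(x_sample.T).T            # ensure shape (N, d)
    y_sample = np.atleast_2d(y_sample.T).T            # ensure shape (M, d)
    N, d = x_sample.shape
    M = y_sample.shape[0]
    rho = knn_distances(x_sample, x_sample, k)        # k-NN radii of X_i within X_{1:N}
    nu = knn_distances(y_sample, x_sample, k)         # k-NN radii of X_i within Y_{1:M}
    ratio = ((N - 1) * rho ** d) / (M * nu ** d)      # unit-ball volumes cancel in the ratio
    B = gamma(k) ** 2 / (gamma(k - alpha + 1) * gamma(k + alpha - 1))
    return B * np.mean(ratio ** (1.0 - alpha))

def tsallis_hat(x_sample, y_sample, alpha, k=5):
    """Plug-in Tsallis estimator T_alpha = (M_alpha - 1) / (alpha - 1)."""
    return (M_alpha_hat(x_sample, y_sample, alpha, k) - 1.0) / (alpha - 1.0)
```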

Asymptotic distribution of divergence estimators
We now seek the asymptotic distribution of our estimators. It is studied under the assumption that the densities $p^n$ and $f$ are continuous; if $p^n$ and $f$ are continuous, their estimators are continuous as well.

Asymptotic distribution of α-divergence estimator
Recall the asymptotically unbiased estimator of $M_\alpha(p^n, f)$ constructed above. We know that the $X_i$ are i.i.d. $\sim p^n$. The function $h$, being the ratio of two continuous functions, is also continuous; hence the $h(X_i)$, $i = 1, \ldots, N$, are independent, and we can apply the central limit theorem, obtaining a normal distribution. But $\bar Y_N$ is asymptotically biased (Póczos and Schneider [11]); multiplying it by the constant $B_{k,\alpha}$, as stated by these authors, gives an asymptotically unbiased estimator of $M_\alpha(p^n, f) = \int (p^n(x))^{\alpha} (f(x))^{1-\alpha}\,dx$. The limit holds as $N, M \to \infty$, with $\sigma^2 < \infty$.
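Written out, the central limit theorem step takes the usual form (the notation $\bar Y_N = \frac1N \sum_{i=1}^N h(X_i)$ is ours):
\[
\sqrt{N}\left(\bar Y_N-\mathbb E\big[h(X_1)\big]\right)\xrightarrow{\;d\;}\mathcal N\!\left(0,\sigma^2\right),
\qquad \sigma^2=\operatorname{Var}\big(h(X_1)\big)<\infty .
\]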

Asymptotic distribution of Tsallis α-divergence estimator
The Tsallis divergence estimator also follows a normal distribution asymptotically. As for the previous estimator, the same procedure applies to $\hat T_{\alpha,N,k}(p^n, f)$, with $N, M \to \infty$ and $\sigma^2 < \infty$.

Asymptotic distribution for Rényi α-divergence estimator
We now return to our estimator of the Rényi divergence. Since $h$ is continuous and the $X_i$ are i.i.d., the $h(X_i)$ are also i.i.d. According to the central limit theorem we obtain a normal limit for $\bar Y_N$; we can then apply the delta method. The remaining ingredients, in particular the bias-correction constant, follow from Póczos and Schneider [11].
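In outline, with $g(u) = \log(u)/(\alpha-1)$ applied to the estimator $\bar Y_N$ of $\mu = M_\alpha(p^n, f)$ (our notation), the delta method gives
\[
\sqrt{N}\big(\bar Y_N-\mu\big)\xrightarrow{\;d\;}\mathcal N(0,\sigma^2)
\;\Longrightarrow\;
\sqrt{N}\big(g(\bar Y_N)-g(\mu)\big)\xrightarrow{\;d\;}\mathcal N\!\left(0,\frac{\sigma^2}{(\alpha-1)^2\mu^2}\right).
\]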

Examples
We now illustrate our methodology with simple examples; for this reason we limit ourselves to one-dimensional and two-dimensional cases. In the following examples we use the divergence measures to compare proposal densities corresponding to a given simulation strategy; the densities may differ only through their parameters, and the comparison is made for different parameter values.
The proposal density considered optimal is the one for which the divergence measure (as a function of $n$) between the successive densities $p^n$ and the target density $f$ tends to 0 fastest. As already stated, we will mainly compare simulation strategies.

Target density f fully known
In the case where the target density $f$ is analytically known, the estimators of our divergence measures are slightly modified. The change concerns the constant $B_{k,\alpha}$, which is replaced by another constant so as to again obtain asymptotically unbiased estimators. If we replace the estimator of the density $f$ by $f$ itself, we obtain another estimator of the common integral part $M_\alpha(p^n, f)$, but an asymptotically biased one. As in the proof of Theorem 3.4 (Póczos and Schneider (2011)), one can check that this new estimator multiplied by the constant $Q_{k,\alpha} = \Gamma(k-\alpha+1)$ provides an asymptotically unbiased estimator of $M_\alpha(p^n, f)$.
In one dimension we are in $\mathbb R$, so $d = 1$. Consequently the $d$-dimensional hypersphere around $X_i$ reduces to a line segment centered at $X_i$. The real $c$ represents the length of a segment centered at $X_i$; this length equals 2, since the distance from $X_i$ to each end is 1. As suggested by Loftsgaarden and Quesenberry [9], we take $k$ equal to the nearest integer of $\sqrt{N-1}$. We then obtain asymptotically unbiased estimators $\hat D_{\alpha,N,k}(p^n, f)$ and $\hat T_{\alpha,N,k}(p^n, f)$ of $D_\alpha(p^n, f)$ and $T_\alpha(p^n, f)$ respectively. For the estimator $\hat R_{\alpha,N,k}(p^n, f)$ of the Rényi divergence we have no results regarding the presence or absence of bias; but, most importantly, the three divergence estimators $\hat D_{\alpha,N,k}(p^n, f)$, $\hat T_{\alpha,N,k}(p^n, f)$ and $\hat R_{\alpha,N,k}(p^n, f)$ are all consistent.
First we show an example that compares two proposal densities for a given strategy. We choose the Independence Sampler (IS), which is one of the strategies of the Metropolis-Hastings algorithm, to compare these densities.

a) Independence Sampler (IS): comparison of proposal densities
For a given simulation strategy the choice of a good proposal density is important: satisfactory simulation results depend on it. For simplicity, we consider proposal densities that differ only through the values of their parameters. Thus we take the densities of $N(-3, 2)$ and $N(0, 3)$ respectively; the target density is that of the standard normal distribution $N(0, 1)$. Looking at Figure 1, the dashed curve converges very rapidly to 0 while the solid curve is slow to converge.

b) Independence Sampler - Random Walk Metropolis-Hastings

After finding a good proposal distribution for each strategy, we now compare the two main strategies of the MH algorithm, namely IS and RWMH. It may happen in one experiment that the IS strategy outperforms the RWMH strategy and in another experiment that the opposite occurs. Everything depends on the instrumental distribution, but also to some extent on the target distribution, even if the initial law is the same for both strategies.
Here the target distribution is the Gaussian mixture $0.4\,N(-8, 2) + 0.6\,N(0, 6)$. For the Independence Sampler we use the proposal distribution $N(-2.5, 15)$, whereas for the RWMH method we take $N(x, 15)$, whose mean equals the current element $X_n = x$. We show the comparison of the IS and RWMH strategies in Fig. 2. Note that the curve associated with the IS strategy is below the one associated with RWMH; however, both curves converge very quickly to 0.
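As an illustration (not taken from the paper; parameter values follow the description above, with the second parameter of $N(\cdot,\cdot)$ read as a variance), the two samplers can be sketched as follows:

```python
import numpy as np

def log_mixture(x):
    """Log of the target density 0.4*N(-8, 2) + 0.6*N(0, 6); the second parameters
    are read as variances (our reading of the notation)."""
    comp1 = 0.4 * np.exp(-(x + 8) ** 2 / (2 * 2)) / np.sqrt(2 * np.pi * 2)
    comp2 = 0.6 * np.exp(-x ** 2 / (2 * 6)) / np.sqrt(2 * np.pi * 6)
    return np.log(comp1 + comp2)

def independence_sampler(n_iter, mu=-2.5, var=15.0, seed=0):
    """Independence Sampler: proposal N(mu, var), independent of the current state."""
    rng = np.random.default_rng(seed)
    log_q = lambda z: -(z - mu) ** 2 / (2 * var)      # proposal log-density up to a constant
    x, chain = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        y = rng.normal(mu, np.sqrt(var))
        # acceptance ratio f(y) q(x) / (f(x) q(y)), computed on the log scale
        log_ratio = log_mixture(y) + log_q(x) - log_mixture(x) - log_q(y)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain[t] = x
    return chain

def rwmh(n_iter, var=15.0, seed=0):
    """Random Walk Metropolis-Hastings: proposal N(x, var) centred at the current state."""
    rng = np.random.default_rng(seed)
    x, chain = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        y = rng.normal(x, np.sqrt(var))
        if np.log(rng.uniform()) < log_mixture(y) - log_mixture(x):   # symmetric proposal
            x = y
        chain[t] = x
    return chain

chain_is, chain_rw = independence_sampler(5000), rwmh(5000)
```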

Target density f is not known completely
In most real situations the density $f$ is not known analytically. This is the case, for example, in the Bayesian context, where $f$ is the posterior density; $f$ is then written as $f = c\varphi$ with $c$ an unknown constant. Here the target density is known up to a constant: $f(x) \propto \exp(-x^2)(2 + \sin(5x) + \sin(2x))$. We now present the Adaptive Metropolis (AM) strategy proposed by Haario et al. (2001). First recall that the stochastic process generated by this simulation method is not a Markov chain; however, it has good ergodicity properties. The assumptions for this are that the target density is bounded from above and has bounded support. We now describe the algorithm.
The target density has support $E \subset \mathbb R^d$. Suppose that at time $t$ we have sampled the states $X_0, X_1, \ldots, X_{t-1}$, where $X_0$ is the initial state. A candidate point $Y$ is then sampled from the (asymptotically symmetric) proposal distribution $q_t(\cdot\,|\,X_0, \ldots, X_{t-1})$, which may now depend on the whole history $(X_0, X_1, \ldots, X_{t-1})$. The candidate point $Y$ is accepted with probability $\min\{1, f(Y)/f(X_{t-1})\}$, in which case we set $X_t = Y$, and otherwise $X_t = X_{t-1}$. Observe that the chosen acceptance probability resembles the familiar acceptance probability of the Metropolis algorithm. The proposal distribution $q_t(\cdot\,|\,X_0, \ldots, X_{t-1})$ employed in the AM algorithm is a Gaussian distribution with mean at the current point $X_{t-1}$ and covariance
\[
C_t = S_d\,\operatorname{Cov}(X_0, \ldots, X_{t-1}) + S_d\,\varepsilon\, I_d,
\]
where $S_d$ is a parameter that depends only on the dimension $d$ and $\varepsilon > 0$ is a constant that we may choose very small compared to the size of $E$. Here $I_d$ denotes the $d$-dimensional identity matrix. The covariance $C_t$ may be viewed as a function of $t$ variables from $\mathbb R^d$ taking values in the uniformly positive definite matrices.
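A sketch of the AM algorithm in code (illustrative; we assume the usual scaling $S_d = 2.4^2/d$ from Haario et al. (2001), the pre-adaptation proposal variance 5 and switch time $t = 15$ mentioned in the discussion below, and the target $f(x) \propto \exp(-x^2)(2+\sin(5x)+\sin(2x))$ given above):

```python
import numpy as np

def adaptive_metropolis(log_target, x0, n_iter=5000, t0=15, eps=1e-6, seed=0):
    """Sketch of the Adaptive Metropolis algorithm of Haario et al. (2001).

    Assumptions of ours: one-dimensional state, proposal variance 5 before adaptation
    starts at iteration t0 (as in the discussion below), the usual scaling
    S_d = 2.4^2 / d, and C_t = S_d * Cov(X_0, ..., X_{t-1}) + S_d * eps * I_d.
    """
    rng = np.random.default_rng(seed)
    d = 1
    s_d = 2.4 ** 2 / d
    chain = [np.asarray(x0, dtype=float)]
    for t in range(1, n_iter):
        x = chain[-1]
        if t <= t0:
            cov = np.array([[5.0]])                   # fixed proposal variance before adaptation
        else:
            hist = np.array(chain)
            cov = s_d * np.cov(hist.T).reshape(d, d) + s_d * eps * np.eye(d)
        y = rng.multivariate_normal(mean=x, cov=cov)
        # Metropolis-type acceptance probability min(1, f(y)/f(x)), on the log scale
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            chain.append(y)
        else:
            chain.append(x)
    return np.array(chain)

# Example with the target of this section, known up to a constant
log_f = lambda x: -x[0] ** 2 + np.log(2 + np.sin(5 * x[0]) + np.sin(2 * x[0]))
samples = adaptive_metropolis(log_f, x0=[0.0])
```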
For these two simulation methods (AM and RWMH) we see that the respective divergence measures stay very close to each other and all tend to 0 very quickly (Fig. 3). Indeed, near 0 the divergence measures, which here are the Hellinger divergences $D_{1/2}$, oscillate slightly and intersect.

Two-dimensional case
We consider here only the case of a target known up to a constant. We take a sample $X_1, \ldots, X_n$ which are i.i.d. with $X_i \sim N(m, \sigma^2)$. The prior distributions are $m \sim N(m_0, \sigma_0^2)$ and $\sigma^2 \sim IG(\alpha, \beta)$; the full posterior density is known up to a constant, and the conditional distributions of the parameters are available (a sketch of the corresponding Gibbs conditionals is given below). Let us then compare, firstly, RWMH and the Gibbs sampler and, secondly, RWMH and Metropolis Within Gibbs. $\Pi(m, \sigma^2 | x)$ will be our target density.
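A sketch of the corresponding Gibbs sampler (illustrative; the full conditionals written in the comments are the standard conjugate ones, derived by us, and the hyperparameter values are arbitrary):

```python
import numpy as np

def gibbs_normal(x, m0=0.0, s0_sq=10.0, a=2.0, b=2.0, n_iter=2000, seed=0):
    """Sketch of the Gibbs sampler for the model of this section:
    x_i ~ N(m, sigma^2), m ~ N(m0, s0_sq), sigma^2 ~ IG(a, b).

    The full conditionals below are the standard conjugate ones (our derivation):
      m | sigma^2, x   ~ N( (m0/s0_sq + n*xbar/sigma^2) / (1/s0_sq + n/sigma^2),
                            1 / (1/s0_sq + n/sigma^2) )
      sigma^2 | m, x   ~ IG( a + n/2, b + sum((x_i - m)^2) / 2 )
    Hyperparameter values are arbitrary choices for the sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    m, sigma_sq = xbar, x.var() + 1e-6                  # initialisation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        prec = 1.0 / s0_sq + n / sigma_sq
        mean = (m0 / s0_sq + n * xbar / sigma_sq) / prec
        m = rng.normal(mean, np.sqrt(1.0 / prec))
        shape = a + n / 2.0
        scale = b + 0.5 * np.sum((x - m) ** 2)
        sigma_sq = 1.0 / rng.gamma(shape, 1.0 / scale)  # inverse-gamma draw
        draws[t] = (m, sigma_sq)
    return draws

# Example: data from N(1, 4), then draws approximating Pi(m, sigma^2 | x)
rng = np.random.default_rng(1)
data = rng.normal(1.0, 2.0, size=100)
posterior_draws = gibbs_normal(data)
```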

a) RWMH -Gibbs sampler
While RWMH applies in any dimension $d \ge 1$, the Gibbs sampler only applies in dimension $d > 1$; here we limit ourselves to dimension 2. Note that this method (the Gibbs sampler) was used by Geman (1984) to generate observations from a Gibbs distribution (Boltzmann distribution). It is a particular form of MCMC which, because of its effectiveness, is widely used in many fields of Bayesian analysis. Thus, to simulate from a probability density $f(\theta)$ with $\theta = (\theta_1, \ldots, \theta_p)$, one can use the following idea. Initialisation: generate a vector $\theta = (\theta_1, \ldots, \theta_p)$ according to an initial proposal law $\Pi_0$.
We then simulate following the conditional distributions of the target density (see the sketch given after the model description above). Looking at Figure 4, the curve corresponding to the RWMH strategy is well above the one representing the Gibbs sampler; the two curves have only their origin in common. As mentioned in the legend, the dashed curve is associated with RWMH while the solid curve is associated with the Gibbs sampler. The curves represent divergence measures; we will come back to them in Section 6 (Discussion).

b) RWMH - Metropolis Within Gibbs

Metropolis Within Gibbs is a hybrid simulation method that combines steps of the Gibbs sampler and of the Metropolis-Hastings method. It is used in cases where some conditional distributions cannot be sampled from directly. There are several versions of this sampler; we present the following one (a sketch is given below).
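A sketch of such a version for the Gaussian model of this section (illustrative; the random-walk step size and hyperparameters are our choices):

```python
import numpy as np

def log_post(m, sigma_sq, x, m0=0.0, s0_sq=10.0, a=2.0, b=2.0):
    """Log of the full posterior Pi(m, sigma^2 | x), known up to a constant
    (illustrative; hyperparameter values are our choices)."""
    if sigma_sq <= 0:
        return -np.inf
    n = x.size
    loglik = -n / 2 * np.log(sigma_sq) - np.sum((x - m) ** 2) / (2 * sigma_sq)
    logprior = -(m - m0) ** 2 / (2 * s0_sq) - (a + 1) * np.log(sigma_sq) - b / sigma_sq
    return loglik + logprior

def metropolis_within_gibbs(x, n_iter=2000, step=0.5, seed=0):
    """Sketch of a Metropolis-within-Gibbs sampler that, as in the text, applies a
    random-walk Metropolis step to every coordinate instead of sampling each
    conditional exactly (the step size is an assumption of ours)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    m, sigma_sq = x.mean(), x.var()
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # RWMH step targeting the conditional of m given sigma^2
        m_prop = m + step * rng.normal()
        if np.log(rng.uniform()) < log_post(m_prop, sigma_sq, x) - log_post(m, sigma_sq, x):
            m = m_prop
        # RWMH step targeting the conditional of sigma^2 given m
        s_prop = sigma_sq + step * rng.normal()
        if np.log(rng.uniform()) < log_post(m, s_prop, x) - log_post(m, sigma_sq, x):
            sigma_sq = s_prop
        draws[t] = (m, sigma_sq)
    return draws
```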
Starting from a common point, the two curves stay away from the value 0 (Fig. 5). The solid curve is associated with the Metropolis Within Gibbs strategy and the dashed curve with the RWMH strategy. Here we use the Rényi divergence measure to obtain our curves.

Discussion

Independence Sampler: comparison of proposal densities
The two curves represent the chi-square divergence $D_2$. The dashed curve indicates the divergence between the densities $p_1^n$ from the IS strategy with proposal law $N(0, 3)$ and the target density $f$ of $N(0, 1)$, denoted $D_2(p_1^n, f)$. The solid curve denotes the divergence between the densities $p_2^n$ produced by the IS strategy with proposal distribution $N(-3, 2)$ and the target distribution $N(0, 1)$, denoted $D_2(p_2^n, f)$.
It is clear that $D_2(p_1^n, f)$ and $D_2(p_2^n, f)$ are functions of the number of iterations $n$ and have the same origin. The fact that they have the same starting point is explained by the choice of starting from the same state $x_0$ (or the same initial law $\pi_0$) when comparing two simulation strategies. The two curves overlap up to the first iteration. The IS strategy with $N(0, 3)$ is more efficient because this distribution has mean $m = 0$, like the target distribution, and its variance $\sigma^2 = 3$ is greater than the variance of the target distribution, which is $\sigma^2 = 1$; it follows that its support covers the support of the target density. The other strategy has proposal distribution $N(-3, 2)$. This law has mean $m = -3$ and variance $\sigma^2 = 2$, so that its density function is shifted to the left (see the figure) and is therefore not adequate to cover the support of the target density. Consequently, the chain generated by this simulation strategy is slow to converge to the target distribution.

Independence Sampler -Random Walk of Metropolis Hastings
We said earlier that the two curves in Figure 2 converge rapidly to 0. Here too we use the chi-square divergence. The dashed curve represents the divergence measure between the densities $p^n$ of the RWMH strategy and the target density $f(x) = 0.4 f_1(x; -8, 2) + 0.6 f_2(x; 0, 6)$, which is a Gaussian mixture density. Similarly, the solid curve represents the divergence measure between the densities $p^n$ from the IS strategy and the target density $f(x)$.
Although the two divergence measures converge rapidly to 0, the divergence associated with the IS strategy stays below the one associated with the RWMH strategy. Therefore the IS strategy is more efficient, even though RWMH is also a good method.
Note that the two curves have the same starting point and overlap until the first iteration. From there, the curve of the IS strategy approaches 0 much more closely. This is due to the fact that this method (IS) has, here, a good proposal distribution $N(-2.5, 15)$: with mean $m = -2.5$ and variance $\sigma^2 = 15$, its density is well centered relative to the target density $f(x)$, so the support of the target density is well covered by the proposal density.
Regarding the RWMH strategy, it has at each iteration $n$ a normal proposal distribution with mean equal to the current state $X_n = x_n$ and variance also equal to $\sigma^2 = 15$. Thus, even if its support does not cover the target distribution's support at all times, it never strays far from it. Hence this is also a good strategy.

Adaptive Metropolis -RWMH
For this comparison we use the Hellinger divergence measure ($D_{1/2}$). The solid curve describes the divergence measure between the densities $p^n$ (AM) and the target density $f(x)$; the dashed curve describes the divergence between the densities $p^n$ (RWMH) and the same target density. The two curves merge between the initial state and the first iteration. From there, they do not move away from each other, but remain together around the value 0. Thus the two corresponding simulation strategies are both very efficient.
The effectiveness of these two strategies is explained, first concerning the AM strategy, by the fact that it adapts its proposal density to the target density. The adjustment mechanism acts on the variance of the instrumental density: if $X_1, X_2, \ldots, X_{n-1}$ have already been simulated and one wishes to obtain the point (or vector) $X_n$ at time $n$, it is generated with a proposal distribution whose mean equals the current state $X_{n-1}$ and whose covariance matrix is the one described above.
Recall that we use the AM strategy here in one dimension. Thus the variance $\sigma^2 = 5$ chosen for our instrumental Gaussian between the initial time and time $t = 15$ (the period prior to the iterative update of the variance) yields samples that cover almost the whole support of the target density. In addition, from the 16th iteration on, the adjustments of the variance produce good samples, which explains the very fast convergence. Concerning the RWMH method, with the same variance $\sigma^2 = 5$ for the proposal density, we also obtain good samples that converge very quickly to the stationary density.

Gibbs Sampler -RWMH
As we said earlier, Figure 4 shows two curves representing measures of the Tsallis divergence with $\alpha = 0.99$. The dashed curve (RWMH) is above the other and remains far from the value 0 after the first 50 iterations, showing that the RWMH strategy is ineffective here. This inefficiency is due to the covariance matrix $M$ of the proposal distribution: this matrix, which is an implementation parameter, is not optimal. The solid curve (Gibbs sampler) tends rapidly to 0 (after the 7th iteration). The strength of this sampler is mainly due to the fact that the components of the vector $X_n$ are generated directly from the conditional distributions of the target distribution.

Metropolis Within Gibbs -RWMH
The Rényi divergence measures between the respective densities $p^n$ of the two strategies (MWG and RWMH) and the stationary distribution stay away from the value 0 even after a large number of simulations, which leads us to say that the two strategies are not effective.
As in the previous comparison (Gibbs sampler vs RWMH), the RWMH strategy remains slow. In fact it is the same strategy, with the instrumental Gaussian distribution with parameter matrix $M$, but evaluated here with the Rényi divergence measure with $\alpha = 0.3$; we thereby see that we have a wide range of divergence measures for evaluating a given simulation strategy. Regarding the MwG method in our case, even after 30,000 iterations the generated process (Markov chain) does not converge (Fig. 5), even though we limit the display here to 1,000 iterations.
This method (MwG) is often used when one wants to use the Gibbs sampler but cannot simulate directly from some of the conditional distributions. One then introduces into the algorithm a few RWMH steps, with instrumental distributions whose target laws are the conditional distributions. In our case (Fig. 5) we used a version that systematically applies Metropolis steps to all the conditional distributions, which is why the algorithm is slow to converge.

Conclusion
We have thus shown that with various divergence measures we can compare two different simulation strategies. To achieve this, we used the α-divergence measures described in Section 2. We showed the convergence of these divergence measures and proposed estimators for them (Section 3). In Section 4 we gave the asymptotic distribution of each estimator. Then we gave some examples of the implementation of the simulation strategies that have been described (Section 5). Finally, in Section 6, we discussed the reasons why one simulation strategy is more efficient than another.