Adaptive Mixture of Student-t Distributions as a Flexible Candidate Distribution for Efficient Simulation: The R Package AdMit

This paper presents the R package AdMit which provides functions to approximate and sample from a certain target distribution given only a kernel of the target density function. The core algorithm consists in the function AdMit which fits an adaptive mixture of Student-t distributions to the density of interest via its kernel function. Then, importance sampling or the independence chain Metropolis- Hastings algorithm are used to obtain quantities of interest for the target density, using the fitted mixture as the importance or candidate density. The estimation procedure is fully automatic and thus avoids the time-consuming and difficult task of tuning a sampling algorithm. The relevance of the package is shown in two examples. The first aims at illustrating in detail the use of the functions provided by the package in a bivariate bimodal distribution. The second shows the relevance of the adaptive mixture procedure through the Bayesian estimation of a mixture of ARCH model fitted to foreign exchange log-returns data. The methodology is compared to standard cases of importance sampling and the Metropolis-Hastings algorithm using a naive candidate and with the Griddy-Gibbs approach.


Introduction
In scientific analysis one is usually interested in the effect of one variable, say, education (= x), on an other variable, say, earned income (= y). In the standard linear regression model this effect of x on y is assumed constant, i.e., E(y) = βx, with β a constant. The uncertainty of many estimators of β is usually represented by a symmetric Student-t density (see, e.g., Heij et al. 2004, Chapter 3). However, in many realistic models the effect of x on y is a function of several deeper structural parameters. In such cases, the uncertainty of the estimates of β may be rather non-symmetric. More formally, in a Bayesian procedure, the target or posterior density may exhibit rather non-elliptical shapes (see, e.g., Hoogerheide et al. 2007;Hoogerheide and van Dijk 2008b). Hence, in several cases of scientific analysis, one deals with a target distribution that has very non-elliptical contours and that it is not a member of a known class of distributions. Therefore, there exists a need for flexible and efficient simulation methods to approximate such target distributions.
This article illustrates the adaptive mixture of Student-t distributions (AdMit) procedure (see Hoogerheide 2006;Hoogerheide et al. 2007;Hoogerheide and van Dijk 2008b, for details) and presents its R implementation (R Development Core Team 2008) with the package AdMit . The AdMit procedure consists of the construction of a mixture of Student-t distributions which approximates a target distribution of interest. The fitting procedure relies only on a kernel of the target density, so that the normalizing constant is not required. In a second step this approximation is used as an importance function in importance sampling or as a candidate density in the independence chain Metropolis-Hastings (M-H) algorithm to estimate characteristics of the target density. The estimation procedure is fully automatic and thus avoids the difficult task, especially for non-experts, of tuning a sampling algorithm. The R package AdMit is available from the Comprehensive R Archive Network at http: //CRAN.R-project.org/package=AdMit.
In a standard case of importance sampling or the independence chain M-H algorithm, the candidate density is unimodal. If the target distribution is multimodal then some draws may have huge weights in the importance sampling approach and a second mode may be completely missed in the M-H strategy. As a consequence, the convergence behavior of these Monte Carlo integration methods is rather uncertain. Thus, an important problem is the choice of the importance or candidate density, especially when little is known a priori about the shape of the target density. For both importance sampling and the independence chain M-H, it holds that the candidate density should be close to the target density, and it is especially important that the tails of the candidate should not be thinner than those of the target. Hoogerheide (2006) and Hoogerheide et al. (2007) mention several reasons why mixtures of Student-t distributions are natural candidate densities. First, they can provide an accurate approximation to a wide variety of target densities, with substantial skewness and high kurtosis. Furthermore, they can deal with multi-modality and with non-elliptical shapes due to asymptotes. Second, this approximation can be constructed in a quick, iterative procedure and a mixture of Student-t distributions is easy to sample from. Third, the Student-t distribution has fatter tails than the Normal distribution; especially if one specifies Student-t distributions with few degrees of freedom, the risk is small that the tails of the candidate are thinner than those of the target distribution. Finally, Zeevi and Meir (1997) showed that under certain conditions any density function may be approximated to arbitrary accuracy by a convex combination of basis densities; the mixture of Student-t distributions falls within their framework. One sufficient condition ensuring the feasibility of the approach is that the target density function is continuous on a compact domain. It is further allowed that the target density is not defined on a compact set, but with tails behaving like a Student-t distribution. Furthermore, it is even allowed that the target tends to infinity at a certain value as long as the function is square integrable. In practice, a non-expert user sometimes does not know whether the necessary conditions are satisfied. However, one can check the behaviour of the relative numerical efficiency as robustness check; if the necessary conditions are not satisfied, this will tend to zero as the number of draws increases (even if the number of components in the approximation becomes larger). Obviously, if the provided target density kernel does not correspond to a proper distribution, the approximation will not converge to a sensible result. These cases of improper distributions should be discovered before starting a Monte Carlo simulation.
The R package AdMit consists of three main functions: AdMit, AdMitIS and AdMitMH. The first one allows the user to fit a mixture of Student-t distributions to a given density through its kernel function. The next two functions perform importance sampling and independence chain M-H sampling using the fitted mixture estimated by AdMit as the importance or candidate density, respectively. To illustrate the use of the package, we first apply the AdMit methodology to a bivariate bimodal distribution. We describe in detail the use of the functions provided by the package and document the relevance of the methodology to reproduce the shape of non-elliptical distributions. Second, we consider an empirical application with the Bayesian estimation of a mixture of ARCH model applied to foreign exchange log-returns, and show the relevance of the AdMit methodology compared to standard procedures such as using a unimodal candidate in importance and M-H sampling or the Griddy-Gibbs algorithm. In particular, we illustrate that it is worthwhile to invest some computing time in an accurate importance or candidate density. This investment may become profitable in the sense of much quicker convergence or more reliable sampling results, especially to depict the parameter uncertainty in the tails of the joint posterior distribution.
The outline of the paper is as follows: Section 2 presents the principles of the AdMit algorithm. Section 3 presents the functions provided by the package with an illustration of a bivariate non-elliptical distribution. Section 4 compares the performance of the AdMit approach with standard strategies in a mixture of ARCH(1) model. Section 5 concludes.

Adaptive mixture of Student-t distributions
The adaptive mixture of Student-t distributions method developed in Hoogerheide (2006) and Hoogerheide et al. (2007) constructs a mixture of Student-t distributions in order to approximate a given target density p(θ) where θ ∈ Θ ⊆ R d . The density of a mixture of Student-t distributions can be written as: where η h (h = 1, . . . , H) are the mixing probabilities of the Student-t components, 0 η h 1 (h = 1, . . . , H), H h=1 η h = 1, and t d (θ | µ h , Σ h , ν) is a d-dimensional Student-t density with mode vector µ h , scale matrix Σ h , and ν degrees of freedom: The adaptive mixture approach determines H, η h , µ h and Σ h (h = 1, . . . , H) based on a kernel function k(θ) of the target density p(θ). It consists of the following steps: Step 0 -Initial step Compute the mode µ 1 and scale Σ 1 of the first Student-t distribution in the mixture as µ 1 = arg max θ∈Θ log k(θ), the mode of the log kernel function, and Σ 1 as minus the Hessian of log k(θ) evaluated at its mode µ 1 . Then draw a set of N s points θ [i] (i = 1, . . . , N s ) from this first stage candidate density q(θ) = t d (θ | µ 1 , Σ 1 , ν), with small ν to allow for fat tails.
Comment: In the rest of this paper, we use Student-t distributions with one degrees of freedom (i.e., ν = 1) since: 1. it enables the method to deal with fat-tailed target distributions; 2. it makes it easier for the iterative procedure to detect modes that are far apart.
After that add components to the mixture, iteratively, by performing the following steps: Step 1 -Evaluate the distribution of weights Compute the importance sampling weights . . , N s . In order to determine the number of components H of the mixture we make use of a simple diagnostic criterion: the coefficient of variation, i.e., the standard deviation divided by the mean, of the importance sampling weights {w(θ [i] ) | i = 1, . . . , N s }. If the relative change in the coefficient of variation of the importance sampling weights caused by adding one new Student-t component to the candidate mixture is small, e.g., less than 10%, then the algorithm stops and the current mixture q(θ) is the approximation. Otherwise, the algorithm goes to step 2.
Comment: Notice that q(θ) is a proper density, whereas k(θ) is a density kernel. So, the procedure does not provide an approximation to the kernel k(θ) but provides an approximation to the density of which k(θ) is a kernel.
Comment: There are several reasons for using the coefficient of variation of the importance sampling weights. First, it is a natural, intuitive measure of quality of the candidate as an approximation to the target. If the candidate and the target distributions coincide, all importance sampling weights are equal, so that the coefficient of variation is zero. For a poor candidate that not even roughly approximates the target, some importance sampling weights are huge while most are (almost) zero, so that the coefficient of variation is high. The better the candidate approximates the target, the more evenly the weight is divided among the candidate draws, and the smaller the coefficient of variation of the importance sampling weights. Second, Geweke (1989) argues that a reasonable objective in the choice of an importance density is the minimization of: or equivalently, the minimization of the coefficient of variation: since: does not depend on q(θ).
The reason for quoting the coefficient of variation rather than the standard deviation is that the standard deviation of the scaled weights (i.e., adding up to one) depends on the number of draws, whereas the standard deviation of the unscaled weights depends on the scaling constant k(θ)dθ (i.e., typically the marginal likelihood). The coefficient of variation of the importance sampling weights, which is equal for scaled and unscaled weights, reflects the quality of the candidate as an approximation to the target (not depending on number of draws or k(θ)dθ). The coefficient of variation is the function one would minimize if one desires to estimate P(θ ∈ D), where D ⊂ Θ, if the true value is P(θ ∈ D) = 0.5. Different functions should be minimized for different quantities of interest. However, it is usually impractical to perform a separate tuning algorithm for the importance density for each quantity of interest. Fortunately, in practice the candidate resulting from the minimization of the coefficient of variation performs well for estimating common quantities of interest such as posterior moments. Hoogerheide and van Dijk (2008a) propose a different approach for forecasting extreme quantiles where one substantially improves on the usual strategy by generating relatively far too many extreme candidate draws.
Step 2a -Iterate on the number of components Add another Student-t distribution with density t d (θ | µ h , Σ h , ν) to the mixture with µ h = arg max θ∈Θ log w(θ) and Σ h equal to minus the inverse Hessian of log w(θ). Here, q(θ) denotes the density of the mixture of (h − 1) Student-t distributions obtained in the previous iteration of the procedure. An obvious initial value for the maximization procedure for computing µ h is the point θ Comment: There are several reasons for the use of minus the inverse Hessian of log w(θ) as the scale matrix for the new component. First, suppose that k(θ) is a posterior kernel under a flat prior, and that the first candidate distribution would be a uniform distribution (or a Student-t with a huge scale matrix). Then log w(θ) takes its maximum likelihood estimator and minus its inverse Hessian is an asymptotically valid estimate for the maximum likelihood estimator's covariance matrix. Second, since log w(θ) takes its maximum at µ h , its Hessian is negative definite (unless it is located at a boundary, in which case we do not use this scale matrix). Therefore, minus the inverse Hessian is a positive definite matrix that can be used as a covariance or scale matrix. Moreover, we want to add candidate probability mass to those areas of the parameter space where w(θ) is relatively high, i.e., where there is relatively little candidate probability mass. This is the reason for choosing the mode µ h of the new candidate component at the maximum of w(θ). Especially in those directions where w(θ) decreases slowy (i.e., moving away from µ h ) we want to add candidate probability mass also further away from µ h . This is reflected by larger elements of minus the inverse Hessian of log w(θ) at µ h . Note that w(θ) is generally not a kernel of a proper density on Θ. However, we also do not require this. We only make use of its local behaviour around its maximum at µ h , reflected by minus the inverse Hessian of log w(θ). That is, we specify a Student-t distribution that locally behaves the same as the ratio w(θ).
Comment: To improve the algorithm's ability to detect distant modes of a multimodal target density we consider one additional initial value for the optimization and we use the point corresponding to the highest value of the weight function among the two optima as the mode µ h of the new component in the candidate mixture.
Comment: Minimization of (1) is time consuming. The reason is that this concerns the optimization of a non-linear function of η h (h = 1, . . . , H) where H takes the values 2, 3, . . . in the consecutive iterations of the algorithm. Evaluating the function itself requires already N H evaluations of the kernel and N H 2 evaluations of the Student-t densities. The computation of (analytically evaluated) derivatives of the function with respect to η h (h = 1, . . . , H) takes even more time. One way to reduce the amount of computing time required for the construction of the approximation is to use different numbers of draws in different steps. One can use a relatively small sample of N p draws for the optimization of the mixing probabilities and a large sample of N s draws in order to evaluate the quality of the current candidate mixture at each iteration (in the sense of the coefficient of variation of the corresponding importance sampling weights) and in order to obtain an initial value for the algorithm that is used to optimize the weight function (that yields the mode of a new Student-t component in the mixture). Note that it is not necessary to find the globally optimal values of the mixing probabilities; a good approximation to the target density is all that is required.
Step 2c -Draw from the mixture Draw a sample of N s points θ , and go to step 1; in order to draw a point from the density q(θ) first use a draw from the uniform distribution U(0, 1) to determine which component t d (θ | µ h , Σ h , ν) is chosen, and then draw from this d-dimensional Student-t distribution.
Comment: It may occur that one is dissatisfied with diagnostics like the coefficient of variation of the importance sampling weights corresponding to the final candidate density resulting from the procedure above. In that case the user may start all over again the procedure with a larger number of points N s . The idea behind this strategy is that the larger N s , the easier it is for the method to detect and approximate the shape of the target density kernel, and to specify the Student-t distributions of the mixture adequately.
If the region of integration Θ ⊆ R d is bounded, it may occur in step 2 that w(θ) attains its maximum at a boundary of the integration region. In this case minus the inverse Hessian of log w(θ) evaluated at its mode µ h may be a very poor scale matrix; in fact this matrix may not even be positive definite. In such situations, µ h and Σ h are obtained as the estimated mean and covariance based on a subset of draws corresponding to a certain percentage of largest weights. More precisely, µ h and Σ h are obtained using the sample {θ [i] | i = 1, . . . , N s } from q(θ) we already have: where J c denotes the set of indices corresponding to the c percents of the largest weights in the sample {w(θ [i] ) | i = 1, . . . , N s }. Since our aim is to detect regions with too little candidate probability mass (e.g., a distant mode), the percentage c is typically a low value, i.e., 5%, 15% or 30%. Moreover, the estimated Σ h can be scaled by a given factor for robustness. Different percentages and scaling factors could be used together, leading to different coefficients of variation at each step of the adaptive procedure. The matrix leading to the smallest coefficient of variation could then be selected as the scale matrix Σ h for the new mixture component.
Once the adaptive mixture of Student-t distributions has been fitted to the target density p(θ) through the kernel function k(θ), the approximation q(θ) is used in importance sampling or in the independence chain Metropolis-Hastings (M-H) algorithm to obtain quantities of interest for the target density p(θ) itself.

Background on importance sampling
Importance sampling, due to Hammersley and Handscomb (1965), was introduced in econometrics and statistics by Kloek and van Dijk (1978). It is based on the following relationship: where g(θ) is a given (integrable with respect to p) function, w(θ) = k(θ)/q(θ), E p denotes the expectation with respect to the target density p(θ) and E q denotes the expectation with respect to the (importance) approximation q(θ). The importance sampling estimator of E p g(θ) is then obtained as the sample counter-part of the right-hand side of (3): where {θ [i] | 1, . . . , N } is a sample of draws from the importance density q(θ). Under certain conditions (see Geweke 1989), g is a consistent estimator of E p g(θ) . The choice of the function g(θ) allows to obtain different quantities of interest for p(θ). For instance, the mean estimate of p(θ), denoted by θ, is obtained with g(θ) = θ; the covariance matrix estimate is obtained using g(θ) = (θ − θ)(θ − θ) ; the estimated probability that θ belongs to a domain D ⊆ Θ using g(θ) = I {θ∈D} , where I {•} denotes the indicator function which is equal to one if the constraint holds and zero otherwise.

Background on the independence chain Metropolis-Hastings algorithm
The Metropolis-Hastings (M-H) algorithm is a Markov chain Monte Carlo (MCMC) approach that has been introduced by Metropolis et al. (1953) and generalized by Hastings (1970). MCMC methods construct a Markov chain converging to a target distribution p(θ). After a burn-in period, which is required to make the influence of initial values negligible, draws from the Markov chain are considered as (correlated) draws from the target distribution itself.
In the independence chain M-H algorithm, a Markov chain of length N is constructed by the following procedure. First, one chooses a feasible initial state θ [0] . Then, one repeats the following steps N times (for i = 1, . . . , N ). A candidate value θ is drawn from the candidate density q(θ ) and a random variable U is drawn from the uniform distribution U(0, 1). Then the acceptance probability: , θ ), the transition to the candidate value is accepted, i.e., θ [i] = θ . Otherwise the transition is rejected, and the next state is again θ

Illustration I: The Gelman-Meng distribution
This section presents the functions provided by the R package AdMit with an illustration of a bivariate bimodal distribution. This distribution belongs to the class of conditionally Normal distributions proposed by Gelman and Meng (1991) with the property that the joint density is not Normal. In the notation of the previous section, we have θ = (X 1 X 2 ) .
Let X 1 and X 2 be two random variables, for which X 1 is Normally distributed given X 2 and vice versa. Then, the joint distribution, after location and scale transformations in each variable, can be written as (see Gelman and Meng 1991): where A, B, C 1 and C 2 are constants. Equation (5) can be rewritten as: so the term Ax 2 1 x 2 2 causes deviations from the bivariate Normal distribution. In what follows, we consider the symmetric case in which A = 1, B = 0, C 1 = C 2 = 3. The core function provided by the R package AdMit is the function AdMit. The arguments of the function are the following: KERNEL is a kernel function k(θ) of the target density p(θ) on which the approximation is constructed. This function must contain the logical argument log. When log = TRUE, the function KERNEL returns the (natural) logarithm value of the kernel function; this is used for numerical stability. mu0 is the starting value of the first stage optimization µ 1 = arg max θ∈Θ log k(θ); it is a vector whose length corresponds to the length of the first argument in KERNEL. If one experiences misconvergence of the first stage optimization, one could first use an alternative (robust) optimization algorithm and use its output for mu0. For instance, the DEoptim function provided by the R package DEoptim (Ardia 2007) performs the optimization (minimization) of a function using an evolutionary (genetic) approach. Sigma0 is the (symmetric positive definite) scale matrix of the first component. If a matrix is provided by the user, then it is used as the scale matrix of the first component and mu0 is used as the mode of the first component. control is a list of tuning parameters. The most important parameters are: Ns (default: 1e+05), the number of draws used for evaluating the importance sampling weights; Np (default: 1e+03), the number of draws used for optimizing the mixing probabilities; CVtol (default: 0.1), the tolerance of the relative change of the coefficient of variation; df (default: 1), the degrees of freedom of the mixture components; Hmax (default: 10), the maximum number of components of the mixture; IS (default: FALSE), indicates if the scale matrices Σ h should always be estimated by importance sampling as in (2) without first trying to compute minus the inverse Hessian; ISpercent (default: c(0.05, 0.15, 0.30)), a vector of percentages of largest weights used in the importance sampling approach; ISscale (default: c(1, 0.25, 4)), a vector of scaling factors used to rescale the scale matrix obtained by importance sampling. Hence, when the argument IS = TRUE, nine scale matrices are constructed by default and the matrix leading to the smallest coefficient of variation is selected by the adaptive mixture procedure as Σ h . For details on the other control parameters, the reader is referred to the documentation file of AdMit (by typing ?AdMit). Finally, the last argument of AdMit is ... which allows the user to pass additional arguments to the function KERNEL. In econometric models for instance, the kernel may depend on a vector of observations y = (y 1 · · · y T ) which can be passed to the function KERNEL via this argument.
For the numerical optimization of the mode µ h and the estimation of the scale matrix Σ h (i.e., when the control parameter IS = FALSE), the function optim is used with the option BFGS (the function nlminb cannot be used since it does not estimate the Hessian matrix at optimum). If the optimization procedure does not converge, the algorithm automatically switches to the Nelder-Mead approach which is more robust but slower. If still misconvergence occurs or if the Hessian matrix at optimum is not symmetric positive definite, the algorithm automatically switches to the importance sampling approach for this component.
For the numerical optimization of the mixing probabilities η h (h = 1, . . . , H), we rely on the function nlminb (for speed purposes) and apply the optimization on a reparametrized domain. More precisely, we optimize (H − 1) components in R (H−1) instead of H components in [0, 1] H with the summability constraint H h=1 η h . If the optimization process does not converge, then the algorithm uses the function optim with method Nelder-Mead (or method BFGS for univariate optimization) which is more robust but slower. If still misconvergence occurs, the starting value is kept as the output of the procedure. The starting value corresponds to a mixing probability weightNC for η h while the probabilities η 1 , . . . , η H−1 are the probabilities of the previous mixture scaled by . The control parameter weightNC is set to 0.1 by default, i.e., a 10% probability is assigned to the new mixture component as a starting value. Finally, note that AdMit uses C and analytically evaluated derivatives to speed up the numerical optimization.
Let us now use the function AdMit to find a suitable approximation for the density function p(θ) whose kernel is (5). We set the seed of the pseudo-random number generator to a given number and use the starting value mu0 = c(0.0, 0.1) for the first stage optimization. The result of the function is assigned to the object outAdMit and printed out: R> set.seed(1234) R> outAdMit <-AdMit(KERNEL = GelmanMeng, mu0 = c(0.0, 0.1)) R> print(outAdMit)  The output of the function AdMit is a list. The first component is CV, a vector of length H which gives the value of the coefficient of variation at each step of the adaptive fitting procedure. The second component is mit, a list which consists of four components giving information on the fitted mixture of Student-t distributions: p is a vector of length H of mixing probabilities, mu is a H × d matrix whose rows give the modes of the mixture components, Sigma is a H × d 2 matrix whose rows give the scale matrices (in vector form) of the mixture components and df is the degrees of freedom of the Student-t components. The third component of the list returned by AdMit is summary. This is a data frame containing information on the adaptive fitting procedure: For the kernel function GelmanMeng, the approximation constructs a mixture of four components. The computing time required for the construction of the approximation is 4.4 seconds (see Section 6 for computational details). The value of the coefficient of variation decreases from 4.8224 to 0.8315. A plot of the four-component approximation is displayed in the righthand side of Figure 1. This graph is produced using the function dMit which returns the density of the mixture given by the output outAdMit$mit: R> PlotMit <-function(x1, x2, mit) + { + dMit(cbind(x1, x2), mit = mit, log = FALSE) + } R> z <-outer(x1, x2, FUN = PlotMit, mit = outAdMit$mit) R> image(x1, x2, z, las = 1, col = gray((20:0)/20), + cex.axis = 1.1, cex.lab = 1.2, + xlab = expression(X[1]), ylab = expression(X[2])) R> box() R> abline(a = 0, b = 1, lty = "dotted") The plot suggests that the four-component mixture provides a good approximation of the density function whose kernel is (5). We can also use the mixture information outAdMit$mit to display each of the mixture components separately: R> par(mfrow = c(2,2)) R> for ( abline(a = 0, b = 1, lty = "dotted") + title(main = paste("component nr.", h)) + } Plots of the four components are displayed in Figure 2.
Once the adaptive mixture of Student-t distributions is fitted to the density p(θ) using a kernel k(θ), the approximation q(θ) provided by AdMit is used as the importance sampling density in importance sampling or as the candidate density in the independence chain M-H algorithm.
The first function provided by the R package AdMit which allows to find quantities of interest for the density p(θ) using the output outAdMit$mit of AdMit is the function AdMitIS. This function performs importance sampling using the mixture approximation as the importance density (see Section 2.1). The arguments of the function AdMitIS are the following: N is the number of draws used in importance sampling; KERNEL is a kernel function k(θ) of the target density p(θ); G is the function g(θ) in (3); mit is a list providing information on the mixture approximation (i.e., typically the component mit in the output of the AdMit function); ... allows additional parameters to be passed to the function KERNEL and/or G.
Let us apply the function AdMitIS to the kernel GelmanMeng using the approximation outAdMit$mit: R> set.seed(1234) R> outAdMitIS <-AdMitIS(KERNEL = GelmanMeng, mit = outAdMit$mit) R> print(outAdMitIS) The output of the function AdMitIS is a list. The first component is ghat, the importance sampling estimator of E p g(θ) in (4). This is a vector whose length corresponds to the length of the output of the function G. The second component is NSE, a vector containing the numerical standard errors (i.e., the square root of the variance of the estimates that can be expected if the simulations were to be repeated) of the components of ghat. The third component is RNE, a vector containing the relative numerical efficiencies of the components of ghat (i.e., the ratio between an estimate of the variance of an estimator based on direct sampling and the importance sampling estimator's estimated variance with the same number of draws). RNE is an indicator of the efficiency of the chosen importance function; if target and importance densities coincide, RNE equals one, whereas a very poor importance density will have a RNE close to zero. Both NSE and RNE are estimated by the method given in Geweke (1989). For estimating E p [g(θ)] the N candidate draws are approximately as 'valuable' as RNE × N independent draws from the target would be.
The computing time required to perform importance sampling on GelmanMeng using the fourcomponent mixture outAdMit$mit is 0.7 seconds, where most part of the computing time is required for the N evaluations of the function KERNEL at the sampled values {θ [i] | i = 1, . . . , N }. The true values for E p (X 1 ) and E p (X 2 ) are 1.459. We notice that the importance sampling estimates are close to the true values and we note the good efficiency of the estimation.
By default, the function G is function(theta){theta} so that the function outputs a vector containing the mean estimates for the components of θ. Alternative functions may be provided by the user to obtain other quantities of interest for p(θ). The only requirement is that the function outputs a matrix. For instance, to estimate the covariance matrix of θ, we could define the following function: The second function provided by the R package AdMit which allows to find quantities of interest for the target density p(θ) using the output outAdMit$mit of AdMit is the function AdMitMH. This function uses the mixture approximation as the candidate density in the independence chain M-H algorithm (see Section 2.2). The arguments of the function AdMitMH are the following: N is the length of the MCMC sequence of draws; KERNEL is a kernel function k(θ) of the target density p(θ); mit is a list providing information on the mixture approximation (i.e., traditionally the component mit in the output of the function AdMit); ... allows additional parameters to be passed to the function KERNEL.
In our example, the computing time required to generate a MCMC chain of size N = 1e+05 (i.e., the default value) takes 0.8 seconds. Note that as for the function AdMitIS, the most important part of the computing time is required for evaluations of the KERNEL function. Part of the AdMitMH function is implemented in C in order to accelerate the generation of the MCMC output. The rather high acceptance rate above 50% suggests that the mixture approximates the target density quite well.
The R package coda (Plummer et al. 2008) can be used to check the convergence of the MCMC chain and obtain quantities of interest for p(θ). Here, for simplicity, we discard the first 1,000 draws as a burn-in sample and transform the output outAdMitMH$draws in a mcmc object using the function as.mcmc provided by coda. A summary of the MCMC chain can be obtained using summary: R> draws <-as.mcmc (outAdMitMH$draws[1001:1e5,]) R> colnames(draws) <-c("X1", "X2") R> summary(draws)$stat These relative numerical efficiencies reflect the good quality of the candidate density in the independence chain M-H algorithm.
Finally, note that for more flexibility, the functions AdMitIS and AdMitMH require the arguments N and KERNEL. Therefore, the number of sampled values N in importance sampling or in the independence chain M-H algorithm can be different from the number of draws Ns used to fit the Student-t mixture approximation. In addition, the same mixture approximation can be used for different kernel functions. This can be useful, typically in Bayesian times series econometrics, to update a joint posterior distribution with the arrival of new observations. In this case, the previous mixture approximation (i.e., fitted on a kernel function which is based on T observations) can be used as the candidate density to approximate the updated joint posterior density which accounts for the new observations (i.e., whose kernel function is based on T + k observations where k 1).

Illustration II: Mixture of ARCH(1) model
In this section, we consider the Bayesian estimation of a mixture of ARCH model. We use this example model in order to compare candidate distributions in case of a non-elliptical, four-dimensional posterior distribution in a parameter space with a restricted domain. In particular, we compare the performance of importance sampling and the independence chain M-H algorithm using a candidate density constructed by the function AdMit with a naive (standard) Cauchy distribution. We also consider the Griddy-Gibbs sampler of Ritter and Tanner (1992) as a benchmark.
In this application, we set the control parameter IS = TRUE in the function AdMit. The reason is that the (default) optimization step would quite possibly lead to unreliable scale matrices due to the pronounced restrictions on the parameter space (i.e., in the sense that most of the candidate mass might be outside of the allowed parameter region). Also, note that for a high dimensional distribution, avoiding the optimization step can substantially speed up the algorithm. The results for the four-dimensional highly non-elliptical posterior suggest the method's useful applicability in higher dimensions.
Mixture of ARCH and GARCH models have received a lot of attention in recent years as they provide an explanation for the high persistence in volatility observed with single-regime GARCH models (see, e.g., Lamoureux and Lastrapes 1990). Furthermore, these models allow for a sudden change in the (unconditional) volatility level which may lead to significant improvements in volatility forecasts (see, e.g., Dueker 1997;Klaassen 2002;Marcucci 2005).
A two-component mixture of ARCH(1) model for log-returns {y t } may be written as: where ω 1 , ω 2 > 0, α 0 to ensure a positive conditional variance in each regime; N (0, 1) is the standard Normal distribution. Model specification (6) allows to reproduce the so-called volatility clustering observed in financial returns, i.e., the fact that large changes tend to be followed by large changes (of either sign) and small changes tend to be followed by small changes. Moreover, it allows for sudden changes in the unconditional variance of the process; in the first regime, the unconditional variance is ω 1 /(1 − α) while it is ω 2 /(1 − α) in the second regime, provided that α < 1. We emphasize that model (6) is used for illustrative purposes only. The assumption that the state (high/low volatility) is independent over time is unrealistic and the number of regimes should be investigated. However, checking for possible misspecification of model (6) is beyond the scope of the present paper.
In order to write the likelihood function, we define the vector of observations y = (y 1 · · · y T ) and we regroup the model parameters into the vector θ = (ω 1 ω 2 α p) for notational purposes. The likelihood function of θ is then given by: We use the following proper prior densities on the model parameters: where φ(• | µ, σ) denotes the Normal density with mean µ and standard deviation σ and where we recall that I {•} is the indicator function which equals one if the constraint holds and zero otherwise. In addition, we require prior independence for the model parameters except for ω 1 and ω 2 , where we require ω 1 < ω 2 for identification purposes. The prior constraint on α 1 ensures that the model (6) is covariance-stationary in each regime. A kernel function of the joint posterior distribution is then constructed by combining the likelihood function and the joint prior via the Bayes rule. Details regarding the implementation of the kernel function for this model are provided in the Appendix.
It is important to note that we have a lack of conjugacy between the likelihood function and the joint prior density so that the joint posterior is of unknown form. Moreover, the simple Gibbs sampler cannot be used for this model since the full conditionals are also of unknown form. Alternative estimation techniques are thus required. In what follows, we consider the following strategies: AdMit IS importance sampling using an adaptive mixture of Student-t distributions as the importance density. First use the function AdMit with control parameter IS = TRUE (i.e., the mode and the scale matrix of the Student-t components are estimated with the importance sampling weights, as in (2)). Then perform importance sampling using the function AdMitIS with N = 50000 draws.
AdMit M-H independence chain M-H using an adaptive mixture of Student-t distributions as the candidate density. Use the same mixture approximation as for AdMit IS, but instead of using the function AdMitIS, perform independence chain M-H sampling using the function AdMitMH with N = 51000 draws. The first 1,000 draws are discarded as a burn-in sample.
t 1 IS importance sampling using a Student-t distribution with one degree of freedom (i.e., Cauchy) as the importance density. First use the function AdMit with control parameter Hmax = 1. Then perform importance sampling using the function AdMitIS with N = 50000 draws.
t 1 M-H independence chain M-H using a Student-t distribution with one degree of freedom (i.e., Cauchy) as the candidate density. Use the same approximation as for t 1 IS, but instead of using the function AdMitIS, perform independence chain M-H sampling using the function AdMitMH with N = 51000 draws. The first 1,000 draws are discarded as a burn-in sample.
GG Griddy-Gibbs sampler. Update each parameter by inversion from the full conditional distribution computed on a grid of the parameter space. Use the following grids for the model parameters: ω 1 seq(from = 0.001, to = 0.25, by = 0.002) ω 2 seq(from = 0.001, to = 2.0, by = 0.01) α seq(from = 0.0, to = 0.99, by = 0.008) p seq(from = 0.0, to = 1.0, by = 0.008) The kernel function is evaluated for each parameter in turn for the different values on the grid, and then a new draw is generated using the function sample with the corresponding probabilities (i.e., the normalized kernel values on the grid). Therefore, the approach is not strictly speaking the Griddy-Gibbs of Ritter and Tanner (1992) which consists in updating each parameter by inversion from the full conditional distribution computed by a deterministic integration rule since we generate new draws from a discrete distribution. However, an additional interpolation step would have slowed down even more the generation of the model parameters (which is already very slow as shown later in this section). We generate a chain of length 51,000 and discard the first 1,000 draws as a burn-in sample.
More advanced approaches have been proposed to perform an efficient Bayesian estimation of regime-switching GARCH type models. However, their implementation costs are far from negligible. The interested reader is referred to Ardia (2008) for further detail. Finally, we point out that the permutation sampler of Frühwirth-Schnatter (2001) or the permutationaugmented sampler of Geweke (2007) may be used in the context of mixture models. They are partly used to explore the unconstrained joint posterior distribution in order to find suitable identification constraints. This is not necessary here as we required ω 1 < ω 2 .
We apply our Bayesian estimation methods to daily observations of the Deutschmark vs British Pound (DEM/GBP) foreign exchange log-returns. The sample period is from January 3, 1984, to December 31, 1991, for a total of 1'974 observations. The nominal returns are expressed in percent as in Bollerslev and Ghysels (1996). This data set has been proposed as an informal benchmark for GARCH time series software validation (see, e.g., McCullough and Renfro 1998) and is available from the R package fEcofin (Wuertz 2008) using data("dem2gbp").        obtained with the t 1 approach may suggest that the tails of the marginal posterior for ω 2 is not fully covered by the Student-t candidate. The Griddy-Gibbs sampler is extremely slow (i.e., 3 hours) compared to the adaptive approach (i.e., 7.1 minutes) and the naive approach (i.e., 30 seconds). This illustrates that for complex problems the Griddy-Gibbs is hardly usable as a real-time method. AdMit clearly requires more time than the naive approach (i.e., 14 times slower) because of the time required for fitting the adaptive mixture candidate (i.e., 7 minutes). However, its efficiency is much better, where the largest differences between the strategies are observed for parameter ω 2 . In the importance sampling case, RNE is more than 14 times larger for the AdMit approach. Figure 4 illustrates the differences between both methods. AdMit requires 422 seconds for fitting the mixture candidate but after 40 seconds of sampling it already outperforms (in the sense of a higher precision) the naive approach in estimating the posterior mean of ω 2 .
Regarding the M-H strategy for these candidates, we also notice the better efficiency for Adaptive Mixture of Student-t Distributions: The R Package AdMit Figure 6: Autocorrelation function of the model parameters in the t 1 MH approach (i.e., using a t 1 approximation as the candidate density in the independence chain M-H algorithm).
AdMit. The autocorrelation in the MCMC output for the naive approach is much higher than for AdMit, especially for parameter ω 2 , as illustrated in Figure 5 and Figure 6. We note that both acceptance rates are rather high. In practice, for highly non-elliptical posterior distributions in econometric models, independence chain M-H often leads to acceptance rates below 10%. Apparently, the high autocorrelation observed for the t 1 M-H approach is caused by a too small candidate scale matrix; a lot of draws are generated in a small area of the parameter space which are generally accepted. Incidentally, the t 1 M-H sequence gets stuck in a point far away from the posterior mode (i.e., there occurs a long sequence of rejections) which implies slowly decaying autocorrelations.
The improvement of the AdMit approach over the naive approach is even more clear when focusing on the tails of the joint posterior distribution. On the left-hand side of Figure 7, we present the (natural) logarithm of the four-component mixture density. We note the nonelliptical shape for high values of p where some components of the mixture drag some of the candidate probability mass to the right-hand side of the plot. The right-hand side of  Plot of the (natural) logarithm of the t 1 candidate density. Right: 50,000 draws from the marginal distribution of (ω 2 p) obtained with the t 1 M-H strategy (i.e., using a t 1 approximation as the candidate density in the independence chain M-H algorithm).
the figure displays 50,000 draws for (ω 2 p) generated by the independence chain M-H using the four-component mixture as the candidate density. We notice the banana shape of the marginal distribution of (ω 2 p) . For large values of p, the likelihood has a small information content for parameter ω 2 so that the posterior of ω 2 tends to its diffuse prior. In particular, we can notice a non-negligible number of draws in the quadrant [ω 2 > 1; p > 0.8]. Figure 8 Adaptive Mixture of Student-t Distributions: The R Package AdMit Figure 9: 50,000 draws from the marginal distribution of (ω 2 p) using the Griddy-Gibbs sampler.
presents the same type of graphs for the t 1 candidate. The left-hand side clearly shows the elliptical shape of the candidate density. On the right-hand side, only two draws are located in the quadrant [ω 2 > 1; p > 0.8]. In this case, the naive approach is not able to detect well the mass of the joint posterior in this region. Also, far too few draws are generated in the quadrant [0.8 < ω 2 1; p > 0.8] compared to the AdMit approach. The marginal distribution obtained with the Griddy-Gibbs displayed in Figure 9 underlines the importance of the additional components in reproducing the non-elliptical shapes of the joint posterior. The additional time required by AdMit compared to the naive approach is therefore useful and acts as a way to robustify the Bayesian estimation of this model.
In Table 2, we report the estimated probability P(ω 2 > ω * 2 | p > p * , y) for different values of ω * 2 and p * in the upper-right tail of the marginal distribution for (ω 2 p) . The probabilities are estimated using the 50,000 draws generated by the AdMit M-H, t 1 M-H and Griddy-Gibbs strategies. The 95% confidence intervals (CI) of the estimated probabilities are obtained using a robust estimate of the numerical standard error (i.e., using Time-series SE of the summary function provided by the R package coda). From this table, we notice that the t 1 approximation completely underestimates the probability compared to the Griddy-Gibbs approach. Most of the CI given by this approach are the same due to the small amount of draws in the upper-right quadrant of the marginal distribution. These should obviously be smaller for larger ω * 2 . On the other hand, the CI provided by AdMit M-H overlap the CI of the Griddy-Gibbs in every cases. The probability estimates in the extreme tail are therefore not significantly different between the AdMit M-H approach and the Griddy-Gibbs sampler.  Table 2: Estimation of the probability P(ω 2 > ω * 2 | p > p * , y) for different values of ω * 2 and p * . AdMit M-H: independence chain M-H algorithm using a four-component mixture approximation as the candidate density; t 1 : independence chain M-H algorithm using a Student-t distribution with one degree of freedom as the candidate density; GG: Griddy-Gibbs sampler; 95% CI: 95% confidence intervals of the estimated probability P(ω 2 > ω * 2 | p > p * , y) obtained using robust standard errors (i.e., using Time-series SE of the summary function provided by the R package coda). The number of draws in the joint posterior sample is 50,000 for the three estimation strategies.

Concluding remarks
This paper presented the R package AdMit which provides functions to approximate and sample from a certain target distribution given only a kernel of the target density function. The estimation procedure is fully automatic and thus avoids the time-consuming and difficult task of tuning a sampling algorithm. The relevance of the package has been shown in two examples. The first illustrated in detail the use of the functions provided by the package in a bivariate bimodal distribution. The second showed the relevance of the AdMit procedure through the Bayesian estimation of a mixture of ARCH model fitted to foreign exchange logreturns data. The methodology was compared to standard cases of importance sampling and the Metropolis-Hastings algorithm using a naive candidate and with the Griddy-Gibbs approach. Both for investigating means and tails of the joint posterior distribution the adaptive approach is preferable.
In a recent paper, Hoogerheide and van Dijk (2008b) illustrate the usefulness of the AdMit approach both in a bivariate posterior in an instrumental variable model and in a eightdimensional posterior in a mixture model. We believe that this approach may be applicable in many fields of research and hope that the R package AdMit will be fruitful for many researchers like econometricians or applied statisticians.
Finally, if you use R or AdMit, please cite the software in publications.

Computational details
The results in this paper were obtained using R 2.8.1 (R Development Core Team 2008) with the packages AdMit 1.01-01 , coda 0.13-3 (Plummer et al. 2008), fEcofin 270.75 (Wuertz 2008) and mvtnorm 0.9-3 (Genz et al. 2008). R itself and all packages used are available from CRAN at http://CRAN.R-project.org/. Computations were performed on Adaptive Mixture of Student-t Distributions: The R Package AdMit a Genuine Intel® dual core CPU T2400 1.83Ghz processor. Code outputs were obtained using options(digits = 4, max.print = 40, prompt = "R> "). Since the functions AdMit, AdMitIS and AdMitMH highly rely on evaluations of the function KERNEL, we strongly advise the users to implement this function in a vectorized fashion. Also, implementation in lower-level languages like C or Fortran is possible using the functions .C and .Fortran.

A. Mixture of ARCH(1) model
The implementation of the kernel function for the mixture of ARCH(1) model is realized in two steps. First, the prior (8) is implemented with the function PRIOR. The function PRIOR tests whether the constraints are fulfilled, and outputs a (N s × 2) matrix whose first column indicates if the constraint is satisfied, and the second column returns the value of the prior at the corresponding point: The function PRIOR is coded outside the kernel function to render the program more readable and more flexible (i.e., it is more easy to modify the constraints or the hyperparameters).
The KERNEL function is obtained by combining the prior (8) and the likelihood function (7). We provide here a full implementation in R; an alternative coding, calling C code, is provided in the code of this article (available as v29i03.R along with this paper). The function KERNEL requires as inputs: theta is a (N s × d) matrix of draws, where in our case d = 4; y is a vector of observations. The function returns a vector of N s (natural logarithm) kernel values.