bspmma : An R Package for Bayesian Semiparametric Models for Meta-Analysis

We introduce an R package, bspmma , which implements a Dirichlet-based random eﬀects model speciﬁc to meta-analysis. In meta-analysis, when combining eﬀect estimates from several heterogeneous studies, it is common to use a random-eﬀects model. The usual frequentist or Bayesian models specify a normal distribution for the true eﬀects. However, in many situations, the eﬀect distribution is not normal, e.g., it can have thick tails, be skewed, or be multi-modal. A Bayesian nonparametric model based on mixtures of Dirichlet process priors has been proposed in the literature, for the purpose of accom-modating the non-normality. We review this model and then describe a competitor, a semiparametric version which has the feature that it allows for a well-deﬁned centrality parameter convenient for determining whether the overall eﬀect is signiﬁcant. This second Bayesian model is based on a diﬀerent version of the Dirichlet process prior, and we call it the “conditional Dirichlet model.” The package contains functions to carry out analyses based on either the ordinary or the conditional Dirichlet model, functions for calculat-ing certain Bayes factors that provide a check on the appropriateness of the conditional Dirichlet model, and functions that enable an empirical Bayes selection of the precision parameter of the Dirichlet process. We illustrate the use of the package on two examples, and give an interpretation of the results in these two diﬀerent scenarios.

1. Introduction: Bayesian random effects meta-analysis 1.1. Parametric models In many medical studies, each of m hospitals or centers investigates the same medical issue, which we will think of as being a comparison between a new and an old treatment. Sometimes the results from the m studies are inconsistent: Some studies are favorable to the new treatment, while others are neutral or negative. The goals of a meta-analysis are then to arrive at an overall conclusion regarding the benefits of the new treatment, and also to describe and explain the heterogeneity among the different studies.
Suppose that for each i, center i gathers a summary statistic D i together with a standard error estimateσ i . When heterogeneity among the studies is significant, it is now common to carry out meta-analyses that are based on random effects models, in which for each center i there is a center-specific "true effect," represented by a parameter ψ i , and D i has distribution P i (ψ i ). This distribution depends on ψ i and also on other quantities, such as the sample size as well as nuisance parameters specific to the i-th center. In the most common example in epidemiological studies, ψ i is the log of the odds ratio arising in case-control studies, and D i is either an adjusted log odds ratio based on a logistic regression model that involves relevant covariates, or simply the usual log odds ratio based on a 2 × 2 table. The traditional model for dealing with this kind of situation is the following: In (1b), µ and τ are unknown parameters. (The σ i 's are also unknown but, typically, estimation of these is of secondary interest, and sample sizes are sufficiently large so that using thê σ i 's instead of the σ i 's does not cause any problems.) In a frequentist analysis µ and τ are estimated by maximum likelihood (DerSimonian and Laird 1986) and in a Bayesian analysis µ and τ 2 are given a prior distribution, typically a normal/inverse gamma prior, because this conjugate form results in simplifications when estimating the posterior distribution.

Bayesian nonparametric and semiparametric models
The approximation of P i (ψ i ) by a normal distribution in (1a) is typically supported by some theoretical result, for example the asymptotic normality of the observed log odds ratio or, more generally, the asymptotic normality of maximum likelihood estimates. By contrast, the normality statement in (1b) is a modelling assumption, which generally is made for the sake of convenience and does not have any theoretical justification. A number of authors have encountered meta-analyses in which the distribution of the study effects appears to be non-normal-for example because some studies appear to be outliers-and have suggested that in (1b) the normal distribution be replaced by a distribution that accommodates outliers. Sharples (1990) discusses normal contamination models in a classical Bayesian one-way random effects model. Weiss, Cho, and Yanuzzi (1999) develop a Markov chain Monte Carlo (MCMC) approach for fitting models in which either the likelihood or the prior is a mixture of two normals. Dozens of authors have considered t distributions. While t distributions accommodate outliers better than do normals, the distribution of the random effects may deviate from normality in ways that do not involve heaviness of tails. In fact, in one of the examples in the present paper, the distribution of the study effects appears to be multimodal.
This leads us to consider a model of the form conditional on ψ i , D i ∼ N (ψ i , σ 2 i ), independently, i = 1, . . . , m, conditional on F, ψ i iid ∼ F, i = 1, . . . , m, where π is a nonparametric prior on the family of all distribution functions. A natural approach is to take π to be a Dirichlet process (see Ferguson 1973Ferguson , 1974, for a formal treatment, or the beginning of Section 2 of the present paper for an informal review) and indeed this is done by Mallick and Walker (1997). They take π to be D α , the Dirichlet process with parameter measure α = M · H, where H is the centering distribution and M is the precision parameter, and they give a method for estimating posterior distributions based on this prior.
A key drawback of this approach is that the user must specify the centering distribution H exactly. A more flexible approach involves using a model based on mixtures of Dirichlet processes as introduced by Antoniak (1974). In the context of the meta-analysis model, Model (2), one can take π to be a mixture of Dirichlet process priors with parameter consisting of the triple ({H ϑ } ϑ∈Ω , M, λ), where {H ϑ } ϑ∈Ω is a parametric family of distributions, M is a precision parameter, and λ is a prior on Ω. An important particular case, which we discuss further below, is to take {H ϑ } ϑ∈Ω to be the two-parameter family of normal distributions N (µ, τ 2 ), and take λ to be the normal/inverse gamma prior on ϑ = (µ, τ ). Thus, (2c) becomes Consider now Model (1). For this model, the parameter µ has a clear interpretation as the mean or median of the N (µ, τ 2 ) distribution (and if we replace the normal with a t then µ also has a clear interpretation as the median of the distribution). In contrast, in Model (3) the parameter µ is not equal to x dF (x), the mean of F . (If F is chosen from a Dirichlet process with centering distribution H, then even if H is symmetric about µ, the probability that F is symmetric about µ is 0.) In fact, µ does not play the role of any location parameter for F . This is a problem in certain situations where there is no overwhelming evidence that the treatment is better than the control, and one is interested primarily not in estimation of the center-specific ψ i 's, but rather in resolving the basic question of whether the overall mean µ is different from 0, as this may for example justify carrying out further studies. For this reason, Burr and Doss (2005) propose using a "mixture of conditional Dirichlet processes" as the prior π in (2c). Loosely speaking, if F ∼ D M N (µ,τ 2 ) , then the distribution of F conditional on the event that the median of F is µ is called a conditional Dirichlet, and we will denote it by D µ M N (µ,τ 2 ) . By construction, if F ∼ D µ M N (µ,τ 2 ) , then with probability one µ is the median of F , so we have a parameter µ with a clear interpretation, and we can proceed to carry out inference about this parameter.
To summarize, Burr and Doss (2005) consider the hierarchical model conditional on µ, τ, In (4d) and (4e), d 1 , d 2 , d 4 > 0, d 3 ∈ R, and the resulting prior on (µ, τ ) is the normal/inverse gamma prior, which we will denote λ d . For the purpose of giving a dispersed prior on µ and τ , they take d 1 = 0.1, d 2 = 0.1, d 3 = 0, and d 4 = 1000. (For this choice, the marginal distribution of µ is a t distribution with 0.2 degrees of freedom, median 0, and scale parameter 1000 1/2 , which is a fairly diffuse prior.) Burr and Doss (2005) develop a Markov chain Monte Carlo algorithm for estimating the posterior distribution of the vector (ψ 1 , . . . , ψ m , µ, τ ) in this model. The purpose of this paper is to present and describe the R package bspmma, which implements the algorithms developed in Burr and Doss (2005), and to illustrate the use of these algorithms. The package bspmma is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=bspmma.
The rest of the paper is organized as follows. In Section 2, we give a review of Dirichlet and conditional Dirichlet processes and their mixtures, and give motivation for their use. Then we describe a method for determining whether the model based on mixtures of conditional Dirichlet processes fits the data as well or better than does a model based on ordinary mixtures of Dirichlet processes. A by-product of our model validation method is a procedure for carrying out an empirical Bayes selection of the Dirichlet precision parameter M . In Section 3, we show how to use the package on two examples.
Several other packages are available for conducting meta-analyses in the R language (R Development Core Team 2012). We mention rmeta (Lumley 2009), meta (Schwarzer 2010), and metafor (Viechtbauer 2010), which enable frequentist analyses, and assume that the effect distribution is normal; also, the metafor package allows meta-regression. These packages have extensive functions for plotting and producing summary statistics.
The DPpackage (Jara, Hanson, Quintana, Müller, and Rosner 2011;Jara 2007) is a very broad package for implementing Bayesian semiparametric models. It fits certain hierarchical models of the sort (2) where π is a mixture of ordinary Dirichlet processes, and also handles models from a variety of areas such as density estimation, survival analysis, and generalized linear mixed models. Branscum and Hanson (2008) also consider Bayesian semiparametric models for meta-analysis, and their prior on the effect distribution is a (finite) Polya tree, rather than a Dirichlet-based process. More specifically they consider a model similar to (4), except that in line (4c), the conditional Dirichlet process is replaced by a "median-constrained" Polya tree. The DPpackage function PTmeta implements the model in Branscum and Hanson (2008). In Section 4 we further discuss the Polya tree prior used in Branscum and Hanson (2008) and discuss the relationship between the model they use and the model used here. The bspmma package is not a general-purpose package for fitting Bayesian nonparametric or semiparametric models. It deals only with random-effects meta-analysis, and focuses specifically on inference for particular parameters of interest in meta-analysis, and on hyperparameter selection and model assessment.

Background
Let P be the set of all probability distributions on the real line. A parametric family H ϑ , ϑ ∈ Ω ⊂ R p may be viewed as a finite-dimensional subset of the infinite-dimensional space P, and a prior on Ω induces a prior on P which gives all its mass to the finite-dimensional family H ϑ , ϑ ∈ Ω. When there is no reason to think that any particular parametric model is exactly true, a desirable feature of a prior on P is that it be "nonparametric," i.e., does not give all its mass to any finite-dimensional subset of P. (More formally, having specified a topology on P, we would like the prior to have the "full support property," i.e., it gives positive probability to any open set.) The most commonly used nonparametric priors are mixtures of Dirichlet processes. Standard definitions are Ferguson (1973Ferguson ( , 1974, and Antoniak (1974), but here we give a brief intuitive review of those properties of this class of priors that are directly relevant to the present situation. Before describing mixtures of Dirichlet processes, we first discuss (single) Dirichlet processes. A Dirichlet process is a distribution on the space P, and is parameterized by a pair (H, M ), where H is a distribution function and M is a positive number. The product α = M H is a finite measure, and the Dirichlet process is denoted by D α (the measure α determines and is determined by the pair (H, M )). For the purpose of understanding the modelling assumptions when we use a Dirichlet process in the present context, the best way to explain this prior is through the construction of Sethuraman (1994). Generate B 1 , B 2 , . . . (1−B r ), and form the random distribution where δ a denotes the probability measure giving unit mass to the point a. Sethuraman (1994) showed that F defined by (5) is distributed according to the Dirichlet process with parameter measure α, as defined in Ferguson (1973).
Key properties of the Dirichlet process are: (i) The "center" is H in the sense that for every t, we have E(F (t)) = H(t); (ii) M is a precision parameter which determines the concentration of the prior around H: for large M , the distribution of F (t) is tightly concentrated around H(t) for every t, while for small values of M , the distribution is more diffuse; and (iii) the Dirichlet process is a nonparametric prior in the following sense: if the support of H is the entire real line, then the Dirichlet process has the full support property referred to earlier.
Note that if we first choose F from D α and then generate ψ 1 , . . . , ψ n iid ∼ F , then since F is discrete, with positive probability there will be ties among the ψ i 's; that is, the ψ i 's will form clumps. When M is small, the first few P j 's add up to nearly 1, and therefore the probability of ties is higher. This leads to important consequences regarding the posterior distribution of ψ 1 , . . . , ψ n given the data D 1 , . . . , D n . Consider the distribution of ψ i . As is standard in hierarchical models, conditioning on the data and ψ j , j = i results in shrinkage towards D i and toward a grand mean. But because of the propensity for clumping, the posterior is also shrunk towards those ψ j 's that are close to D i (the extent of the shrinking is determined in part by the standard error σ i ). This last results in a way of pooling information that involves weighing results of similar studies more heavily.
As mentioned in Section 1, a mixture of Dirichlet processes involves a parametric family H ϑ , ϑ ∈ Ω ⊂ R p , a precision parameter M > 0, and a distribution λ on Ω. Formally, it is the integral D M H ϑ λ(dϑ). We think of it as arising in a two-stage process, where in the first stage we pick the parameter ϑ according to λ, and in the second stage we pick F from the Dirichlet distribution with parameter M H ϑ . A mixture of conditional Dirichlet processes is defined in the obvious way, i.e., the Dirichlet process in the integrand is replaced by a conditional Dirichlet process. In the situation considered in this paper, {H ϑ } ϑ∈Ω is the two-parameter family of normal distributions. The role of the precision parameter may be described intuitively as follows. The parametric family {H ϑ } ϑ∈Ω is a (p-dimensional) "line" in the infinite-dimensional space of probability measures on the real line, and we imagine a "tube" around this line. Large values of M correspond to a narrow tube (which we interpret as an expression of confidence that the normal model holds), and small values of M correspond to a wider tube. (Sethuraman and Tiwari 1982 give a careful description of the proper interpretation of the Dirichlet process when M is very small.) It should be mentioned that when M is large, for any µ and τ the distributions D M N (µ,τ 2 ) and D µ M N (µ,τ 2 ) are close, since they are both nearly equal to the point mass at the N (µ, τ 2 ) distribution and, consequently, for large M the λ-mixture of Dirichlet processes and λ-mixture of conditional Dirichlet processes are close. A measure-theoretic discussion of this point is given in Doss (1985).

Model assessment and hyperparameter selection
In Section 1 we discussed conditional Dirichlet processes as a tool that enables us to make inference about a location parameter of the distribution of the latent effects. While there are inferential reasons for using a model based on these priors, it is nevertheless natural to ask if such a model provides a good fit for the data. More precisely, it is natural to ask if a model based on mixtures of conditional Dirichlet processes provides a better fit than does a model based on ordinary mixtures of Dirichlet processes.
In Model (4) the unknown parameter is effectively θ = (ψ, µ, τ ), where ψ = (ψ 1 , . . . , ψ m ) is the vector of latent variables (F plays a role only in the sense that it induces a distribution on the latent variables). Different prior distributions on θ induce different models for the data vector D. Generally speaking, when deciding between two models M 1 and M 2 for the data D, it is standard to consider the marginal likelihoods m M 1 (D) and m M 2 (D) which, in the framework of the present paper, are defined by In (6), D (θ) is the likelihood function, and ν i is the prior distribution of θ under model M i . (The marginal likelihood is just the likelihood of the data with the unknown parameter integrated out.) If these marginal likelihoods can be computed, it is common practice to select the model for which the marginal likelihood is greater. In our situation, we fix some value of the precision parameter M , some value d = (d 1 , d 2 , d 3 , d 4 ) of the hyperparameter vector of the normal/inverse gamma prior on (µ, τ ), and M 1 will be the mixture of conditional Dirichlet processes and M 2 the mixture of ordinary Dirichlet processes, each based on the hyperparameter specification h = (M, d). It will be more convenient to denote these models by We will estimate m c h /m o h by estimating m c h /m o h 1 and m o h /m o h 1 , and taking the ratio. Let ν c h,D and ν o h,D denote the posterior distributions of θ when the priors are ν c h and ν o h , respectively. We note that the marginal likelihood is simply the normalizing constant in the statement "the posterior is proportional to the likelihood times the prior." Proceeding as in Doss (2012), we write and using the fact that we obtain the identity Note: In (8) and (9), we have used the formalism of Radon-Nikodym derivatives instead of writing the ratio of densities, and we now digress briefly to explain this. When we have the parametric model corresponding to the case M = ∞, i.e., model (4) except that lines (4b) and (4c) are replaced by the single line ψ i iid ∼ N (µ, τ 2 ), the distribution of θ has a density on R m+2 (with respect to Lebesgue measure), and (8) is just the trivial formula which involves only densities. When M < ∞, the distributions ν c h , ν c h,D , ν o h 1 , and ν o h 1 ,D do not have densities with respect to Lebesgue measure on R m+2 , since there is positive probability that the ψ j 's have ties, and in (8) with respect to ν o h 1 ,D ." Loosely speaking, for θ ∈ R m+2 , the Radon-Nikodym derivative is given by where B θ is the ball in R m+2 centered at θ and with radius , and this is the measure-theoretic equivalent of the "ratio of densities." A more detailed discussion is given in the Appendix of Burr and Doss (2005).
−→ " means "converges almost surely." An explicit formula for [dν c h /dν o h ] was obtained in Burr and Doss (2005) and was extended in Doss (2012)  In a similar way, we have the identity , and for the same Markov chain θ 1 , θ 2 , . . . with stationary distribution ν o h 1 ,D , we have An explicit formula for [dν o h /dν o h 1 ] is given in Theorem 1 of Doss (2012). Combining (7), (10), and (11), we obtain that In the present situation, we will keep the prior λ on (µ, τ ) fixed and vary M ; specifically, we will fix d at (0.1, 0.1, 0, 1000), and we will be interested in estimating the ratio which, by a slight abuse of notation, we will denote by m c M /m o M . In the illustration in Section 3.3, we give a plot of the estimate of this ratio. To summarize, the discussion above provides a method for estimating this ratio simultaneously for all M , which gives a tool for comparing the conditional Dirichlet model with the ordinary Dirichlet model.
It is also natural to ask how does one select a value for M . We consider first the unconditional Dirichlet model. As before, d is fixed at (0.1, 0.1, 0, 1000) and so is not in the picture, and by slight abuse of notation we will write  (10). Section 3 provides an illustration of obtaining the empirical Bayes choice of the precision parameter M . To summarize, the discussion above provides a method for carrying out an empirical Bayes selection of the precision parameter M , whether we are using the unconditional or conditional Dirichlet model. In Section 3 we explain why estimates based on a single Markov chain can be unstable when the range of M is large, and discuss improvements which involve using multiple chains.
satisfies a bivariate central limit theorem. Therefore, the averages on the left side of (10) and (11) satisfy a central limit theorem, and by the delta method applied to the function g(u, v) = u/v, we also have a central limit theorem for the ratio on the left side of (12). Consequently we are able to form standard errors for all the estimates discussed above.
Markov chain Monte Carlo Details regarding an MCMC algorithm for estimating the posterior under Model (4) (mixture of conditional Dirichlets) are given in Burr and Doss (2005). It is important to note that the chain is a (m + 1)-cycle Gibbs sampler which cycles through the vector of ψ i 's and the pair (µ, τ ), and that the main part of the computational burden is in the first part of the cycle, the generation of the vector of ψ i 's.

Usage
In Output from the main bspmma functions can be analyzed further using routines in the R packages boa (Smith 2007) and coda (Plummer, Best, Cowles, and Vines 2006), which have functions for diagnosis of convergence, and for providing graphical and statistical summaries, of MCMC output. The two packages implement a similar set of published methods and can be applied to output from the bspmma functions dirichlet.c and dirichlet.o. The boa package has a menu-driven interface, and due to our focus on use of source code, we chose to illustrate application of the coda package here.

Example 1: Effect of NSAIDs on breast cancer
The hypothesis that use of non-steroidal anti-inflammatory drugs (NSAIDs) reduces the risk of breast cancer is currently of considerable interest and controversy in the medical literature (Harris et al. 2003;Terry et al. 2004;Zhang, Coogan, Palmer, Strom, and Rosenberg 2005). Because NSAIDs must be taken regularly for many years to have this beneficial effect, it is not possible to carry out randomized, controlled experiments in healthy populations. There have been many studies on the effect of NSAIDs on breast cancer (and other cancers) during the last 15 years; the studies that have been done are either at the epidemiological or at the cellular and molecular level, and several have strongly suggested that long-term use of NSAIDs significantly decreases the risk of breast cancer. However, this result is not seen in all studies; some suggest only a slight risk reduction and others in fact suggest no risk reduction at all. Harris, Beebe-Donk,  give a review of this work and discuss the epidemiological studies that have appeared in the medical literature. Each study reports a risk ratio for NSAIDs use vs. no NSAIDs use. This risk ratio is either simply an odds ratio obtained from a case-control study or an odds ratio based on a multiple logistic regression analysis that takes into account important risk factors for breast cancer.
It is not surprising that the studies give inconsistent results, since there is heterogeneity in the subject pools (characteristics such as age, ethnicity, and health status vary across the studies), and in the way the data were obtained (covariates collected, statistical method used, etc.). It is certainly of interest to carry out a meta-analysis of these studies, and because of the heterogeneity, it seems clear that the meta-analysis should be based on a random effects model. There have been some meta-analyses in the epidemiological literature. Khuder and Mutgi (2001) find fifteen studies and point out that the studies have heterogeneous effect estimates; they then form several sub-groups of the data, and carry out fixed-effects analyses of homogeneous sub-groups and random-effects analyses of heterogeneous groups. Gonzalez-Perez, Rodriguez, and Lopez-Ridaura (2003) use the classical random-effects meta-analysis for each of six cancers including breast cancer. For ten different cancers including breast cancer, Harris et al. (2005) do fixed-effects meta-regression with one study-level covariatedose of NSAID-using just the subset of studies for which dose information is available. All the random-effects meta-analyses we have seen assume normality of the distribution of effects, but without justification.
Columns 2-3 of Table 1 give the data from Harris et al. (2005) for each of the 17 studies that pertain to the particular NSAID aspirin. These columns give the reported risk ratio and a confidence interval for the risk ratio. Although these authors consider dose as well, and therefore consider only the 14 out of the 17 studies which contain dose information, we ignore dose in this analysis. It would be of interest to also carry out another analysis that includes dose.
For each study j, let L j and σ j denote the observed log risk ratio and its standard error, respectively, for that study. Let ψ j denote the true log risk ratio, i.e., ψ j is the log risk ratio that would be obtained if the sample size for study j were infinite. Standard asymptotic theory justifies writing L j ∼ N (ψ j , σ 2 j ). Therefore, if we let F represent the distribution of unknown shape of the ψ j 's, we are led to precisely Model (2), with D j = L j having standard deviation σ j . Column 4 of Table 1 gives L j (the observed log risk ratio) and column 5 gives σ j (the standard error of L j ). Harris et al. (2005) do not give the σ j 's, but these can be obtained from the confidence intervals for the risk ratios, which are given in column 3.
The package includes the dataframe breast.17, which gives the log risk ratios and their standard errors (columns 4 and 5 from Table 1). To prepare to run the main functions in the package, we first set up the data as a matrix with two columns, where the first column contains the log risk ratios and the second column contains the standard errors, as follows: R> library("bspmma") R> data("breast.17") R> breast.data <-as.matrix(breast.17) The next two commands show how to use the conditional Dirichlet MCMC function, first setting the precision parameter value to be M = 5 and then M = 1000. The seed is set (to 1) so that the example can be reproduced exactly. The algorithm to run MCMC for the conditional Dirichlet model, Model (4) completes 4000 cycles in about 14 seconds (as timed by R function system.time), on an Intel 2.8 GHz Q9550 running Linux.  Table 1: Summary data from 17 studies on aspirin and breast cancer: RR is the risk ratio for aspirin vs. no aspirin; CI is the associated confidence interval; L j is the observed log risk ratio; and σ j is the estimated standard error of L j .
R> breast.c1 <-dirichlet.c(breast.data, ncycles = 4000, M = 5) R> set.seed(1) R> breast.c2 <-dirichlet.c(breast.data, ncycles = 4000, M = 1000) Another argument to the function is the hyperparameter vector d of the normal/inverse gamma prior on (µ, τ ), which in the above runs is taken by default to be d = (0.1, 0.1, 0, 1000); there is also an argument start, which allows the user to supply starting values for the parameters. By default, the starting value for ψ i is the individual study estimate D i , for i = 1, . . . , m; and the starting values for µ and τ are the mean and standard deviation of the D i 's, respectively.
The output of dirichlet.c is a list with several components; the main component, called chain, is a matrix such that each row contains the output of a single iteration of the Monte Carlo simulation. In the above runs the matrix is of dimension 4001 × 19. The first row contains initial parameter values; each of the remaining 4000 rows contains a set of parameter values for one iteration of the chain. The columns contain the simulated values of the parameters of Model (4). Columns 1-17 contain the individual ψ i 's, that is, the log risk ratios for studies 1 through 17. Column 18 contains µ, the median of the distribution of the ψ i 's (and the mean of the centering normal distribution of the conditional Dirichlet model, Model (4)), and column 19 contains τ (the standard deviation of the centering normal distribution of the conditional Dirichlet model).
For each of the runs above taken separately, the MCMC output can be checked for convergence by any of the diagnostic functions in coda. This requires using the chain component of the bspmma output as an argument to the desired coda function, after first converting it to an object of class mcmc in coda. For the first chain above, with M = 5, we use coda to get the autocorrelation plot for the last three ψ i 's and for µ and τ . The reason for examining the autocorrelation plots for these particular ψ i 's is that they include the two most unusual studies. The study by "Langman" (in column 15) had the only positive estimate of the log odds ratio, and the study by "Moorman" (in column 17) had the estimate farthest from the mean, the most extreme negative estimate.
R> library("coda") R> breast.coda <-mcmc(breast.c1$chain) R> autocorr.plot (breast.coda[, 15:19]) In the resulting plots (not shown), the autocorrelations are 0 (for practical purposes) for all lags greater than three; thus this particular diagnostic does not provide evidence of a problem with the chain. There are many other functions in coda, for posterior inference as well as for convergence diagnosis, which can be applied in a similar manner.
In addition, there are two user-accessible functions in bspmma which are useful for exploration of models corresponding to different values of the Dirichlet precision parameter M . The function draw.post may be used to produce superimposed graphs for several M values, of the distributions of the hyperparameters µ and τ and, optionally, of the individual ψ i 's. The function uses the R function density to compute the kernel density estimate of the posteriors and the function matplot to produce the superimposed plots. The function has one required input argument, which is a list object, each element of which is the chain component of the output from dirichlet.c or dirichlet.o. These chain components are matrices of MCMC output and are assumed to correspond to different values of the Dirichlet precision parameter M . The names of the list elements will go into legend labels. An optional argument, burnin, with default value 1000, specifies how many of the initial runs to omit. Below is sample code which produces plots of the distributions of µ and τ , but not the ψ i 's, for the breast cancer data. The graphs produced by this code are shown in Figure 1.
R> breast.c1c2 <-list("5" = breast.c1$chain, "1000" = breast.c2$chain) R> draw.post(breast.c1c2, burnin = 100) The value of M = 1000 corresponds closely to a parametric model, whereas the value M = 5 is a typical value that would be used in practice. From the shapes of the two distributions in both of the above graphs, for µ and τ , we see that as expected, there is greater certainty or precision expressed in the parametric analysis. In addition, the centers of the posterior distributions arising from the parametric and semiparametric models, are different. To look at a brief comparison of the quantitative conclusions for different M values, we can get descriptive statistics of the posterior distributions using the function describe.post as follows: R> describe.post(breast.c1c2, burnin = 100) The output is the posterior means, and the probabilities that the risk ratios are less than 1:   For the overall conclusion about the parameter µ, both the parametric and the semiparametric model show very high significance of the result so that with either model one concludes that long-term use of NSAIDs appears to be associated with reduction of the risk of breast cancer, at least at the study level.
3.2. Example 2: Decontamination of the digestive tract Burr and Doss (2005) summarize the background for this dataset, which appeared in Selective Decontamination of the Digestive Tract Trialists' Collaborative Group (1993). The dataset in the package, ddtm.s, consists of fourteen rows corresponding to fourteen different studies of the effect of a combined regimen of topical and systemic antibiotics on mortality from infection in intensive care units. The dataset has four columns, giving counts of deaths and total sample sizes, first for the treatment group and then for the control group. Each row of counts must be converted to an odds ratio and its standard error. This data is accessed (the first four rows are shown), and then converted to the appropriate form for doing the meta-analysis, as follows: R> library("bspmma") R> data("ddtm. R> ddtm.s$treat.deaths <-ddtm.s$treat.deaths + 0.5 R> ddtm.s$treat.total <-ddtm.s$treat.total + 1 R> ddtm.s$cont.deaths <-ddtm.s$cont.deaths + 0.5 R> ddtm.s$cont.total <-ddtm.s$cont.total + 1 R> attach(ddtm.s) R> or <-(treat.deaths / (treat.total -treat.deaths)) / + (cont.deaths / (cont.total -cont.deaths)) R> lor <-log(or) R> se.lor <-((treat.total / (treat.deaths * (treat.total -treat.deaths))) + + (cont.total / (cont.deaths * (cont.total -cont.deaths))))^0.5 R> ddtm.14 <-data.frame(psi.hat = lor, se.psi.hat = se.lor) Next we run the Markov chain for the conditional Dirichlet model, Model (4), using several values of the precision parameter M . Below we show the code for values of M = 5, M = 20, and M = 100, plot the posterior distributions of µ and τ (see Figure 2), and compute the summary statistics. The algorithm to run MCMC for the conditional Dirichlet model, Model (4), completes 4000 cycles in about 11 seconds (as timed by the R function system.time), on an Intel 2.8 GHz Q9550 running Linux.

Bayes factors for breast cancer data
Package functions bf1 and bf2 estimate the Bayes factors for conditional vs. ordinary Dirichlet models, for a series of M values, by applying (12). This requires formulas for the Radon-Nikodym derivatives on the left-hand side of (12), and it requires generation of MCMC samples from the posterior distribution of θ under the ordinary Dirichlet model with specified hyperparameter h 1 . Regarding the hyperparameter value h 1 = (M 1 , d 1 ) under which the Markov chain will be run to estimate the Bayes factors for a range of M 's, in principle, any value of h 1 can be used, but in practice one would like to use a value of M 1 that is "close" to all values of M for which the Bayes factor m c h /m o h will be estimated, keeping in mind that in the practical sense, M = 1 is farther away from M 1 = 4 than is M = 10. In fact, for better accuracy of the estimates, it is preferable to run multiple Markov chains corresponding to several values of h 1 . Buta and Doss (2011) motivate and develop the use of multiple chains in a general context, and Doss (2012) gives a discussion focused on the present situation involving Dirichlet processes; in particular, he gives guidelines regarding the selection of the multiple values of h 1 . We do not give a theoretical discussion of these issues in the present paper, but mention only that doing importance sampling with respect to multiple chains results in a very significant increase in the accuracy of the estimates. We follow the recommendations given in Doss (2012) and use values of the Dirichlet precision parameter M starting from 2 −2 = 0.25 and increasing to 2 6 = 64 in multiples of 2, for a total of nine values. These values of M 1 should yield accurate estimates unless the user wishes to estimate the Bayes factor for values of M that are very small (M less than 0.25), which is generally considered dubious (Sethuraman and Tiwari 1982). Very large values of M are "covered" since for most data sets, the posteriors corresponding to M = ∞ and M = 64 are close. The nine simulations are carried out by the user-accessible function bf1. The only required argument is the data, in the form of a two-column matrix, as for the function dirichlet.c, illustrated in Section 3.1 and Section 3.2.
There are two steps controlled by the user in the package implementation of the multi-chain algorithm. First, we need the constants for the denominator on the left side of Equation 2.5 in Doss (2012). This requires output from the nine Markov chains produced by bf1, and then, a call to the function bf2 to use the MCMC output to compute the constants. It is not necessary to understand the role of these constants nor the method needed to compute them in order to use bf1 and bf2, and the user can view these functions as a "black box." The necessary commands for this preliminary step are illustrated below for the breast cancer data. In our analysis we fix the hyperparameter vector of the normal/inverse gamma prior on (µ, τ ) to be (0.1, 0.1, 0, 1000), which is the default value in the function bf1. The total number of cycles for each of the nine chains is indicated by the argument ncycles, and the number of simulations dropped for each chain is given by the argument burnin; in this example, the object returned by bf1 is a list of nine matrices, each having 5000 − 1000 = 4000 rows of MCMC output. Calls to bf1 with ncycles set at 5000 take about 1.5 minutes to run on an Intel 2.8 GHz Q9550 running Linux; calls to bf2 take about 7 seconds in examples this size. R> breast.data <-as.matrix(breast.17) R> chain1.list <-bf1(breast.data, ncycles = 5000, burnin = 1000) R> cc <-bf2(chain1.list) The next step is another set of nine MCMC simulations independent of the previous set (we use a different seed value). These will be used for the final computation of the sequence of Bayes factors.
R> chain2.list <-bf1(breast.data, seed = 2, ncycles = 5000, burnin = 1000) In general, this method of estimating Bayes factors produces accurate estimates for M ranging from a little less than 0.25 to infinity; however, for this breast cancer dataset, after experimentation, we determined that the only real information in the Bayes factor plot occurs for M ≤ 20; after that, the graph levels off. The commands to produce the plot for values of M from 0.8 to 20 are shown below; this command required about 3 minutes on an Intel 2.8 GHz Q9550 running Linux.
R> breast.bfco <-bf.c.o(from = 0.8, incr = 0.2, to = 20, cc = cc, + mat.list = chain2.list) R> draw.bf(breast.bfco) The resulting graph is shown in Figure 3. Note that at the right end of the graph, the value of the Bayes factor is 1 for M = ∞, since the two models are the same: They are both equal to the parametric model; thus M = ∞ serves as a reference point. For this particular data set, the Bayes factors are always less than 1, and thus the ordinary Dirichlet model is always preferred to the conditional model, although the preference is hardly strong, particularly when M ≥ 8.
For selection of the value of M , the package has the function bf.o for the ordinary model and bf.c for the conditional model. Since for this dataset the ordinary Dirichlet model is preferred to the conditional, we next illustrate how to select the value of M in the ordinary model. The steps to carry out the multi-chain algorithm are the same as given above, and for convenience, we will use the same sets of chains and constants as before; the only change is in the final two commands, which are shown below. The call to bf.o takes about 1.5 minutes to run on an Intel 2.8 GHz Q9550 running Linux.
R> breast.bfo <-bf.o(from = 0.8, incr = 0.2, to = 20,cc = cc, + mat.list = chain2.list) R> draw.bf(breast.bfo) The resulting graph is shown in Figure 4. The relevant features are the shape of the graph, and the point of the maximum, which occurs in this case at M = 2.4. Then, the Bayes factor for M = 2.4 vs. M = ∞ may be obtained by taking the ratio of the appropriate elements in the object breast.bfo, as shown below.
R> breast.bfo$y[9] / breast.bfo$yinfinity [1] 1.7541 The value of 1.75 for the Bayes factor suggests a slight preference for the nonparametric model over the parametric for the breast cancer dataset.
Since for this dataset the conditional model is not preferred to the ordinary, we created a hypothetical dataset by changing three of the points in such a way as to make the data more clumped. Figure 5 displays the original and modified versions of the breast cancer data for the 17 studies. The locations of the vertical lines are the observed log odds ratios, and their heights are proportional to the reciprocals of the estimated standard errors. The distribution is estimated using a kernel density approach (as implemented in the R function density) based on the observed log odds ratios, with weights proportional to the estimated standard errors. This density estimate is exploratory and should be viewed with caution, since the observed log odds ratios are only estimates of the true log odds ratios. Figure 6 shows the new Bayes factors, which now indicate that the conditional Dirichlet model is preferred to the ordinary model. From a wide range of experiments we have carried out, we have noticed that the Bayes factor plot seems to favor the model based on conditional Dirichlets for data sets which have outliers and also have clusters (although it is very difficult to come up with a mathematical explanation for this phenomenon). In general it is hard to tell by visual inspection of a data set which model will be favored, and the only way to tell for sure is to look at the Bayes factor plot.

Discussion
The core contributions of bspmma are: (i) a suite of functions which implement meta-analysis when the effect distribution has as prior a mixture of conditional Dirichlet processes, (ii) functions to carry out a comparison between a model based on a mixture of conditional Dirichlets vs. a model based on a mixture of ordinary Dirichlets, and (iii) functions to carry out an empirical Bayes selection of the precision parameter M of the Dirichlet process. It should be noted that (iii) effectively enables a decision on whether to use a semiparametric model or use a parametric model: an empirical Bayes choice of M = ∞ indicates that there is no compelling reason to use a more complicated semiparametric model, and that the parsimonious choice of a parametric model is adequate.
As mentioned in Section 1, it is possible to use mixtures of finite Polya trees instead of mixtures of conditional Dirichlets in model (4). This is done in Branscum and Hanson (2008), which also gives a comparison of Polya tree and conditional Dirichlet models on the digestive tract dataset of Section 3.2. In order to be able to comment on their approach we very briefly review the construction they use. Let {H ϑ } ϑ∈Ω be a parametric family of distributions on R, let ρ be a function mapping the positive integers into the positive real numbers, and let c > 0. We initially view ϑ as fixed. The real line is first split into two intervals, with the split point being the median of H ϑ ; then each interval is further split into two subintervals, with split points being H ϑ (1/4) and H ϑ (3/4); and this process is continued indefinitely. At the j-th level, when the k-th interval is split (k = 1, . . . , 2 j ), we form a random variable p(j, k) with the beta distribution Beta(cρ(j), cρ(j)). These random variables are all independent. At any level j, an interval at that level has (random) probability given by the product of all the beta random variables along the branch of the tree leading to that set. It can be shown that this process specifies a distribution on the set of cumulative distributions on the real line, and that if F is distributed according to this distribution, then for every t ∈ R, E(F (t)) = H ϑ (t). This distribution is called a Polya tree, and a mixture of Polya trees results when ϑ is random, and a probability distribution λ is assigned to it. In principle, the function ρ can be very general but, Branscum and Hanson (2008) consider the interesting one-parameter family given by ρ ν (j) = 2 −νj . Roughly speaking, small values of ν give rise to distributions which are smooth, while large values of ν give rise to distributions which are discrete. For instance, if ν < 0, the Polya tree gives probability 1 to the set of distributions which are absolutely continuous with respect to Lebesgue measure, while the case ν = 1 gives exactly the Dirichlet process. By taking the random variable associated with the first split to be deterministically equal to 1/2 (i.e., p(1, 1) = p(1, 2) = 1/2), one guarantees that if F is distributed according to the Polya tree, the median of F is equal to the median of H ϑ , which gives a direct analogue to the conditional Dirichlet process.
Of course a Polya tree is a probability distribution on an infinite-dimensional space, and for computational purposes it is necessary to take the number of levels J to be finite. Hanson (2006) gives recommendations for the value of J based on empirical evidence, but there are no theoretical results for the choice of J. Inference based on Polya trees can depend on the sequence of binary partitions used to define the tree. It is worth mentioning that the construction of the Dirichlet process does not depend on a sequence of partitions of the real line, and the Markov chain algorithm used in bspmma is exact, i.e., the stationary distribution of θ = (ψ, µ, τ ) is exactly the conditional distribution of θ given the data for model (4), and the only error incurred is the Monte Carlo error associated with the rate of convergence of the chain.
The class of Polya trees is overwhelmingly large, and this makes it necessary-and difficultto make a choice of a particular Polya tree to use. The choice ρ ν (j) = 2 −νj makes a reduction to a one-parameter family, which contains the conditional Dirichlet processes used in this paper. It would be interesting to develop methods for creating Bayes factor plots similar to those given in Section 2, for the purpose of enabling an empirical Bayes estimate of the hyperparameter ν. More specifically, let m ν be the marginal likelihood of the data when we use a Polya tree with parameter ν, and let B(ν) = m ν /m 1 . Suppose we can develop an estimateB(ν) for a range of values of ν that includes ν = 1. If the maximum of the estimated function is achieved at or near 1, then this would be viewed as evidence that the Dirichlet-based model holds, while if the maximum occurs at a point that is far away from 1, the Dirichlet-based model would be viewed as inadequate.