A Bayesian Nonparametric Multiple Testing Procedure for Comparing Several Treatments Against a Control

We propose a Bayesian nonparametric strategy to test for differences between a control group and several treatment regimes. Most of the existing tests for this type of comparison are based on the differences between location parameters. In contrast, our approach identifies differences across the entire distribution, avoids strong modeling assumptions over the distributions for each treatment, and accounts for multiple testing through the prior distribution on the space of hypotheses. The proposal is compared to other commonly used hypothesis testing procedures under simulated scenarios. Two real applications are also analyzed with the proposed methodology.


Introduction
We consider the problem of comparing continuous responses from p populations or treatments against those from a control group. This kind of comparison falls within the setting of multiple testing problems, that is, procedures that involve the simultaneous testing of several hypotheses (see, e.g., Hochberg and Tamhane, 1987). More formally, let y = (y_c, y_1, ..., y_p)^t be the response variable, where y_c = (y_{c,1}, ..., y_{c,n_c})^t is a random sample of size n_c from the control population, whose distribution G_c is supported on R. Similarly, for 1 ≤ k ≤ p, the vector y_k = (y_{k,1}, ..., y_{k,n_k})^t denotes an independent random sample of size n_k from the kth population that is to be compared against the control, with observations distributed according to G_k, also defined on R.
Each hypothesis H_γ ∈ M is indexed by a binary vector γ = (γ_1, ..., γ_p) ∈ {0,1}^p, where γ_k = 1 indicates that the kth population differs from the control. For example, with p = 3 the hypothesis H_(0,0,0) indicates that all treatments are equal to the control, while H_(0,1,1) postulates that only the first treatment equals the control. From the Bayesian standpoint, evidence for or against each hypothesis is assessed through its posterior probability, and hence all the hypotheses considered can be compared against each other, effectively a multiple comparison problem. Specification (1) poses the multiple comparison problem as a model selection one, where the structure and size of the space of hypotheses M (a.k.a. the model space) facilitates a prior specification that accounts for multiple testing. Details on how penalization for model complexity and multiple testing are built into the posterior probabilities are provided in Section 2.
The space of hypotheses M can be visualized as a partially ordered set through a Hasse diagram (Simovici and Djeraba, 2008). For the case p = 3, in Figure 1 each node represents a hypothesis, and the node labels correspond to vectors γ ∈ {0,1}^p defined as above. The edges connecting the nodes indicate nesting among the models represented by the hypotheses. For example, the path in the partially ordered set given by H_(0,0,0) → H_(1,0,0) → H_(1,0,1) indicates that the model corresponding to the hypothesis where all populations are equal to the control is nested in the model where only population 1 is different from the control, which is itself nested in the model resulting from the hypothesis where populations 1 and 3 differ from the control. This notion of nesting reflects the way in which model parameters are specified: the parameters of a hypothesis nested within others are also present in the hypotheses that nest it. This nested structure is formalized in Section 3.2. Furthermore, representing the model space as a partially ordered set through this nesting of hypotheses enables assigning prior probabilities to hypotheses in a way that accounts for the model space structure. Again, the Bayesian take on this multiple testing problem consists of identifying, among the 2^p hypotheses in M, those best supported by the data according to their posterior probabilities.
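The poset structure just described is easy to enumerate programmatically. The following sketch (illustrative only; the function names are ours) lists the 2^p hypotheses for p = 3 and the covering relations drawn as edges in the Hasse diagram:

```python
from itertools import product

def hypotheses(p):
    """Enumerate all 2^p hypotheses as binary vectors gamma."""
    return list(product([0, 1], repeat=p))

def hasse_edges(p):
    """Edges of the Hasse diagram: gamma' covers gamma when it flips
    exactly one 0 in gamma to a 1 (immediate nesting)."""
    edges = []
    for g in hypotheses(p):
        for k in range(p):
            if g[k] == 0:
                child = g[:k] + (1,) + g[k + 1:]
                edges.append((g, child))
    return edges

edges = hasse_edges(3)
print(len(hypotheses(3)))  # 8 hypotheses
print(len(edges))          # 12 covering relations
```

Each edge joins a hypothesis to one obtained by declaring a single additional population different from the control, mirroring the nesting of parameters described above.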
In this paper we have three aims: (I) to develop a Bayesian hypothesis testing procedure which yields the posterior probabilities π(H_γ | y) for all H_γ ∈ M while accounting for multiple testing, (II) to relax the strong parametric assumptions in this kind of comparison and identify differences between groups beyond the location and scale parameters, and (III) if some of the populations are different from the control group, to gain understanding of which aspects of their distributions differ.
The comparison of several treatments against a control was first studied by Dunnett (1955), and his proposed method, or extensions of it, are still in use. The method proposed by Dunnett (1955) is based on the construction of confidence statements about the p differences between the mean of each treatment and the mean of the control group. Assuming that the observations are normally distributed, the procedure is based on quantiles of the multivariate t distribution and is capable of testing whether all the differences are simultaneously different from zero with a specified probability. Dunnett and Tamhane (1991) proposed a step-down procedure which provides p-values for the comparisons between the treatments and a control, taking into account the multiple comparison nature of the problem. These p-values are also based on the multivariate t distribution. Dunnett and Tamhane (1991) showed that their step-down procedure is more powerful than the step-down procedure of Holm (1979) and the single-step procedure of Dunnett (1955). The approaches proposed in Dunnett (1955) and Dunnett and Tamhane (1991) assume that observations are normally distributed with unknown treatment means and an unknown but common variance. Nonparametric alternatives to the Dunnett test are the Nemenyi-Damico-Wolfe test, which makes one-sided treatment-versus-control multiple comparisons based on joint ranks (Hollander and Wolfe, 1999), and the multiple comparison procedure for unbalanced one-way factorial designs in Gao et al. (2008), based on the unweighted version of the nonparametric relative effects and the associated linear pseudo-rank statistics. Both tests are only able to detect possible differences in the locations of the distributions. However, in many applications, the assumptions made by the tests described above are not realistic.

Figure 1: Hasse diagram for the partially ordered set representing the space of hypotheses for three groups against a control.
In spite of the lasting popularity of Dunnett's approach, its validity is compromised by its underlying assumptions (normality and equal variances), and its usefulness is limited because comparisons are based only on the location parameters. In fact, distributions may differ in features other than the location, such as tail behavior, symmetry, and number of modes. Although different variances can be accommodated through location-scale families of distributions, the remaining distributional assumptions required (e.g., normality) may still prove too stringent to hold in practice. In light of this, and with aims (I) and (II) in mind, we propose a Bayesian nonparametric (BNP) inferential strategy (see, e.g., Ghosh and Ramamoorthi, 2003; Hjort et al., 2010; Müller and Mitra, 2013), which draws from elements of the Bayesian model selection literature (Jeffreys, 1961; Scott and Berger, 2010; George and McCulloch, 1993; Liang et al., 2008; Berger and Pericchi, 1996; Wilson et al., 2010; Womack et al., 2015) to penalize both test multiplicity and the complexity of the underlying models. In addition, this approach has the advantage of enabling comparisons over the entire distribution of the control group. Aim (III) is addressed by calculating shift functions (Doksum, 1974), which have been extensively used in the literature to test equality of two distributions G_k and G_c (see, e.g., Doksum, 1974; Doksum and Sievers, 1976; Hollander and Korwar, 1980; Wells and Tiwari, 1989; Lu et al., 1994).
BNP approaches for the comparison of two distributions have been proposed in recent years. This kind of comparison is also known as the two-sample problem. Some contributions are provided in the works of Ma and Wong (2011); Chen and Hanson (2014); Labadi et al. (2014); Holmes et al. (2015); Soriano and Ma (2017). BNP hypothesis testing procedures for two-sample problems in more general spaces, such as manifolds, are studied in Bhattacharya and Dunson (2012). On the other hand, BNP proposals for multiple testing problems have been developed for specific parameters or applications. For example, Ramanan and Berry (1998) propose a multiple comparison method among the means of p populations based on Dirichlet processes. The work of Scott (2009) describes a framework for multiple hypothesis testing of autoregressive time series, which is used to flag companies whose historical performance is significantly different from that expected due to chance. Kim et al. (2009) use Dirichlet processes with spike and slab priors as a base measure to test jointly for the significance of the random effects in mixed models. Recently, Cipolli et al. (2016) proposed a multiple testing procedure for comparing several means based on Polya tree priors. As mentioned before, our approach differs from the existing literature in that it compares multiple populations against a control, identifying differences over the entire distributions. Finally, De Iorio et al. (2004) make a related proposal that uses the dependent Dirichlet process (DDP; MacEachern, 1999, 2000) to define a probability model for random distributions arranged in an analysis of variance (ANOVA)-like array. Although their proposal offers a flexible alternative to model p populations, their model was not formulated as a hypothesis testing procedure.
The remainder of the manuscript is organized as follows: In Section 2, the multiple testing problem is formally introduced. In Section 3 we describe the proposed BNP model for hypothesis testing. In Section 4 the proposed model is illustrated using simulated data under different scenarios. This section also provides a Monte Carlo simulation study, where the proposal is compared with current parametric and nonparametric alternatives. Applications using real-life datasets are presented in Section 5. We conclude the paper in Section 6 with a discussion.

Bayesian testing background
As discussed in the introduction, our goal is to find the hypotheses in M that are best supported by the data y. From the Bayesian standpoint, this can be done by comparing the posterior probabilities of all models in M. These posterior probabilities are made up of two components: the Bayes factor (Jeffreys, 1935; Kass and Raftery, 1995) and the prior distribution over M (Jeffreys, 1961; Scott and Berger, 2006, 2010).
Under hypothesis H_γ ∈ M, the likelihood L(θ | y, H_γ) connects the data to the parameters θ given hypothesis H_γ. Denote by π(H_γ) the prior probability for hypothesis (model) H_γ, and let π(θ | H_γ) represent the prior density for θ under H_γ. Letting H_{0_p} represent the hypothesis that all populations are equal to the control, the posterior probability for any H_γ ∈ M can be expressed as

π(H_γ | y) = B_{γ,0_p}(y) π(H_γ) / Σ_{H_{γ'} ∈ M} B_{γ',0_p}(y) π(H_{γ'}),    (2)

where m(y | H_γ) = ∫ L(θ | y, H_γ) π(θ | H_γ) dθ is the marginal likelihood of y under hypothesis H_γ, obtained by integrating the model-specific parameters out of the likelihood, and B_{γ,0_p}(y) = m(y | H_γ) / m(y | H_{0_p}) is the ratio of marginal likelihoods, also known as the Bayes factor for hypothesis H_γ relative to H_{0_p}.
As shown in (2), the model posterior probabilities are specified by the Bayes factor (determined by the priors on the parameter space) and by the prior on the space of hypotheses. The Bayes factor controls for model complexity, whereas the prior over the space of hypotheses controls for multiple testing. In the remainder of this section we elaborate on these prior distributions, emphasizing the role that each of them plays in the Bayesian testing problem.
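In practice, once (log) marginal likelihoods are available, the posterior probabilities above are computed by normalizing the prior-weighted marginals. A small sketch with toy numbers (not values from the paper), using the log-sum-exp trick for numerical stability:

```python
import numpy as np

def posterior_probs(log_marginals, prior_probs):
    """Posterior model probabilities from log marginal likelihoods and
    prior model probabilities, normalized stably on the log scale."""
    log_post = np.log(prior_probs) + log_marginals
    log_post -= log_post.max()          # subtract the max before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

# toy values for 4 hypotheses (hypothetical numbers)
post = posterior_probs(np.array([-10.0, -12.0, -11.0, -15.0]),
                       np.array([0.4, 0.2, 0.2, 0.2]))
print(post.argmax())  # index of the best-supported hypothesis
```

Note that only ratios of marginal likelihoods (Bayes factors) matter: adding a constant to every log marginal leaves the posterior probabilities unchanged.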

The Bayes factor and the prior over the parameter space
The Bayes factor contains a penalty known as the Ockham's-razor effect (Jefferys and Berger, 1992). This type of penalization is commonly regarded in the Bayesian variable selection literature as an automatic penalty for model complexity. Although this has not been theoretically established in the nonparametric setting, empirical evidence for it is provided in Basu and Chib (2003). The Bayes factor (and therefore the prior on the parameters), however, does not modulate test multiplicity: the Bayes factor between two specific hypotheses (or models) is fixed, regardless of the size of the space of hypotheses (Scott and Berger, 2010).
The strength of the penalty built into the Bayes factor is driven by the prior distributions on the model parameters. A considerable body of literature is now available on the definition of these priors for Bayesian testing. In particular, large efforts have been devoted to the development of "non-informative" priors. Some notable examples are the spike and slab priors (Mitchell and Beauchamp, 1988) and variations of them (see, for example, George and McCulloch, 1993; Ishwaran and Rao, 2005; Ročková and George, 2014, 2016), g-priors and scale mixtures of g-priors (Zellner and Siow, 1980; Berger and Pericchi, 1996; Liang et al., 2008; Womack et al., 2014), and non-local priors (Johnson and Rossell, 2010, 2012; Altomare et al., 2013). Given the ease with which spike and slab priors can be adapted, their ability to yield selection and estimation simultaneously, and their time-tested performance, we adopt a strategy that involves spike and slab components for both location and scale. We provide a detailed development of this prior in Section 3.2.

Prior distribution over the space of hypotheses
Although the prior probabilities assigned to hypotheses can also penalize model complexity, Jeffreys (1961) recognized that they play an essential role in mediating test multiplicity. The choice of prior distribution on the space of hypotheses requires careful consideration, since seemingly innocuous alternatives can have undesirable consequences, and posterior inference is remarkably sensitive to this prior information for small and moderate sample sizes. For example, it has been common practice to assume equal prior probabilities for all hypotheses; however, this seemingly non-informative alternative favors models of a particular level of complexity, making the choice inadequate.
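A quick computation illustrates the point: a uniform prior over the 2^p hypotheses implies a Binomial(p, 1/2) prior on the number of active groups, so middle-sized complexity classes receive far more mass than the null (an illustrative sketch, not from the paper):

```python
from math import comb

p = 6
# Under a uniform prior over the 2^p hypotheses, the implied prior on the
# number of active groups l is Binomial(p, 1/2): mid-sized classes dominate.
mass_on_l = [comb(p, l) / 2**p for l in range(p + 1)]
print(mass_on_l[p // 2] / mass_on_l[0])  # class l = 3 gets 20x the mass of l = 0
```

Far from being non-informative, the uniform prior therefore expresses a strong preference for hypotheses of intermediate complexity.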
To define the priors on the space of hypotheses M, each hypothesis is associated with a vector γ = (γ_1, ..., γ_p) ∈ {0,1}^p. In Figure 1, these vectors are used to label each node, defining a one-to-one correspondence between a specific configuration of the γ vector and a particular hypothesis. As such, the distribution over M is set in terms of the γ's. In comparing a control group against p other populations, M is populated by the 2^p possible configurations of γ.
While the importance of these priors was first acknowledged around the 1960s, this literature has only recently gained traction. Some examples of priors of this form are the Beta-Binomial(a, b) constructions proposed in Ley and Steel (2009); Scott and Berger (2010); Wilson et al. (2010); Castillo et al. (2015), as well as the construction found in Womack et al. (2015). These priors assume that models of the same size (i.e., hypotheses of the same complexity) receive the same prior probability. Denoting a_γ = Σ_{k=1}^p γ_k, with a_γ = l for some l ∈ {0, 1, ..., p}, this assumption translates into priors for H_γ of the form

π(H_γ) = π_p(l) / C(p, l),

where π_p(l) = P(a_γ = l) is the probability for the entire class of hypotheses with exactly l groups differing from the control, and C(p, l) is the binomial coefficient, so that dividing π_p(l) by C(p, l) yields the probability of a single model in the class. As such, priors of this type penalize for the number of hypotheses within the class, as well as for the complexity of the models representing the hypotheses in the class.
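A sketch of this two-stage assignment follows; the class probabilities π_p(l) used below are placeholders, not the ones derived in the paper:

```python
from math import comb

def hypothesis_prior(gamma, pi_p):
    """Prior of a single hypothesis: the class probability pi_p(l) is split
    uniformly among the C(p, l) hypotheses with l active groups."""
    p, l = len(gamma), sum(gamma)
    return pi_p[l] / comb(p, l)

# hypothetical class probabilities pi_p(l) for p = 3 (they must sum to 1)
pi_p = [0.5, 0.25, 0.15, 0.10]
total = sum(hypothesis_prior(g, pi_p)
            for g in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
                      (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)])
print(round(total, 12))  # 1.0
```

Whatever probabilities π_p(l) are chosen for the classes, the within-class split guarantees a proper prior over the whole model space.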
The prior formulation proposed in Womack et al. (2015) is derived from first principles, has a meaningful and intuitive interpretation, and has been shown to provide suitable penalization for hypothesis complexity as well as test multiplicity. As such, we consider this prior formulation in our model, and defer to Section 3.3 for details on its construction.

Model definition
Let y_{k,i} be the ith observation in population k, with which we associate a predictor x_{k,i} indicating population membership, that is, x_{k,i} = k with k ∈ {c, 1, ..., p}. We propose the following model:

y_{k,i} | P ~ ∫ φ(y_{k,i} | μ, σ²) dP_{x_{k,i}}(μ, σ²),  independently,
P | H_γ ~ π_DDP(· | H_γ),
H_γ ~ π_M,    (3)

where P = {P_x : x ∈ {c, 1, ..., p}}, φ(· | μ, σ²) is the probability density function of a Gaussian distribution with parameters (μ, σ²), π_DDP(· | H_γ) is a prior induced by a dependent Dirichlet process (DDP; MacEachern, 1999, 2000) under the hypothesis H_γ, and π_M is a prior distribution defined on the model space M. We now describe how the priors π_DDP(· | H_γ) and π_M are defined.
The prior π_DDP(· | H_γ) is induced by a process P whose elements are defined as

P_x = Σ_{j=1}^∞ ω_j δ_{(μ_{x,j}, σ²_{x,j})},    (4)

where the weights ω_j are defined through the stick-breaking construction (Sethuraman, 1994), that is, ω_j = v_j Π_{l<j} (1 − v_l) with v_j iid ~ Beta(1, κ), and δ is the Dirac measure. To provide additional flexibility, we assume that κ ~ Gamma(a_1, a_2). Finally, the atoms are defined as

(μ_{x,j}, σ²_{x,j}) = (μ_{c,j} + η_{x,j}, τ_{x,j} σ²_{c,j}),    (5)

where x ∈ {c, 1, ..., p}. We set η_{c,j} = 0 and τ_{c,j} = 1, for all j, which is analogous to the standard reference-cell constraint. The random sequences {μ_{c,j}}_{j≥1} and {σ²_{c,j}}_{j≥1} are the locations and scales of the control population. Thus, possible differences in the kth population are captured by the changes in locations {η_{k,j}}_{j≥1} and scales {τ_{k,j}}_{j≥1} with respect to the control group. When η_{k,j} = 0 and τ_{k,j} = 1 for a particular k and every j ≥ 1, the kth and control populations are the same.
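The stick-breaking weights above are straightforward to simulate under a finite truncation (a sketch; the truncation level of 200 atoms is arbitrary here):

```python
import numpy as np

def stick_breaking(kappa, n_atoms, rng):
    """Truncated stick-breaking weights: v_j ~ Beta(1, kappa) and
    w_j = v_j * prod_{l<j} (1 - v_l), as in Sethuraman's construction."""
    v = rng.beta(1.0, kappa, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w

rng = np.random.default_rng(1)
w = stick_breaking(kappa=2.0, n_atoms=200, rng=rng)
print(w.min() >= 0.0, abs(w.sum() - 1.0) < 1e-6)  # a valid weight vector
```

Larger values of κ spread the mass over more atoms, which is why placing a Gamma prior on κ adds flexibility to the mixture.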
Model (3) belongs to the class of dependent stick-breaking processes with dependent atoms and common weights. This class enjoys appealing theoretical properties (see, for example, Barrientos et al., 2012; Pati et al., 2013). The full hierarchical model (3)-(5) bears some similarities to the family of DP mixture models proposed by Müller et al. (2004), who also propose a procedure for a finite number of populations that borrows strength across different but related mixture models. These mixture models relate to each other through their corresponding base measures, which are defined as a mixture of two measures: one idiosyncratic and another common to all populations. Under this strategy, while the atoms may come from the same distribution, they are not exactly the same. This in turn implies that even if two populations have atoms drawn from the same distribution, the mixing distributions for the two populations can differ. Conversely, in our proposal the atoms for two populations can take the same values, and hence the corresponding mixing distributions themselves can be the same. It is the similarity among these mixing distributions that enables us to make comparisons between the control and treatment populations.

Priors on the atom parameters
In order to test the hypotheses of interest, we require priors for (η_{k,j}, τ_{k,j}) that concentrate around 0 for η_{k,j} and around 1 for τ_{k,j} whenever G_c = G_k. Here, k = 1, ..., p denotes the kth population to be compared with the control group. This type of behavior can be induced using spike and slab priors (George and McCulloch, 1993; Ishwaran and Rao, 2005; Ročková and George, 2016) as the base measure of the Dirichlet process. Thereby, the definition of π_DDP(· | H_γ) is completed with the following prior specification:

η_{k,j} | γ_k ~ (1 − γ_k) N(0, ε) + γ_k N(0, s ε),    (6)
1/τ_{k,j} | γ_k ~ (1 − γ_k) G(· | b, b) + γ_k G(· | 1/(s ε), 1/(s ε)),    (7)

where (s, ε, b) are positive hyperparameters, and G(· | b_1, b_2) denotes the density function of the Gamma distribution with mean b_1/b_2 and variance b_1/b_2². Notice that ε and s control the variance of η_{k,j} and τ_{k,j}. As such, ε is assumed to be a small, near-zero constant and s is fixed at a relatively large value. This implies that whenever γ_k = 0 (i.e., the spike), the η_{k,j}'s will be close to 0 and the τ_{k,j}'s close to 1, so the atoms for the kth population defined in (5) will concentrate tightly about {μ_{c,j}, σ²_{c,j}}_{j≥1}. Conversely, if γ_k = 1, the set of parameters of the kth population is {μ_{c,j}, η_{k,j}, σ²_{c,j}, τ_{k,j}}_{j≥1}. This spike and slab formulation imposes the nesting structure among hypotheses in M mentioned in the introduction.
In all of our simulations and case studies, we fix the hyperparameters used in (6) and (7) at ε = 0.01, s = 1000, and b = 100. For these values, the variance of the slab component (γ_k = 1) for η_{k,j} is 10 and the variance of the spike is 0.01; in both cases the mean of η_{k,j} is 0. On the other hand, the spike component for the precision parameter 1/τ_{k,j} has mean 1 and variance 0.01, which implies that the prior induced on the variance parameter τ_{k,j} is highly concentrated around 1. The slab component, in turn, has mean 1 and variance 10, so in this case τ_{k,j} can be different from one. These values of the hyperparameters are chosen to calibrate the model for data standardized using the mean and standard deviation of all observations.
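As a rough numerical check of these claims, the following sketch draws from spike and slab components with the stated hyperparameter values. The specific distributional forms (normal spike/slab for η_{k,j}, Gamma components for 1/τ_{k,j} with the means and variances quoted above) are our assumption about the construction, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, s, b = 0.01, 1000.0, 100.0    # hyperparameter values quoted in the text

# Spike (gamma_k = 0): eta concentrates near 0 and 1/tau near 1, so the
# kth atoms stay close to the control atoms.
eta_spike = rng.normal(0.0, np.sqrt(eps), 10_000)       # variance eps = 0.01
inv_tau_spike = rng.gamma(b, 1.0 / b, 10_000)           # mean 1, variance 1/b

# Slab (gamma_k = 1): eta is diffuse with variance s * eps = 10.
eta_slab = rng.normal(0.0, np.sqrt(s * eps), 10_000)

print(eta_spike.std() < 0.2,
      abs(inv_tau_spike.mean() - 1.0) < 0.05,
      eta_slab.std() > 1.0)
```

The simulation confirms the calibration logic: spike draws keep the treatment atoms essentially indistinguishable from the control's, while slab draws allow substantial departures on the standardized scale.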

Priors on the space of hypotheses
As mentioned before, we consider the strategy formulated in Womack et al. (2015) to build the priors on the space of hypotheses. These priors control for test multiplicity as well as model complexity. Recalling that a_γ = Σ_{k=1}^p γ_k, this prior formulation assumes that models with equal a_γ have the same prior probabilities. The prior for the hypothesis that all populations are equal to the control is π(H_{0_p}) = ρ/(ρ + 1). The prior for an alternative hypothesis H_γ, where γ is a binary vector containing at least one element equal to one and l = a_γ, is obtained from a recursion over the classes of hypotheses; here ρ > 0 is a hyperparameter which fixes the relative odds of belief in a set of local alternatives versus a local null hypothesis.
The same prior structure holds for any hypothesis in M; it arises from treating each hypothesis as a local null with respect to the set of all hypotheses that nest it (see Womack et al., 2015, for more details on this prior construction). Given that we want to penalize more complex hypotheses, but not excessively so, we recommend ρ = 1 as a default and use this value in the examples and applications of Sections 4 and 5.

Posterior inference and analysis of differences
Posterior inference for model (3) is quite efficient using the slice sampler algorithm described in Kalli et al. (2011) and Walker (2007). This algorithm overcomes the infinite dimensionality inherent to the Dirichlet process by considering an augmented model and truncating the number of components in the mixture at a random variable N that takes on finite values drawn as part of the algorithm. Once N is sampled, the posterior inference in each step of the Gibbs algorithm reduces to sampling the locations, scales, and weights of a finite mixture model. The updating of the locations and scales is facilitated by the conjugacy in model (3). The weights are simply sampled from a Beta distribution. The posterior probability π(H_γ | y), for H_γ ∈ M, can be approximated with

π(H_γ | y) ≈ (1/T) Σ_{t=1}^T I(γ^{(t)} = γ),    (8)

where γ^{(t)} denotes the configuration sampled at iteration t of the Gibbs algorithm and T is the number of iterations.

For those groups for which the testing procedure yields evidence that G_k(·) ≠ G_c(·) (k ∈ {1, 2, ..., p}), the shift function can be used as a simple alternative to visualize the aspects in which their distributions differ from the control. Suppose that G_c(y) =_d G_k(y + Δ_k(y)), for all y, so that Δ_k can be interpreted as the amount needed to transform the distribution of the control group into that of the treatment. Doksum (1974) showed that the shift function, defined as

Δ_k(y) = G_k^{-1}(G_c(y)) − y,

is the unique function such that the equality in distribution holds.
Thus, Δ_k(·) characterizes how two independent distributions differ. In fact, if Δ_k(y) = 0 for all y, then the distributions are identical. If Δ_k(y) ≠ 0 for some y, the distributions are not equal and we can inspect the set {y : Δ_k(y) ≠ 0} to identify where they differ. Equivalently, if treatments are being compared to the control, the set {y : Δ_k(y) ≠ 0} gives information on which aspects of the distribution are being influenced by the treatment.
The computation of the shift function is straightforward in our algorithm because each step of the Gibbs sampler provides posterior random realizations of G_c and G_k, and therefore of

Δ_k(y) = G_k^{-1}(G_c(y)) − y,    (9)

evaluated at the current draws of the two distributions. The realizations in (9) can be used to compute the sample posterior mean Δ̄(y) as a point estimator of the shift function. Also, a 95% credible set (Δ_*(y), Δ^*(y)) can be estimated using the 2.5% and 97.5% percentiles of the random realizations of Δ_k(y). The values of y for which 0 ∉ (Δ_*(y), Δ^*(y)) can be used to determine the set {y : Δ_k(y) ≠ 0}.
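A minimal sketch of this estimation strategy follows, with empirical samples standing in for the posterior realizations of G_c and G_k (the true shift here is a constant +1 by construction; this is an illustration of the point estimate and band, not the paper's sampler):

```python
import numpy as np

def shift_curve(y_grid, control_draw, treat_draw):
    """One realization of Delta(y) = G_k^{-1}(G_c(y)) - y, computed from a
    single draw of the two distributions (represented here by samples)."""
    gc = np.searchsorted(np.sort(control_draw), y_grid, side="right")
    gc = gc / len(control_draw)                      # empirical CDF of control
    return np.quantile(treat_draw, np.clip(gc, 0.0, 1.0)) - y_grid

rng = np.random.default_rng(7)
y = np.linspace(-1.5, 1.5, 7)
# hypothetical posterior draws: treatment shifted by +1 relative to control
curves = np.array([shift_curve(y, rng.normal(0, 1, 2000),
                               rng.normal(1, 1, 2000)) for _ in range(200)])
delta_hat = curves.mean(axis=0)                        # posterior mean
lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)    # 95% point-wise band
print(bool(np.all((lo < 1.0) & (1.0 < hi))))  # band should cover the true shift
```

Grid points at which the band excludes zero are flagged as belonging to {y : Δ_k(y) ≠ 0}; here the band stays near +1 across the whole grid, as expected for a pure location shift.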

Illustrations with simulated data and a Monte Carlo study
In this section, we first illustrate the performance of the proposed BNP method using synthetic data. Then, we investigate the performance of the proposed BNP testing procedure through a Monte Carlo simulation study, with scenarios that focus on different features that may be of interest in this type of problem. The posterior inference obtained from our method is compared to that from other popular hypothesis testing alternatives.

Examples with synthetic data
Three examples are provided, each differing in terms of which populations differ from the control, as well as in the distributions used to generate the data (i.e., distributions that differ in location, scale, asymmetry, and number of modes across the populations).
In particular, for example 1 we consider a skew normal distribution with density given by

f(y | μ, τ, α) = (2/τ) φ((y − μ)/τ) Φ(α (y − μ)/τ),

where φ(·) and Φ(·) correspond to the density and the cumulative distribution function of a standard normal distribution. The parameters μ, τ, and α correspond to the location, scale, and shape of the distribution, respectively. We denote the skew normal distribution by SN(μ, τ, α). Notice that a skew normal distribution with α = 0 reduces to a normal distribution. In each example, we consider p = 3 populations and a control group with sample sizes n_c = n_k = 90, k ∈ {1, 2, 3}. A summary of the settings used for each example is given in Table 1. The proposed model in (3) was fitted to each of the three generated data sets. The values of the hyperparameters for the total mass parameter κ were fixed at a_1 = a_2 = 1, which is a relatively standard choice. We experimented with other choices for a_1 and a_2 (for example, a_1 = 5, a_2 = 1) and, in general, the posterior quantities were robust to those choices. The posterior value of κ was updated as in Escobar and West (1995); see details in the Supplementary Material. A graphical display of the posterior inference showing true versus estimated densities and shift functions for each example is provided in Figures 3 and 4. The figures also display the true shift function and its estimate given by the posterior mean, along with the corresponding credible sets. We also added a red dashed line representing Δ_k(·) = 0 as a reference for interpretation purposes. In particular, we display the posterior inference for cases where there are differences between the control and some population. As can be seen from these figures, the proposed model provides good estimates of the densities and the shift functions. In most cases, the estimates closely follow both the true densities and the true shift functions, and the true curves were completely covered by the 95% point-wise credible intervals.
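For reference, the SN(μ, τ, α) density can be evaluated directly; the sketch below (our helper, not the paper's code) checks that α = 0 recovers the standard normal density:

```python
import numpy as np
from math import erf, sqrt, pi

def skew_normal_pdf(y, mu, tau, alpha):
    """Density of SN(mu, tau, alpha): (2/tau) * phi(z) * Phi(alpha * z),
    with z = (y - mu) / tau, for scalar y."""
    z = (y - mu) / tau
    phi = np.exp(-0.5 * z**2) / sqrt(2 * pi)
    Phi = 0.5 * (1 + erf(alpha * z / sqrt(2)))
    return 2.0 / tau * phi * Phi

# alpha = 0 gives Phi(0) = 1/2, so the density reduces to the normal pdf
print(abs(skew_normal_pdf(0.3, 0.0, 1.0, 0.0) -
          np.exp(-0.5 * 0.3**2) / sqrt(2 * pi)) < 1e-12)
```

Positive α pushes mass to the right of the location μ and negative α to the left, which is the asymmetry feature the examples are designed to exercise.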
Regarding the posited hypotheses, for each example our testing procedure assigned posterior probability 1 to the true hypothesis. Thus, the method was able to accurately detect the true hypothesis, correctly estimating both the densities and the shift functions.

Monte Carlo simulation study
The 9 scenarios investigated are determined by the levels of two factors: the extent to which the tested populations differ (difference levels given by the parameters μ_l, σ²_l, and θ_l, with l = 1, 2, 3; see the details in Table 2) and the sample size (n = 50, 150, 300). In all the scenarios, the control population was assumed to be a standard normal distribution. Population 1 was generated from a normal distribution with variance equal to one and location denoted by μ_l. For population 2, we considered a normal distribution with mean zero and variance denoted by σ²_l. Finally, population 3 was generated from a 50-50 mixture of two normal distributions; each component of the mixture has variance equal to one, with locations −θ_l and θ_l, respectively. Specifications of the populations and the corresponding parameter values are shown in Table 2.

Table 2: Values of the parameters μ_l, σ_l, and θ_l, l ∈ {1, 2, 3}.
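The data-generating mechanism for one scenario can be sketched as follows (the parameter values passed below are placeholders for the μ_l, σ²_l, θ_l entries of Table 2):

```python
import numpy as np

def simulate_scenario(n, mu_l, sigma2_l, theta_l, rng):
    """Generate the control sample and the three treatment samples
    for one simulation scenario."""
    control = rng.normal(0.0, 1.0, n)                     # N(0, 1) control
    pop1 = rng.normal(mu_l, 1.0, n)                       # location shift
    pop2 = rng.normal(0.0, np.sqrt(sigma2_l), n)          # scale change
    comp = rng.integers(0, 2, n)                          # 50-50 mixture labels
    pop3 = rng.normal(np.where(comp == 0, -theta_l, theta_l), 1.0)
    return control, pop1, pop2, pop3

rng = np.random.default_rng(42)
c, p1, p2, p3 = simulate_scenario(300, 1.0, 4.0, 2.0, rng)
print(len(c), p2.std() > c.std())
```

Population 3 is the interesting case for location-based tests: its mixture is symmetric about zero, so its mean matches the control's even when the distributions are very different.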
The results of the Monte Carlo simulation study are summarized in Figure 4. Each of the 9 scenarios is represented by an image plot. The vertical axis in these plots represents the true models, while the horizontal axis represents the estimated models. Each of the 64 cells shows the average, over the 100 Monte Carlo replications, of the posterior probability π(H_γ | y). A probability value of 0 is represented by black and a value of 1 by white in the grayscale. Correctly identified hypotheses appear on the main diagonal. Figure 4 shows that as the value of l increases (i.e., as the difference between the treatments and the control becomes more evident), the average posterior probability π(H_γ | y) approaches 1 for the true model. The same behavior is observed as the sample size increases. We compared the performance of our model against some classical testing procedures. To this end, we select the maximum a posteriori hypothesis, given by

Ĥ = argmax_{H_γ ∈ M} π(H_γ | y).

For the classical tests, we consider a significance level of 0.05. In particular, we compared our proposal with the multiple hypothesis testing procedures of Dunnett (Dunnett and Tamhane, 1991), Nemenyi-Damico-Wolfe (Hollander and Wolfe, 1999), and Gao (Gao et al., 2008). We also used some two-sample testing procedures (Welch's t-test, Levene, Wilcoxon, and Kolmogorov-Smirnov), adjusting them for multiple comparisons with Bonferroni corrections. Table 3 provides a summary of the performance of each test. We report the number of times that the true hypotheses, as described in Table 2, were detected by each test. The multiple testing procedures of Dunnett, Nemenyi-Damico-Wolfe, and Gao were able to detect differences only in the locations of the distributions; that is, they detect reasonably well the hypothesis H_(1,0,0). The two-sample tests provide the expected results; that is, the t-test and the Wilcoxon test are able to detect differences in locations.
Their performance is better than that of our proposal in detecting this aspect of the distribution, especially in the scenarios with small sample sizes. The Levene test is able to detect scale differences (H_(0,1,0)). This test also detects the hypothesis H_(0,0,1), which could be thought of as a scale difference, although the difference is actually due to the mixture specification of population 3. The performance of the Levene test in the scenarios with small sample sizes was better than that of our proposal. The Kolmogorov-Smirnov test was able to detect differences across the entire distribution; however, the performance of our proposal was as good as or better than that of Kolmogorov-Smirnov in all the scenarios. Image plots showing the number of times that the eight considered tests selected each model are provided in Figures
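The Bonferroni-corrected two-sample comparisons used as competitors can be sketched as follows (a minimal illustration with synthetic data; the function name is ours and this is not the authors' implementation):

```python
import numpy as np
from scipy import stats

def bonferroni_two_sample(control, treatments, alpha=0.05):
    """Compare each treatment against the control with Welch's t, Levene,
    and Kolmogorov-Smirnov tests, rejecting at level alpha / p (Bonferroni
    across the p treatment-versus-control comparisons)."""
    p = len(treatments)
    results = []
    for y in treatments:
        pvals = {
            "welch": stats.ttest_ind(y, control, equal_var=False).pvalue,
            "levene": stats.levene(y, control).pvalue,
            "ks": stats.ks_2samp(y, control).pvalue,
        }
        results.append({name: pv < alpha / p for name, pv in pvals.items()})
    return results

rng = np.random.default_rng(0)
control = rng.normal(0, 1, 300)
res = bonferroni_two_sample(control, [rng.normal(1, 1, 300),   # location shift
                                      rng.normal(0, 2, 300)])  # scale change
print(res[0]["welch"], res[1]["levene"])
```

As the simulation study suggests, each classical test is sensitive to one feature (location for Welch's t, scale for Levene), while only omnibus tests such as Kolmogorov-Smirnov respond to differences elsewhere in the distribution.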

Educational achievement by school type
We implement our testing procedure to assess whether student educational achievement differs depending on the type of school they attend. In this context, it has been observed that, in some countries, school type is a good proxy for students' socioeconomic status. It is widely evidenced by international studies that educational achievement is strongly correlated with the socioeconomic background of students (e.g., OECD, 2016). In Chile, this phenomenon has been present not only in international student assessments, such as the Programme for International Student Assessment (PISA), but also in national evaluations. The data considered stems from the System of Assessment Progress in Achievement (SEPA, by its Spanish acronym), a private national evaluation system in Chile. SEPA consists of a battery of tests in the subjects of Mathematics and Language, designed to assess achievement in students from first to eleventh grade. In each application, besides students' test scores, additional information such as school type (e.g., municipal, subsidized, private) is also available in the database. For this illustration, we consider the abilities on the mathematics test for a total of n = 1,130 students attending three different types of schools, distributed in the following way: n_Mun = 324, n_Sub = 640, n_Pri = 166, where n_Mun, n_Sub, and n_Pri are the sample sizes for the municipal, subsidized, and private schools, respectively. The ability latent variables were predicted using a two-parameter logistic item response theory (2PL) model (De Boeck and Wilson, 2004; van der Linden, 2016), with values defined on the real line.
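For reference, the 2PL item response function underlying these ability estimates can be sketched as follows (a generic illustration of the model, not the estimation code used for the SEPA data):

```python
import numpy as np

def two_pl(theta, a, b):
    """2PL item response function: probability of a correct response for
    ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

print(two_pl(0.0, 1.0, 0.0))  # 0.5 when ability equals the item difficulty
```

Fitting a 2PL model to the full response matrix yields, for each student, a latent ability score on the real line; these scores are the continuous responses compared across school types by our procedure.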
Our aim here is to test whether the ability distribution of the municipal schools (selected as the control group) differs from those of the other school types. This selection is motivated by the interest in knowing how different the achievements in private schools are compared to the municipal ones, which constitute the public system. We fitted our model using the same hyperparameters as in Sections 3.2 and 4.1. Figure 6 shows the estimated ability distributions for each school type (dashed lines), 95% point-wise credible intervals, and the estimated shift functions used to compare the ability distributions of subsidized and private schools against municipal ones. Visually, the ability distributions appear to differ from each other. Taking municipal schools as the control group, this finding is formally confirmed by our model, which supports the hypothesis H_11; equivalently, all posterior mass concentrates on it, π(H_11 | y) = 1. The shift functions illustrate what was formally found using our BNP multiple testing procedure: none of the curves lies on the zero line, meaning that the ability distributions differ across school types. It can also be seen that the magnitude of the differences depends on the region of the support considered. For instance, for students with low abilities there is some gain in changing school type, and this gain becomes more evident at higher abilities. We also applied all the existing methods listed in Table 3. These methods found differences in location or scale between the ability distributions, but, as can be seen in Figure 6, the differences are due to different kinds of skewness: the municipal schools exhibit a positively skewed distribution, while the private schools exhibit a negatively skewed one.
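The shift function admits a simple empirical approximation from two samples. As a sketch (our own code, assuming the Doksum-type definition Δ(x) = G⁻¹(F(x)) − x, with F the control CDF and G the comparison-group CDF; function and variable names are ours):

```python
import numpy as np

def empirical_shift(control, comparison, grid):
    """Empirical shift function Delta(x) = G^{-1}(F(x)) - x:
    how far a control observation at x must be moved so that the
    shifted control distribution matches the comparison distribution."""
    control = np.sort(np.asarray(control))
    n = len(control)
    # Empirical CDF of the control sample evaluated on the grid.
    F = np.searchsorted(control, grid, side="right") / n
    # Empirical quantile function of the comparison sample at those probabilities.
    Ginv = np.quantile(comparison, np.clip(F, 0.0, 1.0))
    return Ginv - grid

rng = np.random.default_rng(1)
c = rng.normal(0.0, 1.0, size=5000)
t = rng.normal(0.5, 1.0, size=5000)   # pure location shift of +0.5
grid = np.linspace(-1.5, 1.5, 7)
delta = empirical_shift(c, t, grid)
# For a pure location shift, Delta(x) is roughly constant near +0.5;
# for skewness differences (as in this application) it varies over the support.
```

In our procedure the shift function is instead computed from the posterior draws of the densities, which is what allows the credible bands shown in Figure 6; the empirical version above is only meant to convey the interpretation of the curves.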

Phenology study
A consequence of climate change of increasing concern to ecologists is the decoupling of species interactions due to drastic changes in the timing of life cycle events. For example, large variations in the dates in which plants flower can have drastic impacts on species that depend upon them. The study of the timing of these events and how it is affected by variations in climate is called phenology.
Here we make use of historical data from a network of institutions that collected phenological data through participatory science methods between 1825 and 1878 in the state of New York, and compare it to a modern dataset covering 2010 to 2015 in the same region. We restrict our analysis to a single phenological event across several species in the region, namely the day of first flowering. Taking the period 1830-1840 as the control group (n_c = 1,703), we compare the first flowering dates between the control and two other time periods: 1850-1860 (n_1 = 2,449) and 2010-2015 (n_2 = 2,706). The response variable is the (centered and scaled) day of the year, whose observed values were assumed to be iid conditionally on the time period. The goal is to determine if and how the distribution of first flowering has varied through time, which motivated the selection of the earliest period of observation as the control group. However, we must disclose that this example is intended as a means to illustrate the capabilities of the proposed methods, rather than a rigorous attempt to determine the impact of climate change.
The results derived from our analysis indicate that the density of first flowering during the control period markedly differs from both the density of the 1850-1860 decade and that from modern times, with the posterior probability entirely concentrated on this hypothesis, π(H_11 | y) = 1. That being said, the regions of the support where these differences take place differ drastically between the two comparison groups (Figure 7). In particular, the results show that, when compared to the control group, the dates of first flowering occurred later in the year during the 1850-1860 period. In the 2010-2015 period, more mass is assigned to both earlier and later flowering dates than in the control period, a behavior clearly visualized through the estimated shift function. Notice that the shift function allows us to identify which parts of the population experience a treatment effect. In contrast, as expected given the large sample sizes in this illustration, the existing methods considered in the Monte Carlo simulation section detected differences in location or scale; in this study, however, the differences are not only due to location or scale but arise mainly in the tails of the densities. Additionally, the existing methods do not provide a smooth estimate of the densities. Of course, simpler ways to estimate the density, such as kernel estimators, are available, but these estimators are strongly affected by the choice of bandwidth, and quantifying their uncertainty is not straightforward. In our proposal, the credible sets for the densities and the shift function quantify the uncertainty and allow us to visualize the differences.
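The bandwidth sensitivity of kernel estimators mentioned above is easy to reproduce. A minimal sketch (our own illustration using SciPy's Gaussian KDE with two arbitrary bandwidth factors, not an analysis of the phenology data):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Bimodal sample: a mixture of two well-separated normals.
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])

grid = np.linspace(-4.0, 4.0, 401)
kde_narrow = gaussian_kde(x, bw_method=0.1)  # small bandwidth factor
kde_wide = gaussian_kde(x, bw_method=1.0)    # large bandwidth factor

f_narrow = kde_narrow(grid)
f_wide = kde_wide(grid)

# The narrow-bandwidth estimate preserves the dip between the two modes,
# while the wide one oversmooths it away, so the estimates disagree
# sharply at the center of the support (grid point x = 0).
```

The two estimates tell qualitatively different stories about the same data, which is precisely the ambiguity the model-based density estimates avoid.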

Concluding remarks
We proposed a formal Bayesian hypothesis testing procedure to compare multiple treatments against a control. The methods developed are applicable to a wide variety of problems, and improve upon existing methods that test against a control group. The procedure avoids strong modeling assumptions over the distributions of each population, and is able to identify differences with respect to the entire distribution of the control. Additionally, we provide a simple approach to visualize the differences detected by the procedure between pairs of distributions by using the shift function.
The proposed method accounts for the multiplicity of the testing problem through the prior on the space of hypotheses. The comparisons are relatively simple due to the nesting structure of the proposed model (see Remark 1). The nesting structure is facilitated by considering common weights in the mixture model. Extensions of the model to dependent weights are possible, provided that a nested structure for the weights can also be specified. More flexibility could be added by considering, for example, a skew-normal distribution for the kernel in (3); see, e.g., Canale and Scarpa (2016). An extension to multivariate responses could also be feasible if an adequate parametrization can be found for a multivariate normal kernel.
As shown in Section 4, the performance of our approach proved to be consistently good, in most cases outperforming the other alternatives considered in the Monte Carlo simulation study. Unsurprisingly, classical multiple testing procedures were only able to detect differences in location. Similarly, the two-sample testing procedures exclusively detected the features for which they were designed: the t-test and Wilcoxon excelled at detecting differences in location, Levene successfully detected changes in scale, and Kolmogorov-Smirnov identified differences across the entire distribution. The conclusions derived from the Kolmogorov-Smirnov test are similar to those resulting from our method; however, our method has the advantage of being a multiple comparison procedure that yields density estimates for all populations, from which the shift function can be calculated. Of course, the proposed approach is not designed to deal with hundreds or thousands of comparisons, given the computational costs associated with it. The proposed strategy relaxes parametric assumptions, provides estimates of the strength of the competing hypotheses in the form of posterior probabilities, and has the potential to yield new insight through the use of the shift function.
For problems with small sample sizes, specific location or scale tests are preferred, given that they target specific features of the populations. Nevertheless, many current applications have sufficiently large samples to make our approach feasible. Furthermore, Bayesian hypothesis testing procedures yield the wealth of information contained in the posterior probabilities, which can be combined with a loss function to make decisions, as with classical tests.
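As a sketch of this last point (with hypothetical posterior probabilities and a generic 0-1 loss matrix, not values from the paper), choosing the hypothesis that minimizes posterior expected loss reduces, under 0-1 loss, to reporting the posterior mode:

```python
import numpy as np

# Hypothetical posterior probabilities over the hypothesis space for p = 2.
hypotheses = ["H00", "H01", "H10", "H11"]
post = np.array([0.05, 0.10, 0.15, 0.70])

# Loss matrix L[i, j]: loss of choosing hypothesis j when i is true.
# 0-1 loss: 0 on the diagonal, 1 elsewhere.
loss = 1.0 - np.eye(4)

expected_loss = post @ loss  # posterior expected loss of each possible choice
decision = hypotheses[int(np.argmin(expected_loss))]
# Under 0-1 loss, expected_loss[j] = 1 - post[j], so the decision is the
# hypothesis with the highest posterior probability ("H11" here).
```

Other loss matrices (e.g., penalizing falsely declaring a treatment effect more heavily) would simply tilt the decision away from the posterior mode, which is one practical advantage of reporting the full set of posterior probabilities.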

Supplementary Material
Supplementary Material for 'A Bayesian nonparametric multiple testing procedure for comparing several treatments against a control' (DOI: 10.1214/18-BA1122SUPP; .pdf). The online Supplementary Material contains the Gibbs Algorithm described in Section 3.4, as well as the image plots of the comparison between our proposal and other classical hypothesis tests (Section 4.2), including both multiple and two-sample cases.