Scalable Bayesian nonparametric measures for exploring pairwise dependence via Dirichlet Process Mixtures

In this article we propose novel Bayesian nonparametric methods using Dirichlet Process Mixture (DPM) models for detecting pairwise dependence between random variables while accounting for uncertainty in the form of the underlying distributions. A key criterion is that the procedures should scale to large data sets. In this regard we find that the formal calculation of the Bayes factor for a dependent-vs.-independent DPM joint probability measure is not computationally feasible. To address this we present Bayesian diagnostic measures for characterising evidence against a "null model" of pairwise independence. In simulation studies, as well as in a real data analysis, we show that our approach provides a useful tool for the exploratory nonparametric Bayesian analysis of large multivariate data sets.


Introduction
Identifying dependencies among pairs of random variables measured on the same sample, producing datasets of the form D = {(x_i, y_i), i = 1, . . . , n}, is an important task in modern exploratory data analysis, where historically the Pearson correlation coefficient and Spearman's rank correlation have been used. More recently there has been a move towards non-linear or distribution-free methods such as those based on Mutual Information (MI) [Cover and Thomas, 2012, Kinney and Atwal, 2014]. In this paper we present Bayesian nonparametric methods for screening large data sets for possible pairwise associations (dependencies). Having an explicit probability measure of dependence has numerous advantages, both in terms of interpretability and for integration across different experimental conditions and/or within a formal decision theoretic analysis. As data sets become ever larger and more complex we increasingly require Bayesian procedures that can scale to modern applications, and this will be a key design criterion here. The main building block of our procedures will be the Dirichlet Process Mixture (DPM) model, which is the most popular Bayesian nonparametric model.
We frame the problem of screening for evidence of pairwise dependence as a nonparametric model choice problem with alternatives:

M_0: X and Y are independent random variables;
M_1: X and Y are dependent random variables. (1)

Given a set of measurement pairs D, for n exchangeable observations one could then evaluate the posterior probability for the competing models, P(M_1 | D) = 1 − P(M_0 | D), or consider the Bayes factor P(D | M_0)/P(D | M_1), which measures the strength of evidence for independence between the two samples against dependence. However, with p measurement variables under study there are approximately p^2/2 such pairwise Bayes factors to compute, where even a single evaluation might be problematic. This motivates us to explore scalable alternatives to a formal Bayesian testing approach, by deriving summary statistics and functionals of the posterior that can provide strong indication in favour of or against independence.
Bayesian nonparametric hypothesis testing via Pólya tree priors has been the focus of a couple of recent research papers [Holmes et al., 2015, Filippi and Holmes, 2017]. Here, however, we specify model uncertainty in the distribution of X and Y via DPMs of Gaussians.
This provides flexibility while also encompassing smoothness assumptions on the underlying joint distributions. Another advantage is that DPMs have been widely studied in the Bayesian nonparametric literature, with excellent open source implementation packages available [e.g. Jara et al., 2011]. Moreover, although not explored here, the use of DPMs makes our approach readily extendable to situations when X and Y are themselves collections of multivariate measurements. Here we consider pairwise dependence between univariate measurements, where under M_0 (independence) the joint distribution factorises into a product of two univariate DPMs on X and Y, while under M_1 we can define a joint DPM model on the bivariate measurement space (X, Y).
In theory, given a DPM prior on the unknown densities, the Bayes factor can be calculated via the marginal likelihood. However this requires integrating over an infinite dimensional parameter space that does not have a tractable form. Moreover, using computational approaches to approximate the marginal likelihood is highly non-trivial, particularly when considering the need to scale to many thousands of comparisons with large p. To overcome this issue we present two new approaches to deriving scalable diagnostic measures corresponding to probabilistic measures of dependence, bypassing the need to calculate Bayes factors that might not be feasible or desirable. Our methods are motivated by two recent proposals in the literature [Lock and Dunson, 2013, Kamary et al., 2014], although neither of these papers considers the problem we address here, as outlined below.
Our first approach utilises the well-known latent allocation, or clustering, structure of the DPM model to induce a partition of the two-dimensional data space. By running a Gibbs sampler under the independence model, the cluster allocation of observations to specific mixture components at each iteration can be used to define a latent contingency table given by the mixture component memberships. For each of these contingency tables we perform a parametric Bayesian independence-vs.-dependence test using conjugate multinomial-Dirichlet priors, which lead to explicit analytic forms for the conditional marginal likelihoods. This proposal follows a similar idea considered by Lock and Dunson [2013], who studied the two-sample testing problem. A key difference in what we present here, beyond the fact that we address pairwise dependence, is that Lock and Dunson [2013] use a finite mixture model to induce a partition, instead of the infinite nonparametric mixture model used here.
In our second approach, we adapt a recent procedure of Kamary et al. [2014], turning the model choice problem into an estimation problem by writing the competing models under a hierarchy that incorporates both models, M* = πM_1 + (1 − π)M_0. We investigate the specification of M* either as a mixture model with mixing component 0 ≤ π ≤ 1, or as a predictive linear ensemble of the two sub-models with constraints on the weights. We then estimate π, which becomes a measure of the evidence for dependence. DPMs are used to obtain the likelihood associated with each of the competing models in M*, requiring a separate MCMC run for each potential pair of random variables.
We compare and contrast the two procedures with particular regard to their scalability to large data sets. This latter feature naturally includes the amenability of the methods to simulation with modern parallel computation. We demonstrate that our association measures are scalable and successfully detect some highly non-linear dependencies with performance equivalent to the current best conventional methods using mutual information, with the added advantages that fully probabilistic Bayesian methods enjoy. As mentioned above, some of these key advantages include the ability to integrate results within a formal decision analysis framework, or within optimal experimental design, and the combination of results with other sources of information, or across studies such as arise in meta-analysis.
The rest of the paper is organised as follows. In Section 2 we review the Dirichlet process and the DPM of Gaussians. In Section 3 we describe the two approaches to quantifying the evidence for dependence using Dirichlet Process Mixtures. In Section 4 we illustrate our approach on the exploratory analysis of a real-world example from the World Health Organisation data set of country statistics, and also on simulated data generated from simple models. We conclude the paper with a short discussion in Section 5.

Dirichlet Process Mixtures
The Dirichlet process [Ferguson, 1973] is the most important process prior in Bayesian nonparametric statistics. It is flexible enough to approximate (in the sense of weak convergence) any probability law, although the paths of the process are almost surely discrete [Blackwell and MacQueen, 1973]. Many years ago this discreteness was considered a drawback, but nowadays it is simply a feature that characterises the Dirichlet process, and one that has been heavily exploited in clustering applications (e.g. [Dahl, 2006]).
The Dirichlet process is defined as follows. Let G_0 be a probability measure defined on (X, B), where X ⊂ R^p and B is the corresponding Borel σ-algebra, and let G be a stochastic process indexed by the elements of B. Then G is a Dirichlet process with parameters c and G_0 if, for every measurable partition (B_1, . . . , B_k) of X, (G(B_1), . . . , G(B_k)) ∼ Dir(cG_0(B_1), . . . , cG_0(B_k)).
From here we can see that, for every B ∈ B, E{G(B)} = G_0(B) and Var{G(B)} = G_0(B){1 − G_0(B)}/(c + 1). Therefore the parameter c is known as the precision parameter and G_0 as the centring measure.
The Dirichlet process, when used as a prior, induces exchangeability in the data. In notation, let X_1, . . . , X_n be a sample of random variables such that, conditional on G, X_i | G ∼ G independently. If we further take G ∼ DP(c, G_0), then the marginal distribution of the data (X_1, . . . , X_n), once the process G has been integrated out, is characterised by what is known as the Pólya urn [Blackwell and MacQueen, 1973]: we start with X_1 ∼ G_0, and then

X_n | X_1, . . . , X_{n−1} ∼ (c G_0 + Σ_{j=1}^{n−1} δ_{X_j}) / (c + n − 1).
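The Pólya urn scheme above is easy to simulate directly. The following sketch (our own illustration, not code from the paper; the function names are ours) draws n values from the urn with a standard normal centring measure, showing the ties that the almost-sure discreteness of the DP induces:

```python
import numpy as np

def polya_urn_sample(n, c, base_sampler, rng=None):
    """Draw X_1, ..., X_n from the Polya urn scheme of a DP(c, G0).

    X_1 ~ G0; thereafter X_n is a fresh draw from G0 with probability
    c / (c + n - 1), or a copy of a uniformly chosen previous draw.
    """
    rng = np.random.default_rng(rng)
    draws = [base_sampler(rng)]
    for i in range(1, n):
        if rng.random() < c / (c + i):
            draws.append(base_sampler(rng))       # new value from G0
        else:
            draws.append(draws[rng.integers(i)])  # tie with an old value
    return np.array(draws)

# Standard normal centring measure G0
xs = polya_urn_sample(500, c=2.0, base_sampler=lambda r: r.normal(), rng=0)
print(len(np.unique(xs)))  # far fewer distinct values than 500: ties occur
```

Note the draws are exchangeable even though the sequential construction appears order-dependent.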
Instead of placing the Dirichlet process prior directly on the law of the observable data, it can be used as the law of the parameters of another model (kernel) that generates the data. In notation, let us assume that for each i = 1, . . . , n, X_i | θ_i ∼ f(x | θ_i), with f a parametric density function. We can further take θ_i | G ∼ G and G ∼ DP(c, G_0). (2) This hierarchical specification can be seen as a mixture of density kernels f(x | θ) with mixing distribution coming from a Dirichlet process, i.e., f(x) = ∫ f(x | θ) G(dθ). This model is known as the Dirichlet process mixture (DPM) model and was first introduced by Lo [1984] in the context of density estimation and written in hierarchical form by Ferguson [1983].
The most typical choice of kernel f is the (multivariate) normal, in which case θ_i = (µ_i, σ_i^2), with scalar mean and variance, in the univariate case, and θ_i = (µ_i, Σ_i), with mean vector and variance-covariance matrix, in the multivariate case. We will work with this specific kernel throughout this paper.
As can be seen from construction (2), in the mixture case the Dirichlet process induces a joint distribution on the set (θ_1, . . . , θ_n) that allows for ties among the θ_i's. This in turn induces a clustering structure on the θ_i's (and the X_i's). Posterior inference for the DPM model usually relies on a Gibbs sampler [Smith and Roberts, 1993]. At each iteration of the Gibbs sampler the model produces a different clustering structure. The number of clusters is a function of the sample size n and the precision parameter c of the underlying Dirichlet process: the larger the value of c, the larger the number of clusters induced. This clustering structure and the parameter c will play a central role in one of the independence test procedures described later.
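The dependence of the cluster count on c can be made precise: from the urn, the i-th draw starts a new cluster with probability c/(c + i − 1), so the expected number of clusters among n draws is the sum of these probabilities, growing roughly like c log(1 + n/c). A quick illustrative sketch (ours, not from the paper):

```python
import numpy as np

def expected_num_clusters(n, c):
    """Exact expected number of distinct values (clusters) among n draws
    from a DP(c, G0) with non-atomic G0: sum over i of c / (c + i)."""
    i = np.arange(n)
    return float(np.sum(c / (c + i)))

for c in (0.5, 2.0, 10.0):
    print(c, round(expected_num_clusters(1000, c), 1))
# larger c gives more clusters, growing roughly like c * log(1 + n/c)
```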

Two approaches for measuring dependence
As noted in Section 1, the calculation or approximation of the formal Bayes factor under M_0 and M_1 is not feasible when considering a large number of model comparisons. Indeed it may not even be desirable, given that our objective is to highlight potential departures from independence rather than answer a formal model choice question. In this section we describe two distinct approaches, based on DPM models, for comparing models M_0 and M_1 defined in (1) that are computable and scalable to large data.

Contingency tables approach
The first approach is motivated by the paper of Lock and Dunson [2013]. To begin, assume that the data are marginally clustered into K_X and K_Y clusters, and denote by ξ_{X,i} ∈ {1, . . . , K_X} and ξ_{Y,i} ∈ {1, . . . , K_Y} the cluster indicators for the data points x_i and y_i respectively, for i = 1, . . . , n. Using these cluster indicators, we can construct a contingency table M_{ξX,ξY} with entries m_kl = #{i : ξ_{X,i} = k, ξ_{Y,i} = l}, for k = 1, . . . , K_X and l = 1, . . . , K_Y. The contingency table M_{ξX,ξY} represents a discretised version of the (unnormalised) marginals and joint distribution of the continuous vector (X, Y).
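Building the latent table from one Gibbs iteration's cluster indicators is a simple cross-tabulation. An illustrative sketch (our own code, using 0-based indicators rather than the paper's 1, . . . , K convention):

```python
import numpy as np

def latent_contingency_table(xi_x, xi_y):
    """Cross-tabulate the cluster indicators of one Gibbs iteration:
    m_kl = #{i : xi_x[i] = k, xi_y[i] = l} (0-based cluster labels)."""
    xi_x, xi_y = np.asarray(xi_x), np.asarray(xi_y)
    k_x, k_y = xi_x.max() + 1, xi_y.max() + 1
    table = np.zeros((k_x, k_y), dtype=int)
    np.add.at(table, (xi_x, xi_y), 1)  # accumulate one count per data point
    return table

# Indicators for n = 6 points, with K_X = 2 and K_Y = 3 marginal clusters
m = latent_contingency_table([0, 0, 1, 1, 1, 0], [0, 1, 2, 2, 1, 0])
print(m)        # row sums give m_k., column sums give m_.l
print(m.sum())  # 6, one count per observation
```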
We can then apply Bayesian independence tests for discrete/categorical variables, following Gunel and Dickey [1974] and Good and Crook [1987], who proposed a conjugate multinomial-Dirichlet independence test, described as follows.
Under the dependence model M_1, the cell counts M_{ξX,ξY} = {m_kl} follow a multinomial distribution with probability vector p = {p_kl}, to which we assign a conjugate Dirichlet prior, p ∼ Dir({α_kl}). Under the independence model M_0, the observed counts M_{ξX,ξY} can be expressed in terms of the marginal counts m_X = {m_k·} and m_Y = {m_·l}, whose implied distributions are again multinomial with probability vectors p_X = {p_k·} and p_Y = {p_·l}, respectively, with p_k· = Σ_l p_kl and p_·l = Σ_k p_kl. The induced prior distributions are also Dirichlet, p_X ∼ Dir({α_k·}) and p_Y ∼ Dir({α_·l}), where α_k· = Σ_l α_kl and α_·l = Σ_k α_kl. Integrating out the cell probabilities yields closed-form marginal likelihoods, (3) for M_1 and (4) for M_0, given by ratios of Gamma functions.
To compare the evidence in favour of each model, we use expressions (3) and (4) to compute the Bayes factor BF_01 = P(m_X, m_Y | M_0) / P(M_{ξX,ξY} | M_1). (5) Using equal prior probabilities for both models, i.e. P(M_0) = P(M_1) = 0.5, we obtain the posterior probabilities P(M_0 | M_{ξX,ξY}) = BF_01 / (1 + BF_01) and P(M_1 | M_{ξX,ξY}) = 1 / (1 + BF_01).
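A minimal sketch of this conjugate test (our own illustrative implementation, not the authors' code) computes both marginal likelihoods on the log scale as ratios of Gamma functions; the multinomial coefficients cancel in the Bayes factor, so they are omitted throughout:

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha):
    """Log integral of prod_j p_j^{counts_j} against a Dir(alpha) prior
    (multinomial coefficient omitted; it cancels in the Bayes factor)."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def prob_dependence(table, a=0.5):
    """Posterior P(M1 | table) with equal prior model probabilities.

    M1: joint cell probabilities p_kl ~ Dir(a) for every cell.
    M0: independent row and column probabilities, with the induced
        Dirichlet parameters given by the row/column sums of alpha.
    """
    table = np.asarray(table, float)
    alpha = np.full(table.shape, a)
    log_m1 = log_dirichlet_multinomial(table.ravel(), alpha.ravel())
    log_m0 = (log_dirichlet_multinomial(table.sum(axis=1), alpha.sum(axis=1))
              + log_dirichlet_multinomial(table.sum(axis=0), alpha.sum(axis=0)))
    log_bf01 = log_m0 - log_m1
    return float(1.0 / (1.0 + np.exp(log_bf01)))

dependent = np.array([[40, 2], [3, 55]])      # counts concentrated diagonally
independent = np.array([[25, 25], [25, 25]])  # perfectly balanced counts
print(prob_dependence(dependent), prob_dependence(independent))
```

For the strongly diagonal table the posterior probability of dependence is essentially 1, while the balanced table favours M_0 via the Occam factor of the smaller model.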
It should also be noted that this contingency table approach would also afford a conditional frequentist test, for example Pearson's chi-squared test of independence [Pearson, 1922]. Denote by m_k· = Σ_l m_kl and m_·l = Σ_k m_kl the number of individuals classified in cluster k of X and cluster l of Y, respectively. Then the well-known test statistic is

T = Σ_k Σ_l (m_kl − m_k· m_·l / n)^2 / (m_k· m_·l / n).

Under the null hypothesis M_0 of independence, the statistic T asymptotically follows a χ^2 distribution with (K_X − 1)(K_Y − 1) degrees of freedom. If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis M_0 in favour of the dependence hypothesis M_1.
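The conditional frequentist alternative on the same latent table takes only a few lines. An illustrative sketch (assuming SciPy for the chi-squared tail probability):

```python
import numpy as np
from scipy.stats import chi2

def chi2_independence(table):
    """Pearson chi-squared test of independence for a K_X x K_Y table.

    Returns (T, p_value); under the null M0, T is asymptotically chi-squared
    with (K_X - 1)(K_Y - 1) degrees of freedom.
    """
    table = np.asarray(table, float)
    n = table.sum()
    # expected counts under independence: m_k. * m_.l / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    T = float(np.sum((table - expected) ** 2 / expected))
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return T, float(chi2.sf(T, df))

T, p = chi2_independence([[40, 2], [3, 55]])
print(T, p)  # large T, tiny p-value: reject independence
```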
The hypothesis testing approach described in this section assumes that the data are marginally clustered. However, these clusters are not known a priori. A Bayesian approach for data clustering is to define a prior distribution over the clusterings and then update it based on the evidence provided by the data. Here we make use of the DPM model structure to create an empirical partition of the two-dimensional data space, taking into account the uncertainty in the allocation process. More precisely, we consider two independent DPM prior models for the marginal densities,

f_{0,X}(x) = ∫ N(x | µ, σ^2) G_X(dµ, dσ^2), with G_X ∼ DP(c_0, G_0), (7)
f_{0,Y}(y) = ∫ N(y | µ, σ^2) G_Y(dµ, dσ^2), with G_Y ∼ DP(c_0, G_0), (8)

where G_0 is a conjugate normal-inverse-gamma centring measure. The latent clustering structure induced by the DPM models defined by (7) and (8) can then be used to construct a contingency table as described above. Note that in an ideal world one would carefully specify subjective beliefs on the prior marginals for X and Y. However, when the number of variables is large this is not feasible, and we require some default specification as done here, by assuming a common prior after suitable transformation of the data.
Although it is clear from the properties of the DP that it induces a partition, in practice it is not easy to determine an optimal one. Fitting a DPM model via a Gibbs sampler provides a partition at each iteration, and we can proceed in two different ways. One is to use all potential partitions coming from the MCMC, and for each of them perform the Bayesian independence test and report the expected posterior probability. More precisely, the functional we consider is

p_dep = E{P(M_1 | M_{ξX,ξY}) | D} ≈ (1/N_it) Σ_{j=1}^{N_it} P(M_1 | M_{ξX^(j),ξY^(j)}), (9)

where ξ^(j) denotes the partition obtained at the j-th Gibbs iteration. This is the procedure we recommend and develop below. An alternative would be to select a single partition using an appropriate optimisation criterion, for example that of Dahl [2006], who proposes choosing the partition that minimises the squared deviations from the average pairwise clustering matrix, and to use that single partition to perform the test, ignoring the uncertainty in the partition structure, as in Lock and Dunson [2013] for the two-sample test. In the Supplementary Material we provide an empirical comparison between the two procedures.
In the rest of the paper we focus on the first alternative, which considers all potential partitions; we refer to this procedure as CT-BF.

Mixture model predictive approach
In this section we consider an alternative approach for testing between the hypotheses in (1). Motivated by Kamary et al. [2014], we replace the testing problem with an estimation one by defining a predictive ensemble model M* whose components are the competing models M_0 and M_1. To be precise, let f_0 and f_1 denote the densities of (X, Y) defined by models M_0 and M_1, respectively. We then define a predictive mixture model as the linear combination

f*(x, y) = π f_1(x, y) + (1 − π) f_0(x, y),

where π is a free regression parameter with constraint 0 ≤ π ≤ 1 and f_0(x, y) = f_{0,X}(x) f_{0,Y}(y).

[Algorithm 1 — Independence measure based on contingency tables (CT-BF). Require: prior parameter a; prior parameters for the DPMs and number of iterations N_it. Ensure: probability of dependence p_dep. Infer DPM models for f_{0,X}(x) and f_{0,Y}(y) using a Gibbs sampler; for each iteration 1 ≤ j ≤ N_it, record the vectors of cluster indicators, form the contingency table, and compute the posterior probability of dependence; return the average, p_dep.]
This model embeds both M 0 and M 1 for values of π equal to 0 or 1. The main idea of this method is to estimate from the data the mixture parameter π, which indicates the preference of the data for dependence model M 1 . In contrast to the latent contingency table procedure this approach requires the explicit construction of a joint model under hypothesis M 1 .
Since f_0 and f_1 are unknown densities, we assign them Bayesian nonparametric prior distributions. For f_{0,X}(x) and f_{0,Y}(y) we consider the DPM models defined by equations (7) and (8). For f_1 we take a bivariate DPM model,

f_1(x, y) = ∫ N((x, y) | µ, Σ) G(dµ, dΣ), with G ∼ DP(c_1, G_1),

where θ_{X,Y} = (µ, Σ), with µ a two-dimensional mean vector and Σ a 2 × 2 covariance matrix. The parameter π also has to be estimated, so we take a prior of the form π ∼ Be(a_0, b_0). We ensure that the centring measures G_0 and G_1 are comparable by setting their hyper-parameters consistently: the hyper-parameters c_µ, c_Ψ and k_0 are set to be equal for G_0 and G_1.
Our objective is to highlight pairwise dependence across many pairs of variables, and order the pairs into those showing evidence from strongest to weakest association. This motivates us to consider a simplified method by assessing the relative posterior predictive evidence under M 0 to that of M 1 , by calculating an ensemble model using the posterior predictive probability of the observed data f 1 (x new , y new |D) and f 0 (x new , y new |D) separately.
In the following we will use the notation f̂_0 and f̂_1 for the posterior predictive densities under M_0 and M_1. For all pairs we use the same prior, and hence the same model complexity across all pairs, so ranking by the improvement in posterior predictive likelihood under M_1 relative to M_0 should not a priori favour certain pairs over others. This procedure significantly simplifies the inference, as we can infer the posterior models by first fitting the three DPM models separately, each using the entire sample data, and then updating the ensemble parameter π from its posterior conditional distribution, which is a simple line search on [0, 1]. We will refer to this inference procedure as MixMod-ensemble; see Algorithm 2.
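Conditional on the fitted predictive densities, updating π is indeed a one-dimensional problem. The following sketch (our own illustration; in the actual method the per-observation log densities would come from the three DPM fits, here they are synthetic numbers) computes the posterior mean of π by numerical integration on a grid over [0, 1]:

```python
import numpy as np

def pi_posterior_mean(log_f1, log_f0, a0=0.5, b0=0.5, grid_size=1001):
    """Posterior mean of the ensemble weight pi on a grid over [0, 1].

    log_f1, log_f0: per-observation predictive log densities under the
    dependent and independent fits (taken as given numbers here).
    p(pi | D) is proportional to
    Be(pi | a0, b0) * prod_i (pi f1_i + (1 - pi) f0_i).
    """
    pi = np.linspace(1e-6, 1 - 1e-6, grid_size)
    # per-observation log of the two-component mixture, computed stably
    m = np.maximum(log_f1[:, None], log_f0[:, None])
    lik = np.sum(m + np.log(pi[None, :] * np.exp(log_f1[:, None] - m)
                            + (1 - pi[None, :]) * np.exp(log_f0[:, None] - m)),
                 axis=0)
    log_post = (a0 - 1) * np.log(pi) + (b0 - 1) * np.log(1 - pi) + lik
    w = np.exp(log_post - log_post.max())  # unnormalised posterior weights
    return float(np.sum(pi * w) / np.sum(w))

rng = np.random.default_rng(0)
# Toy case: the dependent model fits every point a bit better on average
log_f1 = rng.normal(-1.0, 0.3, size=200)
log_f0 = log_f1 - 0.2
print(pi_posterior_mean(log_f1, log_f0))  # close to 1
```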
An alternative approach, more closely resembling Kamary et al. [2014], is to consider M* as a mixture model rather than an ensemble model, where with probability π the data arise from f_1 and with probability 1 − π from f_0. Diebolt and Robert [1994] show that posterior sampling in a mixture model is simplified if we introduce latent variable indicators ζ_i ∼ Ber(π) that determine whether observation i comes from f_1, when ζ_i = 1, or from f_0, when ζ_i = 0. Conditional on these latent indicators, the mixture components f_0 and f_1 can be updated using only the data points allocated to each model. As noted by Kamary et al. [2014], the Gibbs sampler implemented in this way can become quite inefficient if the parameter π approaches the boundaries {0, 1}, especially for large sample sizes. We refer to this method as MixMod. For our purposes this requires specifying a Gibbs sampler for the mixture model utilising three DPM models {f_1(x, y), f_{0,X}(x), f_{0,Y}(y)} and the mixture allocations for points across all p × (p − 1)/2 pairs. In the paper we will illustrate the performance using MixMod-ensemble, and in the Supplementary Material we provide a comparison between MixMod and MixMod-ensemble.
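The latent-indicator scheme can be sketched as below. For illustration the component densities are held fixed, whereas the full MixMod procedure would refit the DPMs on the allocated points at each sweep, so this is a simplified sketch of the two conditional updates, not the authors' algorithm:

```python
import numpy as np

def mixmod_gibbs(f1, f0, a0=0.5, b0=0.5, n_iter=2000, burn=500, rng=None):
    """Gibbs sampler for the mixture weight pi, with the component density
    values f1(x_i, y_i) and f0(x_i, y_i) held fixed for illustration.

    zeta_i | pi ~ Bernoulli(pi f1_i / (pi f1_i + (1 - pi) f0_i))
    pi | zeta   ~ Be(a0 + sum(zeta), b0 + n - sum(zeta))
    """
    rng = np.random.default_rng(rng)
    n, pi, keep = len(f1), 0.5, []
    for it in range(n_iter):
        p1 = pi * f1 / (pi * f1 + (1 - pi) * f0)  # allocation probabilities
        zeta = rng.random(n) < p1                 # latent model indicators
        pi = rng.beta(a0 + zeta.sum(), b0 + n - zeta.sum())
        if it >= burn:
            keep.append(pi)
    return float(np.mean(keep))

rng = np.random.default_rng(0)
f1 = np.exp(rng.normal(-1.0, 0.3, 200))
f0 = f1 * np.exp(-0.5)  # the dependent model fits better everywhere
print(mixmod_gibbs(f1, f0, rng=1))
```

When the evidence strongly favours one component, most ζ_i stick to that component and π concentrates near the corresponding boundary, which is exactly the slow-mixing regime noted above.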
Regardless of the posterior inference procedure, different estimators of π could be obtained from its posterior distribution. We choose the posterior expectation as our statistic of dependence, that is,

π̂ = E(π | D) = ∫_0^1 π f(π | D) dπ. (13)

Computational tractability
Both of the Bayesian nonparametric approaches proposed here are motivated by the increasing necessity of screening large data sets for possible pairwise dependencies, where calculation of the formal Bayes factor under M_0 and M_1 is infeasible or undesirable. In this section we discuss some computational advantages of our two methods, including their amenability to implementation on modern computing architectures, exploiting parallelisation on multi-core standalone machines, clusters of multi-core and many-core machines, or cloud-based computing environments.
In relation to parallelisation, we note that both methods divide into two steps: one first infers the DPMs using a Gibbs sampler, and then performs a dependence test using the Gibbs output. For CT-BF, the p marginal DPMs can be fitted in parallel; the output is then combined and we perform N_it × p × (p − 1)/2 independent tests, where each test, following (5), only involves computing ratios of Gamma functions. As an illustration, in the example described in more detail in Section 4, for p = 562 measurement variables, the first stage of inference on the DPMs takes less than 3 minutes on a 48-core machine, and the resulting 1.5 × 10^8 pairwise tests of dependence for all pairs of variables are performed in one hour.
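Because each pairwise test only reads pre-computed Gibbs output, the tests are embarrassingly parallel. A sketch of the fan-out over worker processes (illustrative Python with a dummy per-pair score; in CT-BF the score would be the Gamma-function ratio test on the stored contingency tables):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def pair_score(args):
    # Placeholder for one cheap per-pair test (in CT-BF: gamma-function
    # ratios on the latent contingency tables for variables i and j)
    i, j = args
    return (i, j, abs(i - j) / (i + j + 1.0))  # dummy score

def all_pairwise(p, max_workers=4):
    """Fan the p(p-1)/2 pairwise tests out over worker processes; each test
    only reads pre-computed Gibbs output, so no communication is needed."""
    pairs = list(combinations(range(p), 2))
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pair_score, pairs, chunksize=64))

if __name__ == "__main__":
    results = all_pairwise(20)
    print(len(results))  # 20 * 19 / 2 = 190 pairwise scores
```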
In comparison, the MixMod-ensemble approach incurs a greater computational overhead, as we require the bivariate DPMs f_1(x, y) to be fitted for all pairs. In the illustration below, the MixMod-ensemble procedure for all 157,641 pairs takes approximately 36 hours on the same 48-core machine.

World Health Organisation dataset
In this section, we apply the two approaches described in Section 3 to detect dependencies in economic, social and health indicators from the World Health Organisation (WHO). The WHO Statistical Information System (WHOSIS) has recently been incorporated into the Global Health Observatory (GHO), which contains a data repository (http://www.who.int/gho/database/en/) with mortality and global health estimates, demographic and socioeconomic statistics, as well as information regarding health service coverage and risk factors for 194 countries. We combined these datasets to obtain a set of 562 statistics per country. We aim at highlighting potential dependencies between these indicators. Scatterplots of some of these indicators are represented in Figure 1, where for example we see, unsurprisingly, strong dependence between indicators such as life expectancy at birth and life expectancy at age 60 (Pair E).
We applied both the CT-BF and the MixMod-ensemble tests to compute the probability of dependence for all the 157,641 pairs of indicators. The two proposed methods require the specification of several hyper-parameters. For the CT-BF approach, the parameters of the DPM models (7) and (8) are set as follows: c_0 = 10, µ_0 ∼ N(0, 1), k_0 ∼ Ga(1/2, 100/2), ν = 3 and ψ ∼ IGa(1/2, 5). Note that c_0 controls the number of clusters induced, so in order to avoid having partitions with only one cluster we set this parameter at a relatively large value. To specify the Dirichlet prior for the cell probabilities in the contingency table we took α_kl = 1/2, which is the Jeffreys prior in a multinomial model.
In experimentation we found that the contingency table approach can be sensitive to the choice of the parameter c_0. This parameter influences the number of clusters in the DPM model, and therefore the size of the contingency tables, so it is important to specify a value that induces a reasonable number of clusters. We recommend exploring several values.
Results seem fairly insensitive to the choice of the parameters α kl in the Dirichlet priors.
For the approach based on the ensemble mixture model, the parameters c_0 and c_1 are not fixed but are assigned priors c_0, c_1 ∼ Ga(1, 1), with µ_0 ∼ N(0, 100). This change was introduced to allow the model to determine the best fit without constraining the number of clusters. In addition, the centring measures G_0 and G_1 are defined as G_{d−1} = N(µ | µ_0, (1/k_0)Σ) IW(Σ | ν, Ψ) for d = 1 and 2, with ν = d + 2, the d-dimensional vector µ_0 ∼ N(0_d, 100 I_d), the d × d matrix Ψ ∼ IW(ν, 0.1 I_d) and k_0 ∼ Ga(1/2, 50), where I_d is the identity matrix of dimension d. The prior distribution of the mixing proportion π was specified by taking a_0 = b_0 = 1/2. Our experience is that results are fairly robust to the prior parameter settings (see the Supplementary Material).
The procedures were implemented in the environment for statistical computing R [R Core Team, 2014] and make use of the package DPpackage [Jara et al., 2011]. Chains were run for 10,000 iterations with a burn-in of 1,000, keeping one of every 5 draws for computing estimates.
For both approaches, the tests were performed only for pairs with measurements for at least 10 countries. For the CT-BF approach, the 562 DPMs are inferred using all the available data; however, the contingency tables were constructed taking into account only the countries for which both indicators in the pair are available. For the MixMod-ensemble approach, in order to avoid any bias towards one of the two models M_0 or M_1, both the DPMs on the marginals and the DPM on the joint space are inferred using only the countries for which measurements are available for both indicators. Extending the method to handle missing data is a future objective.
The measures of dependence obtained following our two approaches, i.e. p_dep for CT-BF and π̂ for MixMod-ensemble, defined respectively in equations (9) and (13), are compared for each pair of variables in Figure 2 (left panel). Strong dependence (defined as p_dep > 0.8) is detected for 5% of the pairs, and credible independence (i.e. p_dep < 0.2) for 30% of the indicator pairs. We observe that the two probabilistic measures of dependence generally agree for most pairs, with the value obtained following the MixMod-ensemble method being generally higher than that obtained following the CT-BF approach. This elevation in the evidence for dependence is perhaps to be expected, as MixMod-ensemble uses the conditional posterior predictive likelihood, which will favour the more complex joint model f_1(x, y). However, the two methods disagree (defined as the probability value obtained following one method being lower than 0.2 while larger than 0.8 following the other) for fewer than 0.36% of the pairs; these differences mainly occur when one of the (X, Y) variables is equal to 0 for more than 20% of the countries (see for example pairs C and D).
On balance we prefer the CT-BF approach due to its computational scalability: 1 hour of run-time on a 48-core computer in comparison with 36 hours for MixMod-ensemble in this example. We compared the analysis from the CT-BF to that using a mutual information approach computed using the 20-nearest-neighbours method, as in Kinney and Atwal [2014].

Figure 2: Comparison of the dependence measures defined in equations (9) and (13), approximated following Algorithms 1 and 2. The letters A to F correspond to the 6 pairs of indicators illustrated in Figure 1.
We remark that some pairs of variables judged strongly dependent under CT-BF have a wide spread of mutual information values; in particular, pairs D and F have a probability of dependence close to 1 under CT-BF but relatively low MI values. Visually, at least, one could argue that associations of the form seen in Figure 2 D and F may be of potential interest for the analyst to follow up.
Simulated data

For the sinusoidal, parabolic and circular models, the parameter φ controls the level of noise, whereas for the normal model the correlation ρ controls the degree of dependence between the two samples. We generated fifty independent datasets from each model, with sample size n = 250 and different correlations ρ ∈ {0, 0.1, 0.3, 0.5, 0.9} for model (a), and levels of noise φ ∈ {1, 2, 3, 4, 5} for models (b)-(d). Figure 3 shows one of the fifty simulated datasets as an illustration.
For all the simulated datasets we apply our different procedures for testing the hypotheses in (1). We use the same prior specifications as described in Section 4.1.
To investigate the power of the two approaches, we create ROC curves that compare the rate of true positives (percentage of times the procedure detects dependence among the fifty datasets generated from a dependent model) and false positives (percentage of times the procedure detects dependence among fifty null datasets generated by randomly permuting the indices of the two samples to destroy any dependence) for different threshold values. We also compare the performance of the proposed methods to the current state-of-the-art conventional method, which is based on mutual information (using the 20 nearest neighbours). The ROC curves are reported in Figure 4; see also the Supplementary Material, which contains additional, more extensive comparisons.
We observe that the proposed methods have similar performance to the current leading conventional method for data coming from a sinusoidal or a parabolic model. For data generated from the circular model, however, the mutual information method outperforms our approaches.

Conclusion
We presented two Bayesian nonparametric procedures for highlighting pairwise dependencies between random variables that are scalable to large data sets. The methods make use of standard software in R for implementing DPMs of Gaussians and are designed to exploit modern computer architectures. As such they are readily accessible to applied statisticians interested in the exploratory analysis of large data sets. A power analysis shows that the procedures are comparable to current non-Bayesian methods based on mutual information, while having the advantage of being probabilistic in their measurement.

A Comparison between variants of the two approaches
In Section 3 we described two Bayesian nonparametric approaches to highlight dependencies between two random variables, and for each approach we mentioned different variants. Here we provide a comparison of these variants on the simulated datasets described in Section 4.2.
For the contingency table approach, the main method consists in using all potential partitions coming from the Gibbs sampler, performing the test at each iteration and reporting the average probability of dependence (over all iterations). An alternative, as mentioned in Section 3, is based on a single partition selected using an optimisation criterion; we refer to this approach as CT-BF-1-clust.
Regarding the mixture model approach, an alternative to the posterior predictive approach of MixMod-ensemble is the more conventional approach (called MixMod) described in Algorithm 3. It consists in iteratively allocating the data to the independent or the dependent model and inferring each model based only on the data allocated to it.
We compare the four approaches (two variants for each approach) in terms of their frequentist power on the simulated examples of Section 4. The ROC curves are reported in Figure 5. As expected, for all models the statistical power increases for larger correlation values and decreases for larger levels of noise. We observe that the CT-BF approach using all iterations of the Gibbs sampler has significantly more power than the variant using a single partition. Regarding the mixture model approach, the two-step posterior predictive approach is slightly more powerful, and computationally much cheaper, than the iterative approach.

B Sensitivity to prior choices
All the proposed methods require the specification of several hyper-parameters controlling the prior distributions of the DPMs and also the tests themselves. We have investigated the impact of these choices and observed that none of the approaches is sensitive to the choice of the hyper-parameters controlling G_0 or G_1 in the DPM models. In the following, we illustrate the impact of the other hyper-parameters on simulated data sampled from the sinusoidal model described in Section 4.2.
The contingency table approach is highly sensitive to the choice of the parameter c_0. This parameter influences the number of clusters in the DPM model and therefore the size of the contingency tables. The frequentist power of the method on the simulated sinusoidal example increases with c_0 (see Figure 6). However, the ROC curves are fairly insensitive to the choice of the parameter a controlling the Dirichlet prior for the cell probabilities, α_kl = a for all k and l. We suggest using a = 0.5, as that value spreads the probability of dependence over the interval (0, 1) for the different levels of additive noise (see

C R code
In this section, we provide the R code that has been used to run our two independence tests on the WHO dataset. The data are stored in a matrix of size 194 × 562, called data.
The first subsection presents the file containing the main functions of interest, and the second shows how to call these routines in parallel.