A Bayesian nonparametric approach to testing for dependence between random variables

Nonparametric and nonlinear measures of statistical dependence between pairs of random variables are important tools in modern data analysis. In particular, the emergence of large data sets can now support the relaxation of linearity assumptions implicit in traditional association scores such as correlation. Here we describe a Bayesian nonparametric procedure that leads to a tractable, explicit and analytic quantification of the relative evidence for dependence vs independence. Our approach uses Polya tree priors on the space of probability measures, which can then be embedded within a decision theoretic test for dependence. Polya tree priors can accommodate uncertainty in the form of the underlying sampling distribution and provide an explicit posterior probability measure of both dependence and independence. Well known advantages of having an explicit probability measure include: easy comparison of evidence across different studies; encoding prior information; quantifying changes in dependence across different experimental conditions; and the integration of results within formal decision analysis.


Introduction
Quantifying the evidence for dependence, or testing for departures from independence, between random variables is an increasingly important task and has been the focus of a number of studies in the past decade. A typical motivating example comes from the field of biology, where a growing abundance of genetic, proteomic and transcriptomic data is being produced. In order to unravel the existing relationships between different molecular species (genes, proteins, ...) involved in a biological system, large datasets are commonly screened for evidence of association between pairs of variables. This requires adequate statistical procedures to quantify the evidence of dependence (or lack of independence) between two samples of typically continuous random variables.
In this article, we propose a Bayesian nonparametric procedure to derive a probabilistic measure of dependency between two samples x and y without assuming a known form for the underlying distributions. In particular, let M_0 denote a model, or hypothesis, of independence and M_1 a model, or hypothesis, of dependence. The posterior probability, p(M_1 | x, y), is then a natural measure of the strength of evidence for dependence between the two samples against independence. The Bayes Factor quantifying the relative evidence in the data in favour of M_1 over M_0 is simply

BF = p(x, y | M_1) / [p(x | M_0) p(y | M_0)],

which is the ratio of the prior predictive probabilities of the observed data given the two competing hypotheses. This Bayes Factor implicitly avoids conditioning on the form of the unknown distribution functions. The role of Bayesian nonparametrics is to allow one to accommodate this uncertainty via a prior measure on the space of probability measures, for instance, p(x, y | M_1) = ∫ f(x, y | F) π(dF | M_1), where π(·) is a Bayesian nonparametric prior with wide support over the space of probability measures on the joint sample space Ω_X × Ω_Y; see for example Hjort et al. (2010).
We use Pólya tree priors (Lavine 1992; Mauldin et al. 1992; Lavine 1994) to model the unknown distributions of the data. We show that the use of such priors leads to an analytic derivation of our association measure p(M_1 | x, y). In particular, this measure of dependence involves a finite analytic calculation even though the Pólya tree prior is defined over an infinite recursive partition. Pólya tree priors have previously been used to derive Bayesian nonparametric procedures for two-sample hypothesis testing (Holmes et al. 2015; Ma and Wong 2011), and extensions of these priors have been proposed to model distributions indexed by covariates (Trippa et al. 2011). The "two-sample testing" problem is different to that considered here in that it considers the same measurement, or outcome, Y, measured on independent samples under different conditions and tests for evidence of the "treatment" or covariate effect, whereas our paper is concerned with exploring evidence for statistical association between two joint measurement variables, {Y, X}, recorded together on a set of samples. However, our approach exploits a similar framework to the testing procedure from Holmes et al. (2015). In particular, our association measure necessitates the construction of Pólya tree priors on a two-dimensional space, and as discussed at the end of the paper, this engenders new challenges regarding the partitioning scheme to be employed. It is worth noting that Pólya trees offer a more appropriate nonparametric model than, say, Bayesian histograms with Dirichlet priors (Leonard 1973), as the recursive tree structure of the Pólya tree is indexed on the measurement variable, whereas the Dirichlet prior for histograms is for unordered categorical data and local dependence between measurement bins must be introduced via a hierarchical prior. Moreover, the Pólya tree is defined via an infinite sequence of partitions, bypassing the need to truncate at some level, and, as noted above, our approach can compute the Bayes factors from the infinite sequence.
Numerous frequentist approaches have been developed for identifying associations between two samples (Shannon and Weaver 1949; Cover and Thomas 1991; Reshef et al. 2011; Gretton and Györfi 2010) but to the best of our knowledge this is the first Bayesian nonparametric procedure to quantify the relative evidence of dependence vs independence. Being able to provide an explicit probability of dependence is attractive for a number of reasons. First, it can be combined with a variety of probabilistic approaches. In particular, it can be easily merged with the decision theory framework in order for optimal statistical decisions to be made in the face of uncertainty. Another important property of probabilistic measures is their interpretability. Indeed, a given level p = p(M_1 | x, y) of this measure is exactly the probability of a dependent generative model given the data, and the probability of an independent generative model is simply 1 − p. Over and above the standard arguments in favour of Bayesian inference, one explicit consequence of the coherence is that we can explicitly quantify the evidence for a change in dependence between two variables across two or more conditions. For example, there may be evidence that two dependent variables {X, Y} become independent on application of a treatment, or across disease states. Answering such questions is problematic from a non-Bayesian perspective, as a null hypothesis of dependence is of higher dimension than the corresponding alternative hypothesis of independence, which is nested under the null. This makes the quantification of a p-value extremely challenging. In Bayesian analysis, the symmetry of the model space makes for a simple and intuitive solution. In Section 4 we illustrate this issue using an important application in cancer genetics.
The remainder of the paper is organised as follows. We first introduce Pólya tree priors in Section 2 and summarise their main properties. In Section 3, we describe our nonparametric procedure to test for dependencies between two samples. We then illustrate in Section 4 our approach on data generated from simple models, and then apply our procedure to two real-world problems from biology. In the Appendix we provide an empirical calibration comparing our method to that of other non-Bayesian approaches in the literature.

Pólya tree priors
Pólya trees form a class of distributions for random probability measures F on some domain Ω (Ferguson 1974) by considering a recursive partition of Ω into disjoint measurable spaces and constructing random measures on each of these spaces. A binary recursive partitioning is typically used for one-dimensional domains: Ω is divided into two disjoint sets C_0 and C_1, which are themselves divided into two further disjoint sets, C_0 = C_00 ∪ C_01 and C_1 = C_10 ∪ C_11, and so on. The infinite recursive partition is denoted by C = {C_j, j = 0, 1, 00, 01, 10, 11, . . .}; the partition at level k comprises the 2^k sets C_j where j ranges over all binary sequences of length k.
To better understand the probability measure constructed on these nested partitions, one can think of a particle going down the tree shown in Figure 1 (Left); at each junction j, usually represented in binary format, the particle has a random probability θ_j of choosing the left branch. In Pólya trees, the random branching probability follows a Beta distribution, with θ_j ∼ Beta(α_j,(0), α_j,(1)). Given the partition C, the sequence of non-negative vectors A = {α_j,(0), α_j,(1)}_j and the sequence of realisations of the random branching variables Θ = {θ_j}_j, it is possible to compute the likelihood for any set of observations x:

p(x | Θ, C, A) = ∏_j θ_j^{n_j0} (1 − θ_j)^{n_j1},

where the product is over the set of all partitions and n_j0 and n_j1 denote the number of observations that lie in the partitions C_j0 and C_j1 respectively. The Beta prior on the partition probability is conjugate to the Binomial likelihood and, integrating out θ_j for all j, we obtain

p(x | C, A) = ∏_j B(α_j,(0) + n_j0, α_j,(1) + n_j1) / B(α_j,(0), α_j,(1)),

where B(·, ·) refers to the Beta function. For more details on Pólya tree priors, we refer the reader to Ferguson (1974); Lavine (1992); Mauldin et al. (1992); Lavine (1994); Ghosh and Ramamoorthi (2003); Wong et al. (2010).
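The Beta-Binomial conjugacy above lends itself to a short computational sketch. The snippet below computes the log marginal likelihood of a one-dimensional sample under a Pólya tree on [0, 1] with dyadic splits; the truncation at a finite depth and the per-level choice α_{j,(i)} = ck² (anticipating Section 3) are illustrative assumptions, not part of the construction itself.

```python
import math

def polya_tree_log_marginal(x, depth=8, c=1.0):
    """Log marginal likelihood of a sample on [0, 1] under a truncated
    Polya tree with dyadic splits, via the Beta-Binomial conjugacy above.
    Assumptions for illustration: truncation at `depth` levels and the
    per-level parameter alpha_j = c * k**2."""
    logml = 0.0
    nodes = {(): list(x)}              # key: binary path to C_j, value: points in C_j
    for k in range(1, depth + 1):
        alpha = c * k ** 2
        new_nodes = {}
        for path, pts in nodes.items():
            lo, hi = _interval(path)
            mid = 0.5 * (lo + hi)
            left = [p for p in pts if p < mid]
            right = [p for p in pts if p >= mid]
            n0, n1 = len(left), len(right)
            if n0 + n1 > 0:
                # Beta(alpha, alpha) prior integrated against the split counts
                logml += (math.lgamma(alpha + n0) + math.lgamma(alpha + n1)
                          - math.lgamma(2 * alpha + n0 + n1)
                          + math.lgamma(2 * alpha) - 2 * math.lgamma(alpha))
            new_nodes[path + (0,)] = left
            new_nodes[path + (1,)] = right
        nodes = new_nodes
    return logml

def _interval(path):
    """Dyadic subinterval of [0, 1] addressed by a binary path."""
    lo, hi = 0.0, 1.0
    for b in path:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if b == 0 else (mid, hi)
    return lo, hi
```

A sanity check consistent with the construction: a single observation contributes exactly log(1/2) per level, whatever the value of α, since B(α+1, α)/B(α, α) = 1/2.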
In this paper, we are interested in testing independence between two samples x and y. We therefore need to consider the joint space Ω_X × Ω_Y of the two sample spaces. For reasons that will become obvious later on, we recursively subdivide this space into four rectangular regions. We start by partitioning Ω_X × Ω_Y into four sets C_0, C_1, C_2 and C_3, and continue with nested partitions defined by C_j = C_j0 ∪ C_j1 ∪ C_j2 ∪ C_j3 for any quaternary sequence j. Thus the partition at level k is formed of the 4^k sets C_j where j ranges over all quaternary sequences of length k. We assume that the sets C_j are rectangular, i.e. can be written as a Cartesian product D × E where D ⊂ Ω_X and E ⊂ Ω_Y. We arbitrarily choose to denote by C_j0 the bottom-left region of the set C_j, C_j1 the bottom-right region, C_j2 the top-left region, and C_j3 the top-right region, for all j. This recursive partition C = {C_j, j = 0, 1, 2, 3, 00, 01, 02, 03, 10, . . .} is illustrated in Figure 1B. Similarly to the one-dimensional case, a probability measure can be constructed on this recursive partition by defining random branching probabilities in the recursive quaternary partition. In the following we will use different distributions for these branching probabilities.
A Bayesian nonparametric measure of dependence

The approach
Given an N-sample (x, y) of i.i.d. realisations of the random vector (X, Y), we wish to evaluate the evidence for the competing hypotheses: M_0: X and Y are independent random variables; M_1: X and Y are dependent random variables. We denote by F_XY the unknown joint probability distribution of (X, Y) and by F_X and F_Y the two unknown marginal distributions. Following a Bayesian approach, we aim at estimating the posterior probability

p(M_1 | x, y) = p(x, y | M_1) p(M_1) / [p(x, y | M_1) p(M_1) + p(x | M_0) p(y | M_0) p(M_0)],

Figure 1: (Left) Construction of a Pólya tree distribution in the one-dimensional case: at each junction j the particle has a random probability θ_j of choosing the left branch and 1 − θ_j of choosing the right one. (Right) Illustration of the first two levels of the quadrant partitioning scheme in the two-dimensional case.
where p(M_1) represents prior beliefs regarding the competing hypotheses. We specify our uncertainty in the distribution F_XY via a Pólya tree prior. Denoting by Ω_X and Ω_Y the domains of the probability measures F_X and F_Y respectively, we consider a recursive quaternary partition of Ω_X × Ω_Y into disjoint measurable sets as described in the previous section.
To compare the evidence in favour of the two hypotheses, we compute the posterior odds

p(M_1 | x, y) / p(M_0 | x, y) = BF × p(M_1) / p(M_0),

where the first term is the Bayes factor, which can be written as a product over all partitions of local factors b_j (equation (5)), with b_j defined below. From equations (3) and (4) and expressing Beta and multinomial Beta functions in terms of Gamma functions, we obtain an explicit expression for b_j, given in equation (6). The product in (5) is defined over the infinite set of partitions. However, for any set C_j containing zero or one data point (i.e. such that n_j0 + n_j1 + n_j2 + n_j3 ≤ 1), we have b_j = 1 (see the Appendix). Therefore, only subsets with at least two data points contribute to this product. The Bayesian measure of the strength of evidence for dependence between the two samples against independence thus involves a finite analytic calculation even though (5) is over the infinite number of levels in the tree.
The procedure is described in Algorithm 1. For each set C_j containing more than one data point, the term b_j measures the relative evidence in favour of M_0 given the number of data points falling in each of the four quadrants of C_j. Intuitively, for each set C_j we perform a Bayesian independence test based on the local two-by-two contingency table of quadrant counts. Pólya tree priors provide us with a theoretical framework to perform these "local" independence tests at every level while taking into account potential dependencies on neighbouring sets. In addition, they allow us to compute the Bayes Factor analytically without having to choose any arbitrary level or any truncation. The parameter a_j, which increases with the depth of the set C_j, enables us to give more importance to "local" independence tests at the higher levels than at the deepest levels. In the next subsection, we investigate the impact of the choice of this parameter.
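The node-level factor b_j can be sketched in code. We assume here, as our reading of the construction above rather than a verbatim transcription of equation (6), that b_j is the ratio of the marginal likelihood of the quadrant counts under independence (two Beta-Binomial splits with parameter 2a_j, matching Section 3.2) to that under dependence (a Dirichlet-multinomial over the four quadrants).

```python
from math import lgamma

def log_beta(*a):
    """Log (multinomial) Beta function: sum of lgamma(a_i) minus lgamma(sum)."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))

def log_b_j(n, a):
    """Log of the local factor b_j for one set C_j.

    `n` = (n_j0, n_j1, n_j2, n_j3): counts in the bottom-left, bottom-right,
    top-left and top-right quadrants; `a` = a_j.  Assumption: b_j compares
    the independence model (two Beta-Binomial marginal splits, parameter 2*a)
    with the dependence model (Dirichlet-multinomial over the quadrants)."""
    n0, n1, n2, n3 = n
    left, right = n0 + n2, n1 + n3      # marginal split along x
    bottom, top = n0 + n1, n2 + n3      # marginal split along y
    log_p_indep = (log_beta(2*a + left, 2*a + right) - log_beta(2*a, 2*a)
                   + log_beta(2*a + bottom, 2*a + top) - log_beta(2*a, 2*a))
    log_p_dep = log_beta(a + n0, a + n1, a + n2, a + n3) - log_beta(a, a, a, a)
    return log_p_indep - log_p_dep
```

Consistent with the text, a set containing at most one data point contributes log b_j = 0, so it drops out of the product; balanced counts favour M_0 (positive log b_j), while counts concentrated on a diagonal favour M_1 (negative log b_j).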

Sensitivity to choice of A
The proposed procedure relies on a choice of the sequences of non-negative vectors A, A_X and A_Y. As discussed above, the α parameters are constant per level and such that α_j,(i) = ck², where k is the depth of the set C_j. In addition, α_j,X,(i) = α_j,Y,(i) = 2ck².
Algorithm 1 Bayesian nonparametric evidence for independence
1. Fix the quadrant partitioning scheme; choose a constant c.
2. For every set C_j containing more than one data point, compute b_j defined in (6) with a_j = ck², where k is the depth of the set C_j.
3. Assuming equal prior belief for both hypotheses, compute the posterior probability p(M_1 | x, y) from the product of the b_j.

The parameter c controls the speed of divergence of the α parameters with the depth k, and therefore the relative contribution of each level of the partition to the Bayes Factor. We have investigated the impact of the setting of c (see Figures S1 and S2 in the Supplementary Material) and observe that small values of c typically favour the simpler model (M_0), especially when the number of samples is small and there is not enough evidence to determine M_1. This is to be expected as Bayesian modelling encompasses a natural Occam factor in the prior predictive (see for example chapter 28 of MacKay (2003)). We have found c = 5 to be a reasonable canonical choice but practitioners are strongly advised to explore the setting for their own analysis.

Choice of the partition
Basic approach: partition centred on the median of the data

The inference resulting from a Pólya tree model is known to strongly depend on the specification of the partition C over the data space (Paddock et al. 2003), and a multitude of quadrant partitioning schemes could be used in our procedure. As a default, partially subjective approach, we suggest constructing a partition based on the quantiles of two normal distributions (for the x- and y-axis respectively). In other words, both variables x and y are transformed through the inverse cumulative distribution function of a normal distribution; a quaternary recursive partition of [0, 1] × [0, 1] is then constructed by subdividing it into four rectangular quadrants of equal size (see Figure 2, Left). The mean and standard deviation of the normal distributions can be derived from empirical estimates of the location and spread of the two samples. In the next section we use the median and the median absolute deviation, as this choice induces robustness to potential outliers.
A simple partial optimising procedure for partition centring

In the previous paragraph, we

Figure 2: Construction of the partition. Both the basic approach and the shifted approach are illustrated on a simulated sinusoidal dataset with some outliers. Under the basic approach (Left column), the data are marginally transformed via the inverse of the c.d.f. of normal distributions. The shifted approach consists in shifting the central location of the partition by a factor δ and wrapping the data around: the data space is divided into two regions Z_1 = {(x, y), x ≤ δ} and Z_2 = {(x, y), x > δ} which are then juxtaposed. The obtained "shifted" data are then normalised via the inverse of the c.d.f. of a normal distribution. Quaternary recursive partitions of [0, 1] × [0, 1] are constructed by subdividing the normalised data into four rectangular quadrants of equal size (Bottom panels).
suggest centring the partition on the median (m_x, m_y) of the data. For such a choice of partition, at the top level of the Pólya tree our procedure tests whether the distribution of Y conditional on X ≤ m_x is equal to the distribution of Y conditional on X > m_x. The procedure performs the test symmetrically on the x-axis and the y-axis. Instead of focusing on the median, it would be more informative to test whether the distribution of Y conditional on X ∈ D_X is equal to the distribution of Y conditional on X ∈ D_X^C, for any compact set D_X ⊂ Ω_X, where D_X^C denotes the complement of D_X. In this section, we consider a simple partial optimisation of the partition-centring location by considering different compact sets D_X. The approach involves shifting the central location of the partition as defined by the top-level split, and wrapping the data round to maintain balance in the number of points in each region. Consider a real number δ ∈ [a, b], where a and b are respectively the minimum and maximum values of the data on the x-axis. We denote by ψ_δ the transformation that divides the data space into two regions Z_1 = {(x, y), x ≤ δ} and Z_2 = {(x, y), x > δ} and juxtaposes them, as illustrated in Figure 2, Right. The obtained "shifted" data are then transformed through the inverse cumulative distribution function of normal distributions and a quaternary recursive partition of [0, 1] × [0, 1] is constructed as in the basic approach. We denote the obtained partition by C_δ.
We consider optimising the marginal evidence of dependence p(M_1 | x, y) by maximising the Bayes factor as defined in equation (5) over all the partitions C_δ for δ ∈ [a, b]. The obtained probability of dependence is therefore computed from max_δ B_δ, where B_δ designates the Bayes Factor given the partition C_δ. This approach is called the "empirical Bayes approach" in the rest of the paper. Note that optimising the central location of the partition with respect to p(M_1 | x, y) will naturally tend to inflate the evidence for M_1. However, when testing many pairs of random variables for evidence of dependence we are mainly concerned with the ranking of the pairs for further analysis, rather than explicit quantification of the evidence, and partial optimisation of the partition may well help to produce more stable and accurate rankings.
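The shift-and-wrap map ψ_δ can be sketched as below; since its formal definition is elided in the text above, the exact formula here is an assumption that matches the verbal description (points with x ≤ δ are moved past the right end of the range, so Z_2 is followed by Z_1).

```python
def wrap_shift(xs, delta):
    """One reading of the shift-and-wrap map psi_delta: x-values at or below
    `delta` are translated past the right end of the data range, so the region
    Z2 = {x > delta} is followed by Z1 = {x <= delta}.  The exact formula is
    an illustrative assumption, not a transcription from the paper."""
    a, b = min(xs), max(xs)
    return [x + (b - a) if x <= delta else x for x in xs]
```

In the empirical Bayes approach one would evaluate the Bayes factor B_δ on a grid of δ values, applying this map to the x-coordinates (and its analogue to the y-coordinates) before the marginal normalisation, and retain the partition C_δ maximising B_δ.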

Applications
In this section, we illustrate the performance of our Bayesian nonparametric procedure for detecting dependence across different datasets. We first test the procedure on simple models proposed by Kinney and Atwal (2014) and then apply it to two real examples from biology.
We generated 500 independent data sets of size N for each of the 5 models, and varied the level of noise (σ) and the number of data points (N). We used our procedure to compute the probability of the hypothesis M_1 given each of these datasets. The partition structure was set using the robust mean and standard deviation of the data, and the parameter c was set equal to 5. The frequency distribution (over the 500 independent runs) of the probability of the hypothesis M_1 for each model, for varying numbers of data points and levels of noise, is shown in Figure 3 (Middle rows). The red curve represents the median while the light and dark grey areas designate the zone between the 5th and 95th percentiles and the inter-quartile region respectively.
As expected, the probability that the two samples are dependent is equal to 0.5 for every model when N = 1, as we assumed equal prior belief in both hypotheses. The probability of M_1 increases as the number of data points increases and is very close to one for every model if N is larger than 4000. It is interesting to note that when the number of samples is small there is not enough evidence to determine M_1 and the Bayes Factor may favour the simpler model M_0. As mentioned previously, this is to be expected as Bayesian modelling encompasses a natural Occam factor in the prior predictive (see for example chapter 28 of MacKay (2003)). This effect is stronger for smaller values of the parameter c (see Figures S1 and S2 in the Supplementary Material).
The logarithm of the Bayes Factor defined in equation (5) can be decomposed in terms of levels in the recursive partition, with each per-level term B_k collecting the contributions of all sets C_j at level k, and the contribution of each level in favour of the independence or dependence model can be investigated. When B_k is close to 0, the contribution of level k is negligible, whereas large positive (resp. negative) B_k indicates stronger evidence in favour of independence (resp. dependence) at level k. In Figure 3 (Bottom row), we show the distribution (over 500 independent runs) of B_k for k ∈ {1, 2, 3, 4, 5} for a sample of N = 150 data generated from the 5 different models with σ = 2. We observe that in the linear model, the dependence can already be detected at the first level, with B_1 being strongly negative for most of the generated datasets. However, for the four other models, the value of B_1 is mostly positive and the top level does not contain enough information to detect that there are dependencies between the two variables. In these examples, most of the information in favour of the dependence model is in the second level. Deeper levels contribute less to the decision; this is due to the form of the parametrisation, with the α parameters being proportional to k². This decomposition of the evidence across levels is an attractive qualitative feature of the Pólya tree testing framework, which can assist the statistical analyst in better understanding the dependence structure.
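The level decomposition can be sketched end-to-end. The routine below recursively partitions data already mapped to the unit square and accumulates, for each level k, the sum B_k of the local log b_j terms. As before, the node-level factor is our reading of equation (6) (independence vs dependence marginal likelihoods with a_j = ck²), and the depth truncation is for illustration only.

```python
from math import lgamma

def log_beta(*a):
    """Log (multinomial) Beta function."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))

def level_contributions(pts, max_depth=5, c=5.0):
    """Per-level contributions [B_1, ..., B_max_depth] to the log Bayes
    factor, for points in the unit square.  B_k sums log b_j over all sets
    C_j at level k; sets with fewer than two points contribute zero."""
    B = [0.0] * (max_depth + 1)
    stack = [(pts, 0.0, 1.0, 0.0, 1.0, 1)]   # (points, x0, x1, y0, y1, level)
    while stack:
        p, x0, x1, y0, y1, k = stack.pop()
        if len(p) < 2 or k > max_depth:
            continue
        xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
        quads = [[], [], [], []]             # bottom-left, bottom-right, top-left, top-right
        for (x, y) in p:
            quads[(1 if x >= xm else 0) + (2 if y >= ym else 0)].append((x, y))
        n0, n1, n2, n3 = (len(q) for q in quads)
        a = c * k ** 2
        log_p_indep = (log_beta(2*a + n0 + n2, 2*a + n1 + n3) - log_beta(2*a, 2*a)
                       + log_beta(2*a + n0 + n1, 2*a + n2 + n3) - log_beta(2*a, 2*a))
        log_p_dep = log_beta(a + n0, a + n1, a + n2, a + n3) - log_beta(a, a, a, a)
        B[k] += log_p_indep - log_p_dep
        stack.append((quads[0], x0, xm, y0, ym, k + 1))
        stack.append((quads[1], xm, x1, y0, ym, k + 1))
        stack.append((quads[2], x0, xm, ym, y1, k + 1))
        stack.append((quads[3], xm, x1, ym, y1, k + 1))
    return B[1:]
```

For data concentrated on the diagonal the level-1 contribution is strongly negative (evidence for dependence), while for a balanced grid it is positive, in keeping with the sign convention discussed above.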
The symmetric nature of the probability, p(M_0 | x, y) = 1 − p(M_1 | x, y), allows us to explore the ability to detect independence. The probability of the independent hypothesis M_0 given data sampled from two independent standard normal distributions is shown in Figure 4 for an increasing number of data points (N). The probability of the independent hypothesis increases with N and is very close to one if N is larger than 500. Such a measure of independence between variables is problematic to compute for non-Bayesian methods, as we are testing for a simpler model nested within a more complex one. Frequentist approaches use a p-value which is conditional on M_0 being true. Hence, as it stands, it cannot be used as evidence for M_0.
The performance of our Bayesian nonparametric approach varies between models, both in terms of the number of data points required to detect dependence and in terms of noise sensitivity: dependencies are detected for the circular and checkerboard models even for relatively high levels of noise and relatively small numbers of data points, but less so for the linear model, which visually appears closer to independence (top plots in Figure 3). In addition, our approach necessitates a relatively high number of data points (N ≥ 300) to detect dependence in the parabolic or the sinusoidal models, even for a level of noise σ = 2. The lack of dependence detection in these last two models may be due to the symmetry of the generated data relative to the choice of the partitioning scheme. To overcome this issue, we suggest changing the partitioning scheme by optimising the partition centring as described in Section 3.3. Figure 5 shows that the evidence for dependence is strongly increased for those two models when running the empirical Bayes approach, which consists of maximising the marginal probability under the dependent model over the shifted partition scheme with data wrapping. As expected, this empirical Bayes approach inflates the probability of dependence for every model, including when the data are generated using an independent model.

Gene expression network from measurements at single-cell resolution
The field of biology contains numerous examples where a large amount of data has been produced and adequate measures to detect dependence between variables are required.
Here we focus on an example from single-cell biology. Nowadays, the expression levels of thousands of genes can be jointly measured at single-cell resolution, which allows biologists to precisely study the functional relationships between genes. In Wills et al. (2013), the expression of 96 genes affected by Wnt signaling was measured in 288 single cells. The authors provided evidence that many of these transcriptomic associations are masked when expression is averaged over bulk sequencing on many cells. In their study, the authors investigated the relationships between these genes and constructed an expression network using the measurements at single-cell resolution. The expression network can illustrate potential functional relationships and dependencies that are interesting to elicit molecular pathways. The network is constructed by highlighting genes that have correlated or anti-correlated expression between cells using the Spearman correlation coefficient. We reproduce this network, detecting dependencies both with Spearman correlation and with our Bayesian procedure (see Figure 6, Top row). Our procedure detects many associations between genes that were not detected by simple correlation analysis: more than 250 pairs of genes have a probability of dependence higher than 0.95 and an absolute value of Spearman correlation lower than 0.5, whereas every link (except one) with absolute Spearman correlation higher than 0.5 has a probability of dependence larger than 0.95 (see Figure 6, Bottom row, Left). Some of the links that are only detected using our approach are between genes that are known to interact, such as APC and DVL2, AXIN1 and GSK3B, DVL2 and LRP6, or AXIN1 and DACT1 (see Figure 6, Bottom row). Other detected links remain to be investigated.

Figure 5: Distribution (over 500 independent runs) of the probability of dependence using our Bayesian nonparametric approach (Left) and our empirical Bayes approach, which maximises the marginal probability of dependence over the shifted partition scheme with data wrapping (Right). Data are generated under the 5 illustrative examples as well as an independent generative model where both x and y are vectors of i.i.d. samples from a normal distribution with mean 0 and standard deviation 1. Here, N = 150 and σ = 2.

Differential co-expression analysis
Networks have proved to be important representations of biological systems in which various molecules interact and functionally coordinate. A typical example is gene expression networks, such as the one described in the previous subsection, where nodes correspond to genes and edges represent interactions between genes. Interactions in biological networks can substantially change in response to different conditions. In particular, gene co-regulations may be altered with disease, and an interaction between two genes could be present in some conditions and not in others. Differential co-expression analysis consists in identifying which interactions in a gene expression network change from one condition to another (Hsu et al. 2015).
The main objective of differential co-expression analysis is to identify couples of genes (x, y) such that the strength of dependence between x and y changes in response to different conditions. Our Bayesian procedure is perfectly suited to this type of problem, which requires methods able to detect both dependencies and independencies. Non-Bayesian testing procedures for independence typically only provide p-values to identify when the null hypothesis (here, the independence hypothesis) can be rejected. To the contrary, our approach enables us to quantify the relative evidence of dependence vs independence. In particular, given the expression {x_i, y_i}_{i=1,...,n} of two genes in n cells under condition A and the expression {x̃_i, ỹ_i}_{i=1,...,ñ} of the same two genes in ñ cells under another condition B, we can calculate the probability of a change of interaction between these two genes in response to conditions A and B as

p_diff = p(M_1 | x, y) p(M_0 | x̃, ỹ) + p(M_0 | x, y) p(M_1 | x̃, ỹ).     (7)

In Curtis et al. (2012), a collection of around 2,000 breast cancer specimens from tumour banks in the UK and Canada is analysed and compared to a set of 144 normal cells. We propose to apply our algorithm to this gene expression dataset and make use of the probability in equation (7) in order to identify dysregulation in gene expression in response to breast cancer. We focus on comparing a subset of 997 tumour cells (called the discovery set in Curtis et al. (2012)) with the set of 144 normal cells; for each cell, the expression level of 48,803 probes is available. Following the approach proposed by Langfelder and Horvath (2007), we use the gene expression of the normal cells to identify 25 modules of correlated genes and determine so-called "eigengenes" that represent the expression of genes in each module. We used the implementation of this module detection and eigengene computation provided in the R package WGCNA (Langfelder and Horvath 2008). We represent in Figure 7 (Left) the eigengene expression of some selected modules under the two conditions. By computing the probability in (7) for the 300 pairs of modules, we identify interactions between modules that significantly change in response to breast cancer. We find that the probability p_diff is larger than 0.95 for 69 module interactions, represented in Figure 7 (Right). Among those 69 interactions, 49 were present in normal cells and vanished in tumour cells, whereas 13 of the interactions only appear in tumour cells.
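In code, the probability of a change of interaction is a one-liner. Since equation (7) is not reproduced verbatim above, the formula below is our assumption: the two conditions are treated as independent experiments, so a change means dependence under exactly one of them.

```python
def prob_dependence_change(p_dep_A, p_dep_B):
    """Probability that the dependence status of a gene pair differs between
    two conditions, given the posterior probabilities of dependence under each.
    Assumption (our reading of equation (7)): the experiments are independent,
    so a change means dependence under exactly one condition."""
    return p_dep_A * (1.0 - p_dep_B) + (1.0 - p_dep_A) * p_dep_B
```

The quantity is symmetric in the two conditions, equals one when dependence is certain under one condition and certainly absent under the other, and stays at one half when both posteriors are uninformative.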
The pathway enrichment analysis enables us to implicate each of the 25 modules in established biological cascades and clinically relevant pathways (see Table 1). The two modules with the highest degree in the differential co-expression network are: (a) the ALK1 signaling pathway, and (b) complexes associated with translation (e.g. the ribosome). For the former, this likely indicates loss of regulation and signaling cross-talk; for the latter, since ribosomes are essential and translational genes are often tightly regulated, this hints at widespread transcriptomic perturbation. In particular, the high degree of the ribosome/translation module indicates that, in the cancerous state, more modules are increasingly dysregulated and out of sync with the more tightly regulated translation-involved modules. Our analysis shows that both of these high-degree modules are disconnected from numerous other important pathways such as the ERK/MAPK pathway, or genes involved in immune signalling (such as antigen presentation). More generally, this demonstrates that overlaying biological and clinically relevant annotations on the differential co-expression network may be the basis for further research regarding transcriptomic alterations in breast tumours.

Discussion/Conclusion
We have presented a novel Bayesian nonparametric approach that quantifies a probabilistic measure of the strength of evidence for dependence between two samples against that of independence. The procedure is based on Pólya tree priors that facilitate an analytic expression for the Bayes factor even though the Pólya tree prior is defined over an infinite recursive partition. We have applied our approach to simulated datasets as well as applications in molecular biology, including single-cell gene-expression analysis and network analysis in cancer genetics.
The inference resulting from a Pólya tree model is known to strongly depend on the specification of the partition C over the data space. We have proposed an empirical method to select the partition centring by optimising the marginal likelihood in favour of dependence. This tends to inflate the evidence in favour of dependence, but can improve the ranking when testing over many pairs of variables.

Figure 7: (Left) Expression of eigengenes for different modules (modules 2, 3, 4, 10 and 13) in the 144 normal cells and in the 997 tumour cells. We observe that some modules are strongly interacting under one of the conditions (Normal or Breast Cancer) and not under the other. Each dot corresponds to the expression of two eigengenes in one cell. (Right) Differential co-expression network: nodes correspond to modules of genes; edges represent module interactions that significantly change between normal and cancer conditions. Red continuous edges correspond to interactions that are present in normal cells and vanished in tumour cells; blue dashed edges correspond to interactions that are only present in tumour cells.

Table 1: Pathway enrichment results, using three pathway databases (PID, KEGG, and Reactome), for the 25 modules identified. For each module, the most significant pathway was selected based on adjusted p-values. Ties were resolved by taking the smaller pathway (for pathways with large discrepancy in size), or the most significant by unadjusted p-values.
Our probabilistic measure has important advantages over other statistics: its interpretability in terms of a recursive partition of the data space, and its symmetry, unlike measures based on the Kullback-Leibler divergence. The explicit quantification of a probability allows it to be combined with other sources of information within a prior or a meta-analysis. As shown in Section 4.2, the Bayesian approach provides a unified method for detecting both independence and dependence, something that is not possible without a fully probabilistic framework. The Bayesian probabilistic approach also allows for the inclusion of substantive prior information on the plausibility of an association, which can be particularly useful for screening large biological data sets. There is also the possibility of embedding the model within a hierarchical structure, borrowing strength coherently across categories, something that is simply not possible for existing nonprobabilistic methods such as Mutual Information.
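For instance, prior information about the plausibility of an association enters through the prior model probability p(M1): the Bayes factor combines with the prior odds to give the posterior probability of dependence. A minimal Python sketch of this conversion (the function name is ours, for illustration only):

```python
from math import exp

def posterior_prob_dependence(log_bf, prior_m1=0.5):
    """Convert a log Bayes factor, log BF = log p(x,y|M1) - log[p(x|M0)p(y|M0)],
    into the posterior probability p(M1 | x, y) given a prior p(M1)."""
    prior_odds = prior_m1 / (1.0 - prior_m1)
    posterior_odds = exp(log_bf) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With equal prior weight on M0 and M1, log BF = 0 gives p(M1|x,y) = 0.5,
# the natural decision threshold; a sceptical prior such as p(M1) = 0.1
# shrinks the same evidence towards independence.
```

A sceptical prior of this kind is one way to control false positives when screening many candidate pairs in large biological data sets.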
Similarly to the M0 case, the marginal likelihood can be obtained by integrating out the random branching probabilities as follows:

p(x, y|C, A, M1) = ∫ p(x, y|Θ, C, A, M1) p(Θ|M1) dΘ = ∏_j B(a_j + ñ_j)/B(a_j),

where B(·) denotes the multinomial Beta function and ñ_j = (n_j0, n_j1, n_j2, n_j3) denotes the vector of counts of the data falling in the four children of partition element j. Expressing the Beta and the multinomial Beta functions in terms of Gamma functions, we obtain equation (6); in the third line of that derivation we use that Γ(t + 1) = tΓ(t) for all t.
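The M1 marginal likelihood, a product over partition elements of multinomial Beta ratios, is convenient to evaluate in log space via the log-Gamma function. A minimal Python sketch (function names are ours; the choice of the pseudo-counts a_j, e.g. growing with the partition depth, is left to the caller as in the main text):

```python
from math import lgamma

def log_multinomial_beta(a):
    """log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)."""
    return sum(lgamma(ai) for ai in a) - lgamma(sum(a))

def log_marginal_m1(counts, alphas):
    """Log marginal likelihood under M1: sum over partition elements j of
    log B(a_j + n_j) - log B(a_j), with n_j = (n_j0, n_j1, n_j2, n_j3)
    the counts of data in the four children of element j."""
    total = 0.0
    for a, n in zip(alphas, counts):
        a_post = [ai + ni for ai, ni in zip(a, n)]
        total += log_multinomial_beta(a_post) - log_multinomial_beta(a)
    return total

# Two points in the same quadrant are favoured over two points split
# across quadrants, reflecting evidence of co-clustering (dependence):
same = log_marginal_m1([[2, 0, 0, 0]], [[1, 1, 1, 1]])
split = log_marginal_m1([[1, 1, 0, 0]], [[1, 1, 1, 1]])
```

With a_j = (1, 1, 1, 1), a single observation has predictive probability 1/4 for each quadrant, and `same > split` above, since co-clustered data raise the marginal likelihood under M1.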

A2 Other approaches
The squared Pearson correlation is probably the most commonly used statistic to identify associations between two samples. This statistic accurately quantifies linear dependence but fails to detect dependencies when relationships are highly nonlinear. Another criterion, called mutual information (MI), arose from the field of information theory (Shannon and Weaver 1949; Cover and Thomas 1991). The original primary purpose of mutual information was to quantify the bits of information transmitted in a system. It has been used in a wide range of disciplines such as economics (Maasoumi 1993; Maasoumi and Racine 2002), wireless security (Bloch et al. 2008), pattern analysis (Peng et al. 2005), neurobiology (Pereda et al. 2005) and systems biology (Cheong et al. 2011; Liepe et al. 2013; Uda et al. 2013; Mc Mahon et al. 2014). Mutual information is by definition a similarity measure between the joint probability distribution of two variables and the product of the two marginal probability distributions; for the continuous univariate random variables {x, y} considered in this paper, MI is equivalent to the Kullback-Leibler divergence between the joint distribution and the product of the marginal densities,

MI = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy.
It is equal to 0 when the variables are independent and increases with the level of association between the two variables. Therefore, by extension, this measure has been employed to quantify dependencies between two variables. It is well known that the estimation of the mutual information between two samples is not straightforward. It is often based on approximations of the probability distributions (Paninski 2003; Cellucci et al. 2005; Khan et al. 2007), using either bin-based procedures, kernel densities (Moon et al. 1995), or k-nearest neighbours (Kraskov et al. 2004).
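As an illustration of the simplest bin-based procedure, the plug-in estimate replaces the densities in the integral above by a normalised 2-D histogram. A minimal Python sketch, assuming NumPy (the function name and bin count are ours; this estimator is known to be biased, which motivates the kernel and k-nearest-neighbour alternatives cited above):

```python
import numpy as np

def mutual_information_binned(x, y, bins=10):
    """Plug-in MI estimate (in nats) from a normalised 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    mask = pxy > 0                        # 0 log 0 = 0 by convention
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())
```

The estimate is always non-negative and, for independent samples, converges to a small positive value rather than exactly 0, one manifestation of the finite-sample bias discussed above.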
Another classical approach for detecting dependencies between two continuous univariate random variables consists in partitioning the sample space into bins and evaluating nonparametric test statistics on the binned data (Heller et al. 2014). This type of approach includes the well-known Hoeffding's test (Hoeffding 1948) as well as the maximal information coefficient (MIC) recently introduced by Reshef et al. (2011). In that paper, the authors propose to estimate the mutual information using a bin-based procedure on every grid drawn on the scatterplot of the two variables, up to a maximal grid resolution. The MIC is then proportional to the highest mutual information over these grids. This recent work has led to a series of discussions regarding the properties that adequate measures of dependence should fulfil (Gorfine et al. 2012; Simon and Tibshirani 2014; Kinney and Atwal 2014). More generally, dependence between multivariate random variables has commonly been modelled using copula transforms (Hoff 2007; Smith et al. 2012) or by introducing latent random variables (Petrone et al. 2009). For high-dimensional spaces, the problem of testing independence can also be formulated by embedding probability distributions into a reproducing kernel Hilbert space (Gretton and Györfi 2010). These papers provide innovative approaches to modelling dependence but do not provide a fully Bayesian nonparametric approach with an analytic Bayes factor for testing dependence vs independence, which is the focus of our paper.
In the following, we consider five frequentist dependence statistics: the Rsquared measure, the distance correlation (dCor) (Székely et al. 2009), Hoeffding's D (Hoeffding 1948), the mutual information estimated using the k-nearest-neighbour method (Kraskov et al. 2004), and the maximal information coefficient (MIC) (Reshef et al. 2011). We use the MATLAB implementation of these algorithms provided in the supplementary material of Kinney and Atwal (2014). The Rsquared measure, the distance correlation and Hoeffding's D detect higher associations between the variables for linear models, whereas MIC and MI detect the highest dependencies for the sinusoidal and the circular models respectively (see Figure A1). In addition to its probabilistic properties, a crucial advantage of our Bayesian approach compared to these five approaches is the interpretability of the measure of dependence. Indeed, whenever the dependence statistic is higher than 0.5 there is, by definition, more evidence in favour of the dependence hypothesis than of the independence hypothesis. In contrast, for all the frequentist methods, identifying a threshold above which one can claim that two samples are dependent is a challenging task, as opposed to claiming evidence against the null. Typically, these thresholds are either chosen heuristically or calibrated via the construction of "null datasets" (obtained by randomly permuting the indices of one of the two samples in order to destroy any potential dependence), defining the threshold as a quantile of the distribution of the dependence statistic on these "null" datasets.
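The permutation calibration described above can be sketched in a few lines of Python for any dependence statistic (the function names are ours; NumPy is assumed):

```python
import numpy as np

def permutation_threshold(x, y, statistic, n_perm=500, level=0.05, seed=0):
    """Estimate the (1 - level) quantile of a dependence statistic under
    the null of independence, by permuting one sample to destroy any
    dependence while preserving both marginal distributions."""
    rng = np.random.default_rng(seed)
    null_stats = [statistic(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.quantile(null_stats, 1.0 - level))

# Example statistic: the squared Pearson correlation.
def r2(x, y):
    return float(np.corrcoef(x, y)[0, 1] ** 2)
```

A sample pair whose statistic exceeds `permutation_threshold(x, y, r2)` would then be declared dependent at the chosen significance level; the Bayesian measure avoids this calibration step entirely, since its natural threshold is 0.5.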
It is difficult to identify a measure that would allow us to fairly compare our Bayesian approach to more traditional frequentist approaches. A traditional frequentist measure of statistical test performance consists in computing the power of the test, which measures the true positive rate (the percentage of times the method detects dependence) for a given significance level (i.e. false positive rate). Here, for each dependence test, we chose a significance level of 0.05; that is, we fixed the detection threshold to be equal to the 0.95 quantile of the "null distribution" estimated via 500 permutations. In Figure A2, we observe that the power of every method varies strongly from one generative model to another. The power of the empirical Bayes version of our procedure (denoted EPT in Figure A2) is comparable with that of the mutual information algorithm estimated using the 20 nearest neighbours. Further power analysis via ROC curves is shown in Figure A3. Using a threshold based on the "null distribution" is very atypical for Bayesian approaches, as there exists a natural threshold for the probabilistic measure, namely 0.5. Using this threshold, the true positive and false positive rates for different generative models are summarized in the following table.

Figure A3: ROC curves; N = 300, σ = 4 (bottom row). The colour scheme is similar to that of Figure A1, except that the red curve represents our Bayesian nonparametric approach and the dashed red curve our approach using an empirical Bayes approach maximizing the marginal probability over the shifted partition scheme with data wrapping. For clarity, the ROC curve for Rsquared is not shown.

Figure 3: (Top row) Illustration of the five test data sets and sampling distributions with N = 300 data points and noise parameter σ = 2. (Middle rows 2 and 3) Frequency distribution (over 500 independent runs) of the probability of the hypothesis M1 for each model, varying the number of data points (N) in row 2 and the level of noise (σ) in row 3. When varying the number of data points, the level of noise is fixed at σ = 2; when varying the level of noise, the number of data points is fixed at N = 300. The red curve represents the median while the light and dark grey areas designate the zone between the 5th and 95th percentiles and the inter-quartile region respectively. (Bottom row) Distribution (over 500 independent runs) of the contribution (B_k) of the first 5 levels in the Pólya tree. A negative B_k indicates evidence against independence. We set N = 150 and σ = 2.

Figure 4: (Left) Illustration of the independent model, sampling 300 data points from two independent standard normal distributions. (Right) Distribution (over 500 independent runs) of the probability of the hypothesis M0, varying the number of data points (N). The red curve represents the median whereas the light and dark grey areas designate the zone between the 5th and 95th percentiles and the inter-quartile region respectively.

Figure 5: Distribution (over 500 independent runs) of the probability of dependence using our Bayesian nonparametric approach (Left) and our empirical Bayes approach, which maximises the marginal probability of dependence over the shifted partition scheme with data wrapping (Right). Data are generated under the 5 illustrative examples as well as an independent generative model where both x and y are vectors of i.i.d. samples from a normal distribution with mean 0 and standard deviation 1. Here, N = 150 and σ = 2.

Figure 6: (Top row) Expression network constructed using correlation (Left), where links with absolute correlation larger than 0.5 are shown and darker edges represent links with absolute correlation larger than 0.7, and our nonparametric Bayesian procedure (Right), where links with probability of dependence larger than 0.99999 are shown and darker edges represent links with probabilities equal to 1. Genes are ordered clockwise according to the increasing number of detected links. (Bottom row, Left) Comparison of the probability of dependence computed following our approach and the absolute value of the Spearman correlation for every pair of genes. The 4 red circles indicate pairs of genes with known interactions which have an absolute correlation lower than 0.5 but a probability of dependence larger than 0.95. (Bottom row, Middle and Right) Examples of gene expression data for two genes with a known interaction.

Figure A1: Distribution (over 500 independent runs) of the dependence measures quantified by the Rsquared correlation (black), the dCor (light grey), the Hoeffding method (dark grey), the MIC (green) and the mutual information estimated with the k-nearest-neighbour method for different values of k (blue). Data are generated under the five illustrative examples as well as an independent generative model where both x and y are vectors of i.i.d. samples from a normal distribution with mean 0 and standard deviation 1. Here, N = 150 and σ = 2.

Figure A4: Impact of the parameter c when σ = 2. (Top) Median (over 500 runs) of the probability of the dependent model as a function of the number of data points (N) for different values of c. (Bottom) ROC curves for different values of c when N = 300. In both figure types, σ = 2 and the colour scheme represents different values of c ∈ {0.1, 1, 5, 10}, where c = 0.1 corresponds to the red line and c = 10 to the yellow line.