Building Hyper Dirichlet Processes for Graphical Models

Graphical models are used to describe the conditional independence relations in multivariate data. They have been used for a variety of problems, including log-linear models (Liu and Massam, 2006), network analysis (Holland and Leinhardt, 1981; Strauss and Ikeda, 1990; Wasserman and Pattison, 1996; Pattison and Wasserman, 1999; Robins et al., 1999), graphical Gaussian models (Roverato and Whittaker, 1998; Giudici and Green, 1999; Marrelec and Benali, 2006), and genetics (Dobra et al., 2004). A distribution that satisfies the conditional independence structure of a graph is Markov. A graphical model is a family of distributions that is restricted to be Markov with respect to a certain graph. In a Bayesian problem, one may specify a prior over the graphical model. Such a prior is called a hyper Markov law if the random marginals also satisfy the independence constraints. Previous work in this area includes Dempster (1972), Dawid and Lauritzen (1993), Giudici and Green (1999), and Letac and Massam (2007). We explore graphical models based on a non-parametric family of distributions, developed from Dirichlet processes.


Introduction
Markov distributions are multivariate measures that satisfy a specified set of conditional independence relations. Often, an undirected graph is useful to represent this structure. A measure is Markov with respect to a graph if whenever two variables have no edge between them, they are conditionally independent given the other variables in the graph. Dawid and Lauritzen (1993) extended this notion to the parameter space. In Bayesian statistics, the measure of the data is random, and therefore has its own distribution, called the prior. A prior law over Markov measures is hyper Markov if it gives probability one to Markov measures and the random marginal measures have the specified conditional independence structure. An example is the hyper inverse Wishart distribution, which serves as a prior for the covariance matrix of a multivariate Gaussian with known mean. The usual inverse Wishart is a specific case, which is hyper Markov for the saturated model.
Like all parametric models, the hyper inverse Wishart prior makes strong assumptions about the shape of the distribution. In many applications, such assumptions are undesirable. In contrast, non-parametric models make weak assumptions. Typical assumptions include continuity and the existence of some number of derivatives. For example, one may specify that the distribution is smooth, having derivatives of all orders. The current paper aims to apply the non-parametric approach to graphical models. We achieve this by following the framework laid out by Dawid and Lauritzen (1993). We begin with the Dirichlet process, a commonly used non-parametric prior law. We then describe how to build this family into a non-parametric hyper Markov prior.
As in Dawid and Lauritzen (1993), we restrict our attention to decomposable graphs. The benefit of this is that a decomposable graph can be easily built up from smaller components called cliques which intersect to form the entire graph. Dawid and Lauritzen begin by considering a base distribution for each clique. The only requirement is that these distributions agree where the cliques intersect. They weave the base distributions together by taking the base measure of the first clique as a marginal and conditioning the base measure of the second clique on their intersection. The third clique is added by conditioning its base measure on its intersection with the previous two cliques. This process is repeated until all the cliques have been combined. The end result is a Markov distribution whose marginal over each clique is the clique's base. For a prior on Markov distributions, Dawid and Lauritzen construct a hyper Markov law in the same way.
As an example of the Dawid and Lauritzen (1993) construction, consider the problem of estimating the covariance matrix of a multivariate Gaussian. If we believe that the data exhibit some conditional independence structure, this implies certain constraints on the covariance matrix. Speed and Kiiveri (1986) showed that the sufficient statistics are the component covariance matrices belonging to each clique. The inverse Wishart is the usual prior for the saturated model, which has no constraints on the covariance matrix. In a non-saturated model, the sub-matrix of each clique is unconstrained, except that the sub-matrices must agree where their indices intersect. For this reason, the inverse Wishart is the natural choice as the base measure for each clique. The sub-matrix for the first clique has an inverse Wishart prior. If the graph is connected and the cliques have a perfect ordering (see Section 2.1), then the first and second sub-matrices have some elements in common. Thus, the sub-matrix for the second clique is the inverse Wishart, conditional on knowing some of the elements. By repeating the conditioning for each clique, the hyper inverse Wishart is defined.
In the current paper, we apply this framework to non-parametric priors. Instead of the inverse Wishart, the Dirichlet process prior is the base measure for each clique. Following the analogy, we build the marginals into a hyper Markov prior, which we refer to as the hyper Dirichlet process. The Dirichlet process is a special case of tail-free processes (Ferguson, 1973). Dirichlet processes have been used for non-parametric priors in many areas, including block models (Bush and MacEachern, 1996), survival analysis (Susarla and Ryzin, 1976; Ghosh and Ramamoorthi, 1995; Kim and Lee, 2001), and non-stationary point processes (Pievatolo and Rotondi, 2000). These are all areas that could potentially use a hyper Dirichlet process in multidimensional problems. In Section 2, we explain notation and formalize some of the ideas presented so far. In Section 3, we describe the Dirichlet process and some previous results. In Section 4, we weave Dirichlet processes on the cliques to build the hyper Dirichlet process, and show that it is a hyper Markov prior. Finally, we explore applications for this framework in Section 5.

Notation and Setting
Throughout this paper we consider a graph, G, with vertex set V and edge set E. By convention, we assume that (γ, γ) ∈ E for all γ. We call such edges loops. There is no practical difference if loops are excluded from E, though some minor changes are required for certain definitions. If A ⊆ V, then G A is the subgraph of G over A. The subgraph G A has vertex set A, and edge set E A = (A × A) ∩ E. We say that A induces the subgraph G A . If E A = A × A, then G A is complete. A clique is a maximal complete set: a set A such that G A is complete and, for any superset B ⊃ A, G B is not complete. For example, if G itself is complete, then there is one clique, viz. V.
A k-path is a sequence (γ 0 , γ 1 , . . . , γ k ) such that (γ i , γ i+1 ) ∈ E for 0 ≤ i < k. If A and B are subsets of V, then a path between them is a path between any a ∈ A and any b ∈ B. A graph is connected if there exists a path between every pair of vertices. A third subset C ⊆ V is said to separate A and B if every path between them contains an element of C. A k-cycle is a path such that k ≥ 3, γ 0 = γ k , and the other elements are distinct. A graph is decomposable if every cycle of length greater than 3 has a chord, that is, an edge joining two non-consecutive vertices of the cycle. A decomposable graph admits a perfect ordering of its cliques.
Definition 1 Perfect Ordering. Suppose a graph G has n cliques. Let the cliques have an arbitrary ordering C 1 , . . . , C n . For each k, define the history H k = C 1 ∪ · · · ∪ C k , the separator S k = H k−1 ∩ C k , and the residual R k = C k \ H k−1 . The ordering of the cliques is a perfect ordering if for each 2 ≤ k ≤ n, there exists j k < k such that S k ⊆ C j k .
The sets H k are called the histories. The separators, S k , separate C k from the previous history. The sets R k are called the residuals, which represent the new nodes being added to the history. In a perfect ordering, each new clique is separated from the current set of nodes by a single one of the earlier cliques.
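The running intersection property in Definition 1 can be checked mechanically. The sketch below is not from the paper (the function name is our own); it tests whether a candidate ordering of cliques is perfect by verifying that each separator is contained in a single earlier clique.

```python
# Sketch (not from the paper): checking that an ordering of cliques is
# perfect, i.e. each separator S_k = H_{k-1} ∩ C_k is contained in a
# single earlier clique (the running intersection property).

def is_perfect_ordering(cliques):
    """cliques: list of sets C_1, ..., C_n in the candidate order."""
    history = set(cliques[0])
    for k in range(1, len(cliques)):
        separator = history & cliques[k]
        # Definition 1: some earlier clique C_{j_k} must contain S_k.
        if not any(separator <= cliques[j] for j in range(k)):
            return False
        history |= cliques[k]  # residual R_k = C_k \ H_{k-1} joins the history
    return True

# Two cliques {I, J} and {J, K} sharing the separator {J}.
print(is_perfect_ordering([{"I", "J"}, {"J", "K"}]))  # True
# The edges of a 4-cycle admit no perfect ordering in this order:
print(is_perfect_ordering([{1, 2}, {2, 3}, {3, 4}, {4, 1}]))  # False
```

The 4-cycle example fails because the final separator {1, 4} is split across two different earlier cliques, mirroring the fact that a chordless 4-cycle is not decomposable.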
For every γ ∈ V, X γ is a random variable taking values in the space (X γ , F γ ). In this sense, we consider V an index set of components of some random variable X = (X γ : γ ∈ V). We denote the range and σ-field of X by (X , F ) = (× γ∈V X γ , × γ∈V F γ ). Furthermore, we extend these definitions to subsets A ⊆ V, writing X A = (X γ : γ ∈ A) with range and σ-field (X A , F A ).
Let α be a finite measure over some X A ; then ᾱ = α/α(X A ). In other words, ᾱ is the probability measure proportional to α.
For convenience, if γ ∈ A, we may write θ γ for θ {γ} . If α and β are both measures on some space (X , F ), then we define their sum, α + β, by (α + β)(U ) = α(U ) + β(U ) for all U ∈ F . If x ∈ X , then the delta measure δ x is a point mass concentrated at x: δ x (U ) = 1 if x ∈ U , and δ x (U ) = 0 otherwise.

Graph Selection
For the remainder of the paper, we consider undirected graphs, which implies that (i, j) ∈ E if and only if (j, i) ∈ E. We also assume that the graph is connected and decomposable. An undirected graph depicts the conditional independence structure for some variable X in the following sense.
Definition 2 Markov Probability Measure. If θ is a probability measure on (X , F ), we say it is Markov on a decomposable graph, G, if X ∼ θ satisfies the conditional independences in G.
Example 1 Let G be the graph depicted in Figure 2.
Implicit in the definition is the fact that it is only sensible to refer to a measure as Markov in relation to a specific graph. For example, if the measure θ is not Markov on G in Example 1, it is still Markov on the saturated graph with V = {I, J, K}. All measures over X V are trivially Markov on the saturated graph since there are no constraints on conditional independence. Furthermore, if µ is a measure such that each X γ is independent, then it is Markov on any graph (with the appropriate vertex set). We denote the set of all distributions that are Markov on G by M (G).
It will be useful to keep Figure 1 in mind throughout this paper. While the graph technically has only three variables, it is representative of any connected graph of two cliques. Instead of one variable, imagine I, J, and K to contain multiple variables, with J being the variables that belong to both cliques. I is the set of variables in one clique but not the other, and K vice versa.
Let X ∼ P ∈ F be a random variable whose distribution is modeled by some family of probability distributions. In some applications, the focus is not on determining P , but on discovering the independence structure of X. A graph of this structure, G, denotes the belief that P is Markov with respect to G. Thus, it restricts the model to a sub-family, F G = F ∩ M (G). Graph selection is the problem of determining the smallest F G that contains P . The most prevalent examples are graphical Gaussian models. Graph selection for Gaussian models is often called covariance selection. In this setting, the relevant family is the set of p-variate Gaussian distributions. Denote this family N = {N p (µ, Σ) : µ ∈ R p , Σ ∈ M + p }, where M + p is the cone of real-valued, symmetric p × p matrices that are positive definite. Specifying a graph, G, translates to putting constraints on Σ. For example, if (x 1 , x 2 , x 3 ) is such that x 1 ⊥ ⊥ x 3 |x 2 , then σ 13 is no longer a free parameter, but is determined by the free parameters: σ 13 = σ 12 σ 23 /σ 22 . In general, denote the sub-family of Gaussian distributions Markov on G by N G . Let P G be the set of positive definite matrices such that K ij = 0 for all (i, j) ∉ E; let Q G be the image of P G under matrix inversion. Speed and Kiiveri (1986) showed that if N p (µ, Σ G ) is Markov with respect to G, then Σ G ∈ Q G . Thus N G = {N p (µ, Σ) : µ ∈ R p , Σ ∈ Q G }. The goal of covariance selection is to find the smallest Q G containing Σ, the population covariance matrix.
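To make the covariance constraint concrete, here is a small numerical sketch (illustrative numbers, not from the paper): the conditional independence x 1 ⊥⊥ x 3 | x 2 corresponds to a zero in the (1, 3) entry of the precision matrix K = Σ⁻¹, with σ 13 pinned down by the free parameters.

```python
import numpy as np

# Sketch: for a trivariate Gaussian with x1 ⊥⊥ x3 | x2, the (1,3) entry of
# the precision matrix K = Σ^{-1} is zero, which forces
# σ13 = σ12 σ23 / σ22.  (Numbers below are illustrative.)

s11, s22, s33 = 2.0, 1.0, 3.0
s12, s23 = 0.5, -0.4
s13 = s12 * s23 / s22          # the constrained, non-free parameter

Sigma = np.array([[s11, s12, s13],
                  [s12, s22, s23],
                  [s13, s23, s33]])
K = np.linalg.inv(Sigma)
print(np.isclose(K[0, 2], 0.0))   # True: missing edge (1,3) ⇒ K13 = 0
```

Conversely, perturbing σ 13 away from σ 12 σ 23 /σ 22 makes K 13 nonzero, so the matrix leaves Q G for this graph.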
Much progress has been made with graph selection for parametric models. Dawid and Lauritzen (1993) proved many results for decomposable graphical models, including multinomial and multivariate Gaussian problems. For example, they present the distribution of the restricted maximum likelihood estimate of Σ for the N G model with µ known. This distribution is called the hyper Wishart distribution since it reduces to the Wishart when G is complete. Letac and Massam (2007) have extended the hyper Wishart to a richer family of distributions on Q G and P G . Giudici and Green (1999) implemented a reversible jump Markov chain Monte Carlo algorithm for determining G.
The family of hyper inverse Wishart distributions is the subset of Markov distributions such that each clique marginal is inverse Wishart. Carvalho et al. (2007) provide an algorithm for generating random variables from this family. For decomposable models, the presence of a perfect ordering simplifies the process. For two cliques, the algorithm begins by generating an inverse Wishart variable on one clique. If the cliques overlap, this determines some of the parameters for the other clique. Therefore, one needs to generate a conditional inverse Wishart variable given those entries. For multiple cliques, one simply repeats this process. With a perfect ordering, the process is simplified because each new clique is conditioned on only one previous clique. Conditioning on multiple cliques can lead to moderate complications in the conditional distribution. Hence, decomposable models are computationally convenient.

The Dirichlet Process
The Dirichlet process (Ferguson, 1973) is a special case of tail-free distributions. It is a prior, meaning that it provides a distribution over the space of probability distributions on (X , F ). In this paper, we use the term law to refer to distributions over probability measures. However, this terminology is merely a convenience; the words "law" and "distribution" are typically interchangeable. The Dirichlet process is an example of a non-parametric law, which means that it cannot be specified by a finite-dimensional parameter. In this section, the Dirichlet process is introduced and some of its useful properties are given. This leads into the next section, in which we show how to compose a hyper Dirichlet process from multiple Dirichlet processes.
Definition 3 Dirichlet Process. Let A be any subset of V. Let α be a finite measure over (X A , F A ), and let θ be a random probability measure over the same space. We say that the distribution of θ is a Dirichlet process with base measure α, and write θ ∼ DP α , if (θ(A 1 ), . . . , θ(A k )) ∼ Dirichlet(α(A 1 ), . . . , α(A k )) whenever (A i ) k i=1 is a finite measurable partition of X A . This definition leads to some useful properties, described in the following theorem.
The first property states that if the random measure is integrated out, the marginal distribution of the data is ᾱ. This property ensures that a Markov base measure implies that the Dirichlet process, integrated over all possible θ, is a Markov distribution. This does not guarantee that the process is a hyper Markov law. That requires the stronger condition that θ ∼ DP α is a Markov distribution with probability one. The second property states that if a Dirichlet process is used as a prior measure, then the posterior measure is also a Dirichlet process, with an easily updated base measure. This fact helps determine which properties of a prior will persist in the posterior.
If the prior law of θ is a Dirichlet process, then the various marginal distributions of θ will also have a Dirichlet process law. This is expressed in the following theorem.
Theorem 5 Marginal of a Dirichlet Process. Let θ ∼ DP α be a random probability measure on (X A , F A ), and let B ⊆ A. Then the marginal θ B is a random probability measure on (X B , F B ), and θ B ∼ DP α B .
We proceed by showing how the Dirichlet process can be used as a nonparametric prior (see Ferguson (1973) for details). Let F be an unknown cumulative probability distribution that we wish to estimate. For simplicity, we consider a one-dimensional random variable. Let π = DP α be the prior law, and let the loss function be squared error loss. The Bayes risk is minimized by setting F (t) to E F (t), where the expectation is relative to the posterior distribution. If we observe data X 1 , X 2 , . . . , X n , then the posterior is DP α ′ , where α ′ = α + Σ i δ X i (see Theorem 7). The posterior distribution of P ((−∞, t]) is Beta(α ′ ((−∞, t]), α ′ ((t, ∞))). Therefore, the Bayes estimate can be written as a weighted sum of two estimates, F̂ (t) = (1 − w) ᾱ((−∞, t]) + w F n (t), where ᾱ((−∞, t]) is the prior estimate, F n (t) is the empirical cdf, and w = n(α(X ) + n) −1 is the weight of the data. This convex combination of a prior estimate and a frequentist estimate is common in Bayesian analysis. This shows the role of the base measure in the Dirichlet process: ᾱ is the prior guess about the shape of the unknown distribution, and α(X ) is mathematically equivalent to the prior sample size.
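The convex-combination form of the Bayes estimate can be sketched numerically. The choices below (a standard normal base distribution with precision ν = 5, data from N(1, 1)) are illustrative, not values from the paper.

```python
import numpy as np
from math import erf, sqrt

# Sketch of the Bayes estimate of a cdf under a DP(α) prior: a convex
# combination of the prior guess ᾱ((-∞, t]) and the empirical cdf, with
# data weight w = n / (α(X) + n).  Prior base: ν · N(0,1), ν = 5.

def prior_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard normal cdf

nu = 5.0                                      # prior "sample size" α(X)
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=100)            # observed data
n = len(x)
w = n / (nu + n)                              # weight of the data

def bayes_cdf(t):
    emp = np.mean(x <= t)                     # empirical cdf at t
    return (1 - w) * prior_cdf(t) + w * emp

print(round(bayes_cdf(1.0), 3))               # mostly follows the empirical cdf
```

With n = 100 and ν = 5, the data weight is w ≈ 0.95, so the estimate is dominated by the empirical cdf, as the text suggests for density estimation.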

Dirichlet Process as a Stick-Breaking Prior
A stick-breaking process is an almost surely discrete random probability measure, θ, that can be expressed as θ = Σ N k=1 w k δ Z k , where the Z k are independently distributed atoms from some distribution G, and Σ N k=1 w k = 1 almost surely. The number of atoms, N , may be finite or infinite. The weights are determined by successively breaking random pieces of a unit-length stick. Thus, w 1 = p 1 and w k = p k Π k−1 j=1 (1 − p j ) for k > 1. Traditionally, stick-breaking measures are defined such that p k is a Beta(a k , b k ) random variable for 1 ≤ k < N . Thus, a stick-breaking measure is specified by a probability distribution G, and a countable sequence of Beta parameters (a k , b k ) N −1 k=1 . Sethuraman (1994) showed that a Dirichlet Process is a stick-breaking measure with Z k ∼ ᾱ, and (a k , b k ) = (1, α(X )) for all k ∈ N. This relationship leads to an alternative definition of the Dirichlet process.
Definition 6 Dirichlet Process (alternate definition). Let A be any subset of V. Let G be a probability measure on (X A , F A ), and let θ be a random probability measure over the same space. For ν > 0, we say that the distribution of θ is a Dirichlet process with base distribution (or measure) G and precision ν, and write θ ∼ DP (νG), if (θ(A 1 ), . . . , θ(A k )) ∼ Dirichlet(νG(A 1 ), . . . , νG(A k )) whenever (A i ) k i=1 is a finite measurable partition of X A . Note that this definition is equivalent to Definition 3 by letting α = νG. In particular, ν is equivalent to the prior sample size, and G is equivalent to the prior mean. In this definition, ν and G are easily translated as the parameters of a stick-breaking measure. That is, the random atoms are iid G, and p k ∼ Beta(1, ν) for all k ∈ N. Because the stick-breaking representation is useful for many of the theorems we prove, Definition 6 will be the definition of choice for much of the current paper.
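A minimal sketch of the stick-breaking construction in Definition 6, truncated at N atoms; the base distribution G = N(0, 1) and precision ν = 2 are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of Sethuraman's stick-breaking construction of θ ~ DP(νG),
# truncated at N atoms: p_k ~ Beta(1, ν), w_k = p_k ∏_{j<k} (1 - p_j),
# atoms Z_k drawn iid from G.

rng = np.random.default_rng(1)
nu, N = 2.0, 1000                      # precision and truncation level

p = rng.beta(1.0, nu, size=N)          # stick-breaking fractions
w = p * np.concatenate(([1.0], np.cumprod(1 - p)[:-1]))  # weights
Z = rng.normal(0.0, 1.0, size=N)       # atoms drawn iid from G

theta = dict(zip(Z, w))                # a (truncated) discrete random measure
print(round(float(w.sum()), 4))        # ≈ 1 for large N
```

The leftover stick mass after N breaks is ∏(1 − p_j), which vanishes geometrically fast, so a moderate truncation level already gives weights summing to nearly one.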
The previous theorems regarding Dirichlet processes can be expressed using νG notation. For example, we rewrite Theorem 4 regarding the posterior Dirichlet process: if θ ∼ DP (νG) and X 1 , . . . , X n are iid observations from θ, then the posterior law of θ is DP (ν ′ G ′ ), where ν ′ = ν + n and G ′ = (νG + Σ i δ X i )/(ν + n).
In the following section we introduce the hyper Dirichlet process and show that it is an example of a stick-breaking measure. We then use Equation 8 to prove some of its properties. While we focus on the hyper Dirichlet Process for simplicity and concreteness, many of the results apply to other stick-breaking processes as well.

The Hyper Dirichlet Process
Consider a multivariate variable X with distribution θ. Suppose that we know little about θ, other than that it is Markov on some decomposable graph, G. In this case we may wish to specify a non-parametric prior for θ. In particular, we focus on the Dirichlet process. There are two main difficulties with this approach. The first is the elicitation of a proper base measure. The second is ensuring that the Dirichlet process gives probability one to M (G). Both concerns are addressed by using a framework that we dub the hyper Dirichlet process.
To define a hyper Dirichlet process, we begin by eliciting a base measure for each clique in G. Hopefully, this is simpler than eliciting a base measure for the entire graph at once. These base measures are combined to form a base measure over the entire graph. We define these combinations in a way which ensures that the support of the process lies within the set of Markov distributions on G. In the remainder of this section, we provide the details of this method and show that it satisfies the Markov property. Dawid and Lauritzen (1993) show that if two subsets of V are each endowed with a marginal probability measure, then there is a logical choice for their joint distribution, provided the marginals satisfy a consistency condition.

Markov Combinations of Probability Measures
Definition 8 Consistency (of probability measures). Suppose A, B ⊆ V. Let µ and λ be probability measures on (X A , F A ) and (X B , F B ), respectively. We say that µ and λ are consistent if they induce the same marginal over X A∩B .
Note that µ and λ are consistent only if µ A∩B = λ A∩B .
Theorem 9 Suppose µ on (X A , F A ) and λ on (X B , F B ) are consistent probability measures, with A, B ⊆ V. There exists an almost-everywhere unique distribution, α, on X A∪B such that: (i) α A = µ; (ii) α B = λ; and (iii) under α, X A\B ⊥ ⊥ X B\A | X A∩B .
Proof: Construct α such that its marginal over X A is µ, so that condition (i) is satisfied. Specify its conditional distributions over X B given X A to be the same as the conditional distributions of λ given X A∩B . This ensures that (iii) holds as well. Let C = A ∩ B and B ′ = B \ A. Then for any U ∈ F B ′ and V ∈ F C , α(X B ′ ∈ U, X C ∈ V ) = ∫ V α B ′ |C (U | x C ) µ C (dx C ) = ∫ V λ B ′ |C (U | x C ) µ C (dx C ) = ∫ V λ B ′ |C (U | x C ) λ C (dx C ) = λ(X B ′ ∈ U, X C ∈ V ). The second equation follows from the construction of α. The third equation is ensured since µ and λ are consistent. Hence, condition (ii) is also satisfied. Furthermore, the conditional distributions are unique, except over some subset of X C with zero measure under λ, and hence also under µ by consistency. Therefore, this construction gives (a version of) the unique distribution satisfying the conditions.
Definition 10 Markov Combination (of probability measures). Let µ and λ be as in Theorem 9. We call the unique distribution satisfying (i)-(iii) the Markov Combination of µ and λ, and denote it by µ ⋆ λ.
Now suppose G has a perfect ordering of cliques (C 1 , C 2 , . . . , C k ), and that each clique C i is imbued with a marginal probability distribution P i . Further suppose that P i and P j are consistent for all i, j. Each clique is consistent with the previous history regarding the separator, since the separator is contained by a single previous clique. Using the idea of a Markov combination iteratively, we stitch together a distribution that is Markov on G and has the given marginals. Define G 1 = P 1 , and G i = G i−1 ⋆ P i for i ≥ 2. Dawid and Lauritzen show that G = G k is the unique Markov distribution satisfying G C i = P i for each i. We call G the Markov combination of P 1 , . . . , P k . In general, we may write ⋆(Q 1 , . . . , Q k ) to indicate a Markov combination with the understanding that the cliques are perfectly ordered and Q 1 , . . . , Q k are pairwise consistent.
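For discrete distributions, the Markov combination reduces to µ(i, j) · λ(k | j). The sketch below (illustrative tables, our own notation, not from the paper) builds µ ⋆ λ for a two-clique graph and checks that both clique marginals are preserved.

```python
# Sketch: the Markov combination μ ⋆ λ of two consistent discrete
# probability measures, μ on (I, J) and λ on (J, K), built as
# (μ ⋆ λ)(i, j, k) = μ(i, j) · λ(k | j).  Tables are illustrative.

mu = {('i0','j0'): 0.3, ('i0','j1'): 0.2, ('i1','j0'): 0.1, ('i1','j1'): 0.4}
lam = {('j0','k0'): 0.1, ('j0','k1'): 0.3, ('j1','k0'): 0.6, ('j1','k1'): 0.0}

# consistency: both must induce the same marginal over J
mu_J = {j: sum(p for (i, jj), p in mu.items() if jj == j) for j in ('j0','j1')}
lam_J = {j: sum(p for (jj, k), p in lam.items() if jj == j) for j in ('j0','j1')}
assert all(abs(mu_J[j] - lam_J[j]) < 1e-12 for j in mu_J)

star = {(i, j, k): mu[(i, j)] * lam[(j, k)] / lam_J[j]
        for (i, j) in mu for (jj, k) in lam if jj == j}

# the Markov combination preserves the prescribed clique marginals:
lam_check = {(j, k): sum(p for (i, jj, kk), p in star.items()
                         if (jj, kk) == (j, k)) for (j, k) in lam}
print(all(abs(lam_check[jk] - lam[jk]) < 1e-12 for jk in lam))  # True
```

Conditioning on J then renders I and K independent by construction, which is exactly the Markov property for the two-clique graph.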

Markov Combinations of Finite Measures
Using Markov combinations, we are able to take probability distributions on the cliques and build a distribution over the entire graph. The base measure of a Dirichlet process, however, is not necessarily a probability distribution. Therefore, we proceed by extending Markov combinations to finite measures in general. For probability measures, we required the marginals over X A∩B to be the same. We simply extend this definition to any finite measure.
Definition 11 Consistency of Finite Measures. Let µ be a finite measure over (X A , F A ) and λ be a finite measure over (X B , F B ). We say that µ and λ are consistent if they induce the same marginal measure over A ∩ B. That is, µ and λ are consistent if µ A∩B = λ A∩B . (10) Recall that µ̄ is the probability measure proportional to µ. Equation 10 holds if and only if the following two conditions are satisfied: 1. µ̄ and λ̄ are consistent. 2. µ(X A ) = λ(X B ).
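The two conditions can be checked directly for discrete finite measures. The following sketch (illustrative tables, our own function names) separates the "same shape" check on the normalized measures from the "same mass" check.

```python
# Sketch: for finite (not necessarily probability) measures, Definition 11's
# consistency splits into two checks: the normalized measures μ̄ and λ̄ agree
# on the overlap, and the total masses match.  Tables are illustrative.

def total(m):
    return sum(m.values())

def marginal_J(m, j_index):
    out = {}
    for key, p in m.items():
        out[key[j_index]] = out.get(key[j_index], 0.0) + p
    return out

def consistent(mu, lam, j_mu=1, j_lam=0):
    mu_J = marginal_J(mu, j_mu)          # marginal of μ over A ∩ B
    lam_J = marginal_J(lam, j_lam)       # marginal of λ over A ∩ B
    same_shape = all(abs(mu_J[j] / total(mu) - lam_J[j] / total(lam)) < 1e-12
                     for j in mu_J)      # condition 1: μ̄ and λ̄ consistent
    same_mass = abs(total(mu) - total(lam)) < 1e-12   # condition 2
    return same_shape, same_mass

mu = {('i0','j0'): 0.6, ('i1','j1'): 1.4}     # mass 2.0 on (I, J)
lam = {('j0','k0'): 0.9, ('j1','k0'): 2.1}    # mass 3.0 on (J, K)
print(consistent(mu, lam))   # (True, False): proportional, different mass
```

This is the situation discussed below: the shapes agree but the precisions differ, so one can rescale to a common mass when only the shape of the prior matters.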
Consider these two conditions in the context of base measures for Dirichlet processes. µ is the prior guess about the probability distribution of X A , and λ is the prior guess for X B . The first condition therefore states that the priors must agree about the distribution of X A∩B . It is reasonable to require that our prior is coherent in this way. The second condition states that the prior sample sizes for both sets of variables must be equal. This restraint is perhaps less desirable. It would be perfectly logical to be more certain about some dimensions than others. Unfortunately, any measure α on X A∪B must satisfy α A (X A ) = α B (X B ) = α(X A∪B ), (11) so if µ(X A ) ≠ λ(X B ), there is no measure α on X A∪B satisfying α A = µ and α B = λ. In some situations, this problem is not too severe. Using the alternative definition, we express µ = ν 1 G 1 and λ = ν 2 G 2 . The consistency conditions translate to G 1 = G 2 and ν 1 = ν 2 . If only the second condition fails, then it is still possible to find G = G 1 ⋆ G 2 . Employing the stick-breaking construction, we can generate random atoms from G. The problem lies in assigning weights to each atom. However, we can take solace in a mitigating factor. For density estimation, the value of the prior precision (ν) is typically small compared to the sample size. Hence, it may be reasonable to simply scale the precisions so that ν 1 and ν 2 are equal. For these applications, it is only important that G 1 and G 2 are consistent. If so, the base measures µ and λ only need to be proportional to each other over A ∩ B.
There may be other situations in which scale is important. Unfortunately, as Equation 11 shows, we cannot find a suitable base measure for the prior that matches both µ and λ. Without a suitable prior, there can be no suitable posterior. If the goal is to estimate a distribution and there is genuine concern about the precision of the prior estimate, then both conditions must be satisfied. That is, µ and λ must be consistent. Equivalently, G 1 and G 2 must be consistent and ν 1 must be equal to ν 2 . Cases in which one or both conditions fail are explored more fully in Appendix A.
Subsequently, we assume that both consistency conditions are satisfied. This leads to a natural extension of the previous work. We have equated consistency of base measures with consistency of probability measures. Thus, we generalize Markov combinations to consistent finite measures by normalizing them to probability measures, finding the Markov combination of the normalized measures, and rescaling the result.
Definition 12 Markov Combination of Finite Measures. Let µ be a finite measure on (X A , F A ). Let λ be a finite measure on (X B , F B ) that is consistent with µ. The Markov combination of µ and λ is denoted µ ⋆ λ, where µ ⋆ λ = µ(X A ) (µ̄ ⋆ λ̄), and µ̄ ⋆ λ̄ is the almost-everywhere unique probability distribution satisfying Theorem 9.
This definition is a generalization of Definition 10 for probability measures. Note that the Markov combination defined in this way is unique almost everywhere, since µ̄ ⋆ λ̄ is unique almost everywhere. It is easy to show that the normalization and ⋆ operations commute; that is, the normalization of µ ⋆ λ equals µ̄ ⋆ λ̄.

Constructing the Hyper Dirichlet Process
We now apply the idea of Markov combinations to component Dirichlet processes. To do so, we simply form the Markov combination of the base measures.
Theorem 14 Let H 1 , . . . , H k be pairwise consistent base distributions on the cliques of G, let H = ⋆(H 1 , . . . , H k ) be their Markov combination, and let θ ∼ DP (νH) be a random probability measure. Then θ C i ∼ DP (νH i ) for each clique C i . Proof: The proposition follows from Theorem 5.
Note that the non-parametric approach is actually simpler than the parametric approach in one sense. The hyper inverse Wishart is a generalization of the inverse Wishart to incomplete graphs. However, the distribution of θ in Theorem 14 actually is a Dirichlet process with a Markov base measure. Therefore, previous results regarding the Dirichlet process also apply to the hyper Dirichlet process. Most importantly, we know that the prior law is a stick-breaking prior.
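Because the hyper Dirichlet process is an ordinary Dirichlet process with a Markov base measure, it can be sampled with the standard stick-breaking recipe, drawing each atom from the Markov combination of the clique bases. The Gaussian clique distributions below are illustrative stand-ins, not distributions from the paper.

```python
import numpy as np

# Sketch: sampling a (truncated) hyper Dirichlet process on the two-clique
# graph over (I, J, K).  Atoms are drawn iid from the Markov combination
# H = μ ⋆ λ by sampling (x_I, x_J) ~ μ and then x_K ~ λ(· | x_J); weights
# come from stick-breaking.  All distributions here are illustrative.

rng = np.random.default_rng(2)
nu, N = 2.0, 500

p = rng.beta(1.0, nu, size=N)
w = p * np.concatenate(([1.0], np.cumprod(1 - p)[:-1]))

x_J = rng.normal(0.0, 1.0, size=N)            # shared separator component
x_I = x_J + rng.normal(0.0, 1.0, size=N)      # clique-1 part: μ(x_I | x_J)
x_K = -x_J + rng.normal(0.0, 1.0, size=N)     # clique-2 part: λ(x_K | x_J)
atoms = np.column_stack([x_I, x_J, x_K])

# Each atom satisfies x_I ⊥⊥ x_K | x_J by construction, so the random
# measure θ = Σ w_i δ_{atoms_i} has a Markov base measure.
print(atoms.shape)
```

Marginalizing the resulting atoms to either clique recovers a stick-breaking measure with that clique's base, matching Theorem 14.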
The hyper inverse Wishart is so-called because it is an example of a hyper Markov prior (Dawid and Lauritzen, 1993).
Definition 15 Hyper Markov. Consider an undirected graph, G. Let θ ∼ L be a random probability measure over X . We say that L is (weak) hyper Markov on G if L is concentrated on M (G), and θ A ⊥ ⊥ θ B |θ C whenever C separates A and B.
It is tempting to refer to the distribution of θ in Theorem 14 as a hyper Dirichlet process; however, the given conditions are not sufficient to ensure that the process is hyper Markov. The next task is to discover the appropriate conditions under which the hyper Markov property holds. Let L = DP (νH) be a Dirichlet process law. Let G be any graph consisting of two cliques, A and B, with separator C. Using the stick-breaking construction, let w = (w i : i ∈ N) be the random weights and Z = (Z i : i ∈ N) be the atoms, which are iid observations from H. We use Z iΓ to denote the components of Z i belonging to a set Γ. For example, the marginal of θ over A is θ A = i∈N w i δ ZiA .
Obviously, one condition for hyper Markovity is that H is a Markov measure. This is a necessary condition, but it is not sufficient, because knowledge of θ B contains information about the distribution of weights at each atom. We must ensure that θ C contains the same information. To see this, consider an example for which H C is a point mass. For H ∈ M (G), this implies that H A ⊥ ⊥ H B . Further suppose that H A and H B are not point masses. In this case, θ B implies certain constraints on w. For example, if the Z iB are distinct, then the mass at each atom determines the random weights modulo permutation. Therefore, the second condition for L to be hyper Markov is that θ B contains no information about w that is not contained by θ C . To begin, we use the condition expressed in the next theorem. The condition is sufficient, but more restrictive than necessary.
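For a discrete base measure, the sufficient condition formalized in Theorem 16 below (the refinement condition) amounts to checking that each atom of H C pins down the remaining components of the clique; a hypothetical sketch (our own table layout, not from the paper):

```python
# Sketch: checking the refinement condition for a discrete base measure H
# on B with separator component C.  The condition requires H_{B|C}(· | c)
# to concentrate at a single point B(c) whenever H_C(c) > 0.

def satisfies_refinement(H):
    """H: dict mapping (b_extra, c) pairs to positive mass."""
    support_per_c = {}
    for (b_extra, c), mass in H.items():
        if mass > 0:
            support_per_c.setdefault(c, set()).add(b_extra)
    # degenerate conditional ⇔ exactly one b_extra per atom c
    return all(len(s) == 1 for s in support_per_c.values())

H_good = {('b0', 'c0'): 0.5, ('b1', 'c1'): 0.5}          # B(c) well defined
H_bad = {('b0', 'c0'): 0.25, ('b1', 'c0'): 0.25,         # two atoms share c0
         ('b2', 'c1'): 0.5}
print(satisfies_refinement(H_good), satisfies_refinement(H_bad))  # True False
```

In the failing case, two draws can share the same separator value c0 while disagreeing on the rest of the clique, which is exactly the failure mode ruled out in the proof.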
Theorem 16 Let H be a base measure on X A∪B . Let C = A ∩ B. Set L = DP (νH) for some ν > 0. Then L is hyper Markov on X A∪B if the following conditions hold: 1. H is a Markov measure. 2. Refinement Condition: if Z 1 and Z 2 are independent draws from H̄, then Z 1C = Z 2C implies Z 1B = Z 2B , almost surely. Since C ⊆ B, the reverse implication is automatic; in other words, the refinement condition can be expressed equivalently as an "if and only if" statement: Z 1B = Z 2B if and only if Z 1C = Z 2C , almost surely.
Proof: Consider θ ∼ L. The hyper Markov property has two conditions: that θ is concentrated on M (G), and that θ A ⊥ ⊥ θ B |θ C . The first condition follows from the refinement condition. Using the stick-breaking representation, we write the distribution of X A |X B .
Therefore, θ ∈ M (G). It remains to show that θ A ⊥ ⊥ θ B |θ C . We begin by writing the marginals of θ using the stick-breaking representation. Let Γ be any subset of V.
Let Z * Γ = {Z iΓ } be the set of unique values occurring among the random atoms. We refer to an element of this set using an arbitrary index, Z * iΓ . Let m * iΓ be the total mass at that atom; m * iΓ = Σ j:Z jΓ =Z * iΓ w j . Note that Z * Γ is the support of θ Γ , and m * Γ is the mass at each point in the support. Thus, there is a bijection between the measure θ Γ and the set { Z * Γ , m * Γ }. That is to say, both are completely identified if at least one is known. The immediate result is that conditioning on one (or both) is equivalent to conditioning on the other (or one of them).
Continue by partitioning the support into two sets, Z * Γ = Z + Γ ∪ Z 0 Γ .
In other words, Z + Γ is the set of support points with strictly positive mass and Z 0 Γ is the set of points that are in the support but have probability zero. Again, we specify a particular element in either set with an arbitrary index, e.g. Z + iΓ . Partition m * Γ in the same way; this yields m + Γ , where we stipulate that the index is consistent with Z + Γ so that m + iΓ = θ Γ (Z + iΓ ). Denote the other set in this partition by m 0 Γ . Separate the sum in Equation 19 using this partition.
where N · Γ = | Z · Γ |. Note that Z + Γ has a degenerate distribution: if H Γ (γ) > 0, then with probability 1, γ will occur infinitely often in Z Γ . Therefore, Z + Γ = {z Γ : H Γ (z Γ ) > 0} almost surely. Since H is known, the sets of summation in Equation 22 are fully identified by { Z * Γ , m * Γ }. It follows that conditioning on θ Γ is equivalent to conditioning on the quartet { Z + Γ , m + Γ , Z 0 Γ , m 0 Γ }. We will now show that under the refinement condition, Z + B , m + B , and m 0 B are fully identified from { Z + C , m + C , Z 0 C , m 0 C }. Since the refinement condition matches the atoms of θ B with those of θ C , the mass at corresponding atoms is equal: m + iC = m + iB . A similar equation shows m 0 iC = m 0 iB . Therefore, ( m + C , m 0 C ) = ( m + B , m 0 B ). We now show that Z + B is a function of Z + C . This fact ensures that Z + B is fully identified by Z + C and therefore is conditionally independent of anything given Z + C . One consequence of the refinement condition is that if H C (c) > 0, then there exists B(c) such that H B|C (B(c)|c) = 1. This follows from a simple proof by contradiction. If H B|C (·|c) is not a point distribution, then either every point has probability 0 (as in a continuous distribution), or there is some point with positive probability strictly less than 1. We will see that neither of these can be true and conclude that the conditional is indeed a point distribution.
Suppose $H_{B|C}(\cdot|c)$ assigns measure zero to every point. With probability $H_C(c)^2 > 0$, the event $Z_{1C} = c = Z_{2C}$ occurs, while $Z_{1B} \neq Z_{2B}$ almost surely. Therefore, the refinement condition fails with probability at least $H_C(c)^2 > 0$. Now suppose there exists $b$ such that $0 < H_{B|C}(b|c) < 1$. Then with probability $H_C(c)^2\,H_{B|C}(b|c)\,(1 - H_{B|C}(b|c)) > 0$, the events $Z_{1C} = c = Z_{2C}$ and $Z_{1B} = b \neq Z_{2B}$ occur. Thus, the refinement condition fails with positive probability. By these two contradictions, we see that $H_{B|C}(\cdot|c)$ must be a point distribution whenever $H_C(c) > 0$. We denote the point of concentration by $B(c)$. Clearly, $c \in Z^+_C$ implies that $B(c) \in Z^+_B$. Conversely, every element of $Z^+_B$ equals $B(c)$ for some $c \in Z^+_C$: since $C \subseteq B$, $H_B(b) > 0$ implies $H_C(b_C) \geq H_B(b) > 0$, so $b_C \in Z^+_C$ and $b = B(b_C)$. We have shown that conditioning on $\theta_\Gamma$ is equivalent to conditioning on $\{Z^+_\Gamma, m^+_\Gamma, Z^0_\Gamma, m^0_\Gamma\}$. This provides an equivalent condition for the independence property that we want to show: $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$ holds if and only if the corresponding quartets for $A$ and $B$ are conditionally independent given the quartet for $C$. The remainder of this proof will show that the second property holds under the conditions of the theorem.
Begin by partitioning the atoms and weights according to whether the $C$-component has positive mass under $H_C$: let $(\hat Z, \hat w)$ collect the pairs $(Z_i, w_i)$ with $H_C(Z_{iC}) > 0$, and let $(\tilde Z, \tilde w)$ collect those with $H_C(Z_{iC}) = 0$.
As usual, for $\Gamma \subseteq V$, let $\hat Z_\Gamma$ and $\tilde Z_\Gamma$ denote the restrictions of these elements to the components in $\Gamma$. This partition is similar to, but different from, the partition defined earlier: $(\hat Z_\Gamma, \tilde Z_\Gamma)$ depends on $H_C$, whereas $(Z^+_\Gamma, Z^0_\Gamma)$ depends on $H_\Gamma$. The goal, as above, is to rewrite $Z_A$ by partitioning it in a way that preserves the conditional independence structure. This structure is preserved if the partitioning function is non-random; in other words, the atoms must be partitioned based on a known event. When conditioning on $\theta_B$, both $\theta_B$ and $\theta_C$ are known, but $\theta_A$ is unknown. Therefore, $(\hat Z_A, \tilde Z_A)$ provides an observable partition of $Z_A$.
Note that $\tilde Z_C$ is equivalent to $Z^0_C$ by definition, and $\tilde Z_B = Z^0_B$ by the refinement condition. We proceed by showing that $\tilde w$, $\tilde Z_A$, $\hat w$, and $\hat Z_A$ are jointly independent of $Z^0_B$ given $\{Z^+_C, m^+_C, Z^0_C, m^0_C\}$. We can express $m^+_C$ as a function of $\hat w$, $\hat Z$, and $Z^+_C$, where $m^+_{iC} \overset{a.s.}{=} \sum_{j : \hat Z_{jC} = Z^+_{iC}} \hat w_j$.
Furthermore, we have noted that $m^0_C = \tilde w$. By the stick-breaking construction, $\tilde Z \perp\!\!\!\perp (\tilde w, \hat w, \hat Z)$. Since $Z^+_C$ is known almost surely, it can be included in the independence property. $\tilde Z$ is also independent of any function of the right-hand side of Equation 26, in particular of $m^+_C$; repeating this argument on the left-hand side of Equation 26 gives the corresponding independence there. Since all three of these are jointly independent of $(\hat Z_A, \hat w)$ and $(m^+_C, m^0_C, Z^+_C)$, it follows that $(\hat Z_A, \hat w, \tilde Z_A, \tilde w) \perp\!\!\!\perp Z^0_B \mid \{Z^+_C, m^+_C, Z^0_C, m^0_C\}$. Recall from Equation 24 that $\theta_A$ is a function of $(\hat Z_A, \hat w, \tilde Z_A, \tilde w)$: a discrete distribution is determined by its set of atoms and the probability mass at each atom. Hence, by the above argument, $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$, and we conclude that $L$ is hyper Markov.
Theorem 16 provides sufficient conditions for a Dirichlet process to be hyper Markov; under those conditions, we may safely call the Dirichlet process a hyper Dirichlet process. When $H$ satisfies the refinement condition, we will say that $B$ is a refinement of $C$ under sampling, almost surely under the measure $H$. It is a refinement in the following sense. Let $X_1, X_2, \ldots$ be an infinite iid sample from $H$. Form a partition of the natural numbers such that $i$ and $j$ are elements of the same set if and only if $X_{iC} = X_{jC}$. Call this partition $X(C)$, and define $X(B)$ by analogy. Under the refinement condition, $X(B)$ is almost surely a refinement of $X(C)$. We denote this relationship by $X(B) \preceq X(C)$ a.s. $[H]$, omitting $H$ if the measure is contextually evident.
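The sampling characterization above can be checked on finite samples. The following sketch is our illustration, not from the paper, and the function names are hypothetical: it builds the partitions $X(C)$ and $X(B)$ from a list of samples and tests block containment. The refinement condition says that agreement on the $C$-coordinates forces agreement on the $B$-coordinates, so every block defined by $C$ must sit inside a block defined by $B$.

```python
def partition_by(samples, coords):
    """Partition sample indices: i and j share a block iff the samples
    agree on every coordinate in `coords` (the partition X(coords))."""
    blocks = {}
    for i, x in enumerate(samples):
        key = tuple(x[c] for c in coords)
        blocks.setdefault(key, set()).add(i)
    return list(blocks.values())

def blocks_contained(p, q):
    """True if every block of partition p lies inside some block of q."""
    return all(any(block <= other for other in q) for block in p)

# Coordinate 1 is a function of coordinate 0, mimicking a degenerate
# conditional H_{B|C}: agreement on C forces agreement on B.
samples_ok = [(0, 'a'), (0, 'a'), (1, 'b'), (1, 'b')]
assert blocks_contained(partition_by(samples_ok, [0]),
                        partition_by(samples_ok, [0, 1]))

# When H_{B|C} is not a point mass, the containment can fail.
samples_bad = [(0, 'a'), (0, 'b')]
assert not blocks_contained(partition_by(samples_bad, [0]),
                            partition_by(samples_bad, [0, 1]))
```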
The refinement condition, as stated in Theorem 16, is sufficient but stronger than necessary. By the symmetry of conditional independence, $\theta_B \perp\!\!\!\perp \theta_A \mid \theta_C$, even though no refinement condition is imposed between $C$ and $A$. It may be that at least one of the two refinements is necessary, but this has not been explored.
The hyper Dirichlet process defined on two cliques is an example of a hyper Markov combination, which is the analog of Markov combinations for prior laws. Consider two laws: $Q$ for $\theta_A$ and $R$ for $\theta_B$. We say that $Q$ and $R$ are hyperconsistent if the marginal laws for $\theta_{A \cap B}$ are equal. Under this condition, Dawid and Lauritzen (1993) show that there is a unique hyper Markov law $L$ such that $L_A = Q$ and $L_B = R$. This is called the hyper Markov combination and is denoted $L = Q \odot R$.
As with Markov combinations, hyper Markov combinations are easily generalized to multiple cliques. Let $G$ be a graph with perfectly ordered cliques $(C_1, \ldots, C_k)$. Suppose $C_i$ is imbued with a prior law $G_i$ and that the priors are all pairwise hyperconsistent. Let $L_1 = G_1$ and $L_i = L_{i-1} \odot G_i$ for $i \geq 2$. Then $L = L_k$ is the unique hyper Markov prior satisfying $L_{C_i} = G_i$. We call $L$ the hyper Markov combination of $G_1, \ldots, G_k$. In general we may write $\odot(G_1, \ldots, G_k)$ with the understanding that the cliques are perfectly ordered and $G_1, \ldots, G_k$ are pairwise hyperconsistent.
The next definition generalizes the hyper Dirichlet Process to three or more cliques.
Definition 17 (Hyper Dirichlet Process). Let $\mathcal{G}$ be a graph with a perfect ordering of cliques $C_1, \ldots, C_k$. Suppose that the $i$th clique has marginal distribution $G_i$, that the marginals are pairwise consistent, and let $G = \star(G_1, \ldots, G_k)$. Further suppose that, for each $j$, either $C_j$ or $H_{j-1}$ is a refinement of $S_j$ under sampling, almost surely under $G$, where $H_i$ is the $i$th history and $S_i$ is the $i$th separator. Then $\mathrm{HDP}(\nu, G_1, \ldots, G_k) = \mathrm{DP}(\nu G)$ is a hyper Dirichlet process prior.
The hyper Dirichlet process defined in this way is guaranteed to be hyper Markov. Suppose $L = \mathrm{HDP}(\nu, G_1, \ldots, G_k)$. By Theorem 7, $L_{C_i} = \mathrm{DP}(\nu G_i)$ for $i \geq 2$. Furthermore, it follows from the refinement conditions and Theorem 16 that $L_{H_{i-1} \cup C_i} = \mathrm{DP}(\nu G_{H_{i-1}}) \odot \mathrm{DP}(\nu G_i)$. Hence, by Theorem 3.9 of Dawid and Lauritzen (1993), $L$ is the almost-everywhere unique hyper Markov law such that $L_{C_i} = \mathrm{DP}(\nu G_i)$.
The next theorem states that if the prior distribution of θ is a hyper Dirichlet process, then so is the posterior.
$H'_{A|S}(\cdot|Z) = \delta_a(\cdot)$. (35) On the other hand, suppose $H_S(Z) = 0$. Since $H'_S(Z) > 0$, there is some $i$ such that $X_{iS} = Z$; furthermore, with probability one, $X_{jS} \neq Z$ for all $j \neq i$. Therefore, $H'_{A|S}(\cdot|Z) = \delta_{X_{iA}}(\cdot)$. The last equation holds because $S \subseteq A$, so $\delta_{X_{jA}}(\cdot) = 1$ is a stronger condition than $\delta_{X_{jS}}(Z) = 1$. From these two cases, we see that if $H'_S(Z) > 0$ then $H'_{A|S}(\cdot|Z) = \delta_a(\cdot)$ for some $a$ that depends on $Z$. Ergo, for $X, Y \sim H'$, $X_S = Y_S$ implies $X_A = Y_A$ almost surely, and the posterior Dirichlet process is a hyper Markov measure.

Applications
At first glance, the refinement condition seems unduly restrictive. However, it allows one to use hyper Dirichlet processes in most areas that have benefited from Dirichlet processes. For the many applications that use a continuous base measure, the random atoms are distinct with probability one, so the refinement condition is trivially satisfied. Furthermore, Theorem 18 states that the posterior will also be hyper Markov. Thus, the hyper Dirichlet process can replace the Dirichlet process in applications requiring a posterior estimate, as in MCMC.
The hyper Dirichlet process provides a non-parametric alternative to the hyper Markov laws currently used for problems such as covariance selection and fitting graphical Gaussian models. Most previous work has focused on hyper inverse Wishart priors over $Q_G$ (Roverato and Whittaker, 1998; Giudici and Green, 1999; Letac and Massam, 2007). The hyper inverse Wishart is conjugate to the hyper Wishart, which is the distribution of the restricted maximum likelihood estimate for the Gaussian problem with known mean. Covariance selection could also be carried out with a hyper Dirichlet process, which provides a non-parametric prior and could be advantageous if we wish to relax the Gaussian assumption. A logical choice is to specify a hyper inverse Wishart for the base measure, setting its parameters by empirical Bayes. The precision parameter controls the concentration of the prior around this base measure, and thus can loosely be considered a measure of confidence in the Gaussian assumption.
The real power of Dirichlet processes is in modeling mixture distributions. Suppose $X_1, \ldots, X_n$ are observations from some family parameterized by $\pi$. If we allow $\pi_i$ to differ across observations, then the result is a mixture of distributions. The number of parameters increases with $n$, which necessitates placing some prior on the distribution of $\pi$ in order to fit the mixture. If that prior is unknown, it can be modeled with a Dirichlet process. For example, Escobar and West (1995) develop a Gibbs sampler to estimate the distribution of the parameters of a Gaussian mixture model. In general, a Dirichlet mixture is a hierarchical model:
$$\theta \sim \mathrm{DP}(\alpha), \qquad \pi_1, \ldots, \pi_n \mid \theta \overset{iid}{\sim} \theta, \qquad X_i \mid \pi_1, \ldots, \pi_n \sim f(X \mid \pi_i).$$
From Theorem 7, the posterior for $\theta$ given $\pi_1, \ldots, \pi_{n-1}$ is a Dirichlet process with (normalized) base measure
$$\alpha_{n-1} = \frac{1}{a + n - 1}\Big(\alpha + \sum_{i=1}^{n-1} \delta_{\pi_i}\Big), \qquad a = \alpha(\mathcal{X}),$$
which is also the distribution of $\pi_n \mid \pi_1, \ldots, \pi_{n-1}$. Thus, with positive probability, $\pi_n$ will repeat a previous value $\pi_i$; otherwise, it is drawn from $\alpha$. As a result there will be $k \leq n$ unique values among the $\pi_i$. This induces a latent class model for $X$ in which each class is defined by a shared value of $\pi_i$, and the observations are conditionally independent given this latent class. A key feature of the Dirichlet process is that the number of latent classes is itself estimated. It is clear from the form of $\alpha_{n-1}$ that this estimate is influenced by $a = \alpha(\mathcal{X})$: when $a$ is large, new values of $\pi_i$ will often be drawn from $\alpha$; when $a$ is small, $\pi_i$ will more often repeat a previous value. This is a natural setting for hyper Dirichlet processes. If $X$ is a multivariate random variable, then $\theta$ can be a hyper Dirichlet process for some graph. Once again the observations will be conditionally independent given their latent class. Furthermore, the components of $X$ will have the independence structure specified by the graph.
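The predictive scheme just described (a fresh draw with probability proportional to $a$, otherwise a repeat of a previous value) is easy to simulate. A minimal sketch, ours rather than the paper's, with hypothetical function names and a standard normal standing in for the normalized base measure:

```python
import numpy as np

def polya_urn(n, a, base_sampler, rng=None):
    """Draw pi_1, ..., pi_n from the DP predictive (Polya urn):
    pi_k is a fresh draw from the base measure with probability
    a / (a + k - 1), otherwise it repeats a uniformly chosen
    previous value (creating the shared latent classes)."""
    rng = np.random.default_rng(rng)
    pis = []
    for k in range(1, n + 1):
        if rng.random() < a / (a + k - 1):
            pis.append(base_sampler(rng))            # new latent class
        else:
            pis.append(pis[rng.integers(len(pis))])  # reuse a previous value
    return pis
```

The number of unique values grows slowly with $n$ (roughly like $a \log n$), illustrating how $a = \alpha(\mathcal{X})$ governs the estimated number of latent classes.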
For example, Escobar and West (1995) develop an MCMC algorithm for estimating a Dirichlet mixture of Gaussian distributions. This can be extended to a mixture over the family $N_G$ of Gaussians that are Markov on $G$ by restricting the base measure to $Q_G$. At each update, the posterior of $\theta$ remains a hyper Markov law.
As a final note, we point out that some of these results apply to other stick-breaking measures. Notably, Theorem 16 did not rely on the distribution of the random weights. Therefore, the same conditions imply that any stick-breaking measure is hyper Markov: if $H$ is Markov and the refinement condition holds, then a stick-breaking prior whose atoms have distribution $H$ is a hyper Markov law. Whether or not the posterior is also hyper Markov depends on how the measure is updated. For the Dirichlet process, the posterior update mechanism ensures a hyper Markov posterior as long as the prior is hyper Markov.
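To underline that only the atom distribution matters in this argument, here is a sketch (our illustration, with hypothetical names) of a generic truncated stick-breaking measure in which the stick proportions come from an arbitrary caller-supplied distribution, e.g. the two-parameter Beta sticks of a Pitman-Yor process instead of the Beta(1, α) sticks of a Dirichlet process:

```python
import numpy as np

def stick_breaking(stick_sampler, atom_sampler, n_atoms, rng=None):
    """Generic truncated stick-breaking measure.

    stick_sampler(k, rng) returns the k-th stick proportion v_k in [0, 1];
    the weights are w_k = v_k * prod_{j<k}(1 - v_j).  The atoms are iid
    from the base measure H via atom_sampler, independently of the sticks.
    """
    rng = np.random.default_rng(rng)
    v = np.array([stick_sampler(k, rng) for k in range(n_atoms)])
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = atom_sampler(n_atoms, rng)
    return atoms, w

# Illustrative Pitman-Yor sticks: v_k ~ Beta(1 - d, c + (k + 1) d),
# with discount d and strength c (our parameter choices).
def pitman_yor_stick(k, rng, d=0.25, c=1.0):
    return rng.beta(1.0 - d, c + (k + 1) * d)
```

Swapping `pitman_yor_stick` for a `Beta(1, alpha)` sampler recovers the Dirichlet process; the atoms, and hence the refinement condition, are untouched by the choice.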

Acknowledgements
The author would like to thank Profs. S. Fienberg, A. Rinaldo, C. Schafer, and C. Shalizi for their guidance in developing this paper.
If there is no reason to choose one prior over the other, $\gamma = 1/2$ is appropriate. An interesting choice is $\gamma = \mu(\mathcal{X}_A)/(\mu(\mathcal{X}_A) + \lambda(\mathcal{X}_B))$, which gives more weight to the prior with more information. 3. Minimize the summed KL-divergence. Let $\mu_C$ and $\lambda_C$ be the marginals over $\mathcal{X}_C$.
More work is needed to test these candidate solutions to form good recommendations about their use.