Bayesian inference in decomposable graphical models using sequential Monte Carlo methods

In this study we present a sequential sampling methodology for Bayesian inference in decomposable graphical models. We recast the problem of graph estimation, which in general lacks a natural sequential interpretation, into a sequential setting. Specifically, we propose a recursive Feynman-Kac model which generates a flow of junction tree distributions over a space of increasing dimension and develop an efficient sequential Monte Carlo sampler. As a key ingredient of the proposal kernel in our sampler we use the Christmas tree algorithm developed in the companion paper Olsson et al. [2018]. We focus on particle MCMC methods, in particular particle Gibbs (PG), as it allows for generating MCMC chains with global moves on an underlying space of decomposable graphs. To further improve the mixing properties of this PG algorithm, we incorporate a systematic refreshment step implemented through direct sampling from a backward kernel. The theoretical properties of the algorithm are investigated, showing in particular that the refreshment step improves the performance of the algorithm in terms of the asymptotic variance of the estimated distribution. The accuracy of the graph estimators is illustrated through a collection of numerical examples demonstrating the feasibility of the suggested approach in both discrete and continuous graphical models.


Introduction
Graphical models provide a convenient framework for representing conditional independencies of a multivariate distribution, which allows for efficient computation of, for example, conditional and marginal distributions. In this paper we focus on inferring the graph distribution purely from data by taking a Bayesian perspective. We consider the class of decomposable graphical models, which has benefited from a great deal of interest during the last decades due to its distributional properties. Specifically, the factorisation of the distribution over cliques and separators makes these models especially attractive for Bayesian inference, see e.g. Dawid and Lauritzen [1993], Rajaratnam et al. [2008].
The common strategy to approximate the graph posterior relies on the class of Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings sampling scheme. These methods generate Markov chains by local moves on the edge set, operating either directly on the space of decomposable graphs or on their corresponding junction trees, see for example Green and Thomas [2013], Giudici and Green [1999], Dellaportas and Forster [1999]. However, these samplers, as well as other MCMC strategies based on local moves, often suffer from mobility issues since in each step only a small part of the edge set is altered while the rest remains invariant. To tackle this issue, we recast the problem of graph estimation, which in general lacks a natural sequential interpretation, into a sequential setting by an auxiliary construction which we will refer to as temporalisation, relying partly on the methodology of SMC samplers Del Moral et al. [2006]. Specifically, we suggest a recursive Feynman-Kac model which generates a flow of junction tree distributions over a space of increasing dimension and develop an efficient sequential Monte Carlo (SMC) sampler. The cornerstone of the proposal kernel of the SMC algorithm is the Christmas tree algorithm (CTA) presented in the companion paper Olsson et al. [2018]. The CTA is based on a Markovian transition kernel defined on the space of junction trees and dominates the Feynman-Kac transition kernel constructed by the temporalisation. The property which motivates the extension to the junction tree space is the tractability of the proposal probability that is guaranteed by the CTA; this property seems to be hard to obtain by operating directly on the space of decomposable graphs, see e.g. Markenzon et al. [2008]. Further relevant approaches include e.g. Stingo and Marchetti [2015], where the focus is set on Gaussian graphical models, which enable faster edge moves by dynamically updating the perfect sequence of the cliques in the graph.
A completely different strategy for decomposable graph sampling, based on what is called tree-dependent bipartite graphs, is presented in Elmasri [2017a,b].
Our SMC algorithm is then incorporated as an inner loop in a particle Gibbs (PG) sampler. In order to reduce the variance and improve the mobility of the standard PG sampler, we further introduce a systematic refreshment step in terms of backward sampling.
The rest of the paper is structured as follows. In Section 2 we introduce some notation and present some standard theoretical results for decomposable graphs and their junction tree representations. In Section 3 we specify the family of distributions over decomposable graphical models suitable for Bayesian structural inference and present a number of motivating examples. Section 4 presents a four stage strategy for the temporalisation of decomposable models. The SMC sampler is designed in Section 5 along with the standard PG and its systematic refreshment extension. In Section 6 we investigate numerically the performance of the suggested PG sampler for three examples of Bayesian structure learning in decomposable graphical models.

Preliminaries
2.1. Some notation. We will always assume that all random variables are well defined on a common probability space (Ω, F, P). We denote by N* the positive natural numbers, and for any (m, n) ∈ N² with m ≤ n we use m, n to denote the set {m, . . . , n}. By R*+ we denote the positive real numbers.
Measurable spaces. Given some measurable space (X, X ), we denote by M(X ) and M1(X ) the sets of measures and probability measures on (X, X ), respectively. In the case where X is a finite set, X is always assumed to be the power set ℘(X) of X, and we simply write M(X) and M1(X) instead of M(℘(X)) and M1(℘(X)), respectively. In the finite case, counting measures will be denoted by |dx|. We let F(X ) denote the set of measurable functions on (X, X ).
Kernel notation. Let µ be a measure on some measurable space (X, X ). Then for any µ-integrable function h, we use the standard notation µh := ∫ h(x) µ(dx) for the Lebesgue integral of h w.r.t. µ. In addition, let (Y, Y) be some other measurable space and K : X × Y → R+ some possibly unnormalised transition kernel. The kernel K induces two integral operators, one acting on functions and the other on measures. More specifically, given a measure ν on (X, X ) and a measurable function h on (Y, Y), we define the measure νK : Y ∋ A ↦ ∫ K(x, A) ν(dx) and the function Kh : X ∋ x ↦ ∫ h(y) K(x, dy) whenever these quantities are well-defined.
Finally, given a third measurable space (Z, Z) and a second kernel L : Y × Z → R+ we define, with K as above, the product kernel KL : X × Z ∋ (x, A) ↦ ∫ L(y, A) K(x, dy) whenever this is well-defined.
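On finite spaces these operators reduce to ordinary linear algebra, which the following illustrative sketch makes explicit (all function names are ours, not the paper's): a kernel is a matrix, νK is a vector-matrix product, Kh is a matrix-vector product, and the product kernel KL is a matrix product.

```python
# Kernel operators on finite spaces: a kernel K : X x P(Y) -> R+ is a matrix,
# measures and functions are vectors. Purely illustrative names and data.

def measure_times_kernel(nu, K):
    """(nu K)(A) = sum_x nu(x) K(x, A); returns the measure on Y as a list."""
    n_y = len(K[0])
    return [sum(nu[x] * K[x][y] for x in range(len(nu))) for y in range(n_y)]

def kernel_times_function(K, h):
    """(K h)(x) = sum_y K(x, y) h(y)."""
    return [sum(K[x][y] * h[y] for y in range(len(h))) for x in range(len(K))]

def kernel_product(K, L):
    """(K L)(x, A) = sum_y K(x, y) L(y, A): an ordinary matrix product."""
    n_z = len(L[0])
    return [[sum(K[x][y] * L[y][z] for y in range(len(L))) for z in range(n_z)]
            for x in range(len(K))]

# Example: a two-state Markov kernel acting on a probability vector.
K = [[0.9, 0.1], [0.2, 0.8]]
nu = [0.5, 0.5]
print(measure_times_kernel(nu, K))  # ≈ [0.55, 0.45]
```

Note that applying a Markov kernel to a probability vector again yields a probability vector, mirroring the fact that νK ∈ M1 whenever ν ∈ M1 and K is Markovian.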
2.2. Decomposable graphs. Given a set V = {a_1, . . . , a_p} of p ∈ N* distinct elements, an undirected graph G with vertices V is specified by a set of edges E ⊆ V × V, and we write G = (V, E). For any pair (a, b) of vertices in G, a path from a to b, denoted by a ∼ b, is a sequence {a_{n_k}}_{k=1}^{ℓ+1}, with ℓ ∈ 1, p − 1, of distinct vertices such that a_{n_1} = a, a_{n_{ℓ+1}} = b, and (a_{n_k}, a_{n_{k+1}}) ∈ E for all k ∈ 1, ℓ. Here ℓ is called the length of the path. A graph is called a tree if there is a unique path between any pair of vertices. A graph is connected when there is a path between every pair of vertices, and a subtree is a connected subgraph of a tree. Let G = (V, E) be a graph and A, B, and S subsets of V; then S is said to separate A from B if for all a ∈ A and b ∈ B, every path a ∼ b intersects S. For V′ ⊆ V, the subgraph of G induced by V′ has nodes V′ and edges E′ given by the subset of edges in E having both endpoints in V′. A complete subgraph is called a clique if it is not an induced subgraph of any other complete subgraph. We denote by Q(G) the family of cliques formed by a graph G. The primary interest in this paper regards decomposable graphs and junction trees, defined next.
Decomposable graphs are sometimes alternatively termed chordal or triangulated, as Definition 1 is equivalent to the requirement that every cycle of length 4 or more has a chord, see e.g. Diestel [2005].
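The chordality characterisation can be checked mechanically. The following sketch is an illustrative (not the paper's) decomposability test based on maximum cardinality search (MCS): a graph is chordal exactly when, in an MCS visit order, every vertex's earlier-visited neighbours form a clique.

```python
# Illustrative chordality test via maximum cardinality search (MCS):
# G is decomposable (chordal) iff an MCS visit order is such that each
# vertex's earlier-visited neighbours are pairwise adjacent.
def is_decomposable(vertices, edges):
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    weight = {v: 0 for v in adj}
    order, unvisited = [], set(adj)
    while unvisited:
        v = max(unvisited, key=lambda u: (weight[u], -u))  # ties: smallest label
        order.append(v); unvisited.remove(v)
        for u in adj[v] & unvisited:
            weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    for i, v in enumerate(order):
        earlier = [u for u in adj[v] if pos[u] < i]
        if any(b not in adj[a] for a in earlier for b in earlier if a != b):
            return False
    return True

print(is_decomposable(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)]))          # False: 4-cycle
print(is_decomposable(range(4), [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))  # True: chorded
```

The 4-cycle is the canonical non-chordal example; adding the chord (0, 2) makes it decomposable.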
The next theorem connects the two concepts above.
Theorem 3 ( [Cowell et al., 2003, Theorem 4.6]). A graph G is decomposable if and only if there exists a tree T of cliques that satisfies the junction tree property.
A tree satisfying the junction tree property and with vertices formed by the cliques of a decomposable graph is called a junction tree (of cliques). Each edge (Q, Q′) in such a junction tree is associated with the intersection S = Q ∩ Q′, which is referred to as a separator. Since all junction tree representations of a specific decomposable graph G have the same separators, it makes sense to speak about "the separators of a decomposable graph". We denote by S(G) the multiset of separators formed by a graph G, where each separator has a multiplicity. The set of equivalent junction tree representations of a decomposable graph G is denoted by T (G), and µ(G) := |T (G)| denotes the number of such representations. The unique graph underlying a specific junction tree T is denoted by g(T ).
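As a small illustration of the separator multiset S(G), the sketch below extracts the separators (with multiplicity) from the edges of a junction tree; the clique data and function name are illustrative stand-ins, not the paper's notation.

```python
# Illustrative computation of the separator multiset S(G): every junction
# tree edge (Q, Q') contributes the separator Q ∩ Q', with multiplicity.
def separators(tree_edges):
    """tree_edges: list of pairs of cliques (sets). Returns the multiset
    as a sorted list of sorted tuples."""
    seps = [frozenset(q) & frozenset(qp) for q, qp in tree_edges]
    return sorted(tuple(sorted(s)) for s in seps)

# Junction tree of the path graph 1-2-3: cliques {1,2} and {2,3},
# joined by one tree edge whose separator is {2}.
cliques = [frozenset({1, 2}), frozenset({2, 3})]
print(separators([(cliques[0], cliques[1])]))  # [(2,)]
```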

Problem statement and motivating examples
3.1. Problem statement. From now on, let V be a fixed set of p ∈ N* distinct vertices. Without loss of generality, we let V = 1, p. For U ⊆ V, we denote by G_U the space of decomposable graphs with vertices U, i.e., G_U := {(U, E) : E ⊆ U × U, (U, E) decomposable}. In particular, set G := G_V. In addition, let Ḡ := ∪_{U ⊆ V} G_U be the space of all decomposable graphs with vertices given by V or some subset of the same.
Definition 4. A positive function γ on Ḡ is said to satisfy the clique-separator factorisation (CSF) if for all G ∈ Ḡ,

γ(G) = ∏_{Q ∈ Q(G)} γ(Q) / ∏_{S ∈ S(G)} γ(S).
For some given function γ satisfying the CSF, the aim of this paper is to develop a strategy for sampling from the probability distribution

η⋆(dG) := (γ⋆(G) / γ⋆½_G) |dG|,    (1)

with γ⋆ := γ|_G denoting the restriction of γ to G and |dG| the counting measure on G.
The normalising constant γ⋆½_G = Σ_{G ∈ G} γ⋆(G) will be considered as intractable, as computing it requires summation of γ⋆ over the whole space G, which is impractical since the cardinality of G is immense already for moderate p.
3.2. Application to decomposable graphical models. As we will see in the following, distributions of form (1) appear frequently in Bayesian analysis of graphical models. Let {(Y_m, Y_m)}_{m=1}^p, p ∈ N*, be a sequence of measurable spaces and let Y = (Y_1, . . . , Y_p) : Ω → Y := ∏_{m=1}^p Y_m be a random element. We consider a fully dominated model where the distribution of Y has a density f on Y with respect to some reference measure ν := ⊗_{m=1}^p ν_m on (Y, Y), where each ν_m belongs to M(Y_m). For some subset A = {a_1, . . . , a_m} ⊆ 1, p with a_1 ≤ . . . ≤ a_m, we let Y_A := (Y_{a_1}, . . . , Y_{a_m}) and define Y_A = ∏_{ℓ=1}^m Y_{a_ℓ} and Y_A = ⊗_{ℓ=1}^m Y_{a_ℓ}. By slight abuse of notation, we denote by f(y_A) the marginal density of Y_A with respect to ν_A := ⊗_{ℓ=1}^m ν_{a_ℓ}. For disjoint subsets A, B, and S of 1, p, we say, following Lauritzen [1996], that Y_A and Y_B are conditionally independent given Y_S if

f(y_A | y_S) f(y_B | y_S) = f(y_{A∪B} | y_S),    (2)

where the conditional densities are defined as f(y_A | y_S) := f(y_{A∪S})/f(y_S). The distribution of Y is said to be globally Markov w.r.t. the undirected graph G = (V, E), with V = 1, p and E ⊆ V × V, if for disjoint subsets A, B, and S of V it holds that Y_A and Y_B are conditionally independent given Y_S whenever S separates A from B in G. We call the distribution governed by f a decomposable model if it is globally Markov w.r.t. a decomposable graph. Then, by repeated use of (2), it is easily shown that the density of a decomposable model satisfies the CSF-type identity

f(y) = ∏_{Q ∈ Q(G)} f(y_Q) / ∏_{S ∈ S(G)} f(y_S),    (3)

where we, in order to justify the notation y_Q and y_S, by slight abuse of notation identify the cliques Q and the separators S (which are complete graphs) with the corresponding subsets of V. In the following we consider the dependence structure G as unknown, and take a Bayesian approach to the estimation of the same on the basis of a given, fixed data record y ∈ Y. For this purpose, we assign a prior distribution

π(G) ∝ ̟(G),    (4)

where ̟ : Ḡ → R*+ is a function satisfying the CSF in Definition 4.
For instance, in the completely uninformative case, ̟ ≡ 1; in the presence of prior information concerning the maximal clique size of the underlying graph, one may let ̟(G) = ½{∨_{Q ∈ Q(G)} |Q| ≤ M} for some M ∈ N controlling the sizes of the cliques. (In both cases, the CSF is immediately checked, see e.g. Byrne and Dawid [2015].) We let the same symbol π denote also the corresponding probability function.
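Evaluating a CSF function is a simple product over cliques divided by a product over separators. The sketch below illustrates this, together with the maximal-clique-size indicator prior just mentioned; the local factor `gamma_local` is an arbitrary illustrative choice, not a quantity from the paper.

```python
import math

# Sketch: evaluate a clique-separator factorising function
#   gamma(G) = prod_{Q in Q(G)} gamma(Q) / prod_{S in S(G)} gamma(S)
# given the cliques and separator multiset of G. `gamma_local` is illustrative.
def csf(cliques, separators, gamma_local):
    num = math.prod(gamma_local(q) for q in cliques)
    den = math.prod(gamma_local(s) for s in separators)
    return num / den

def max_clique_prior(cliques, M):
    """The indicator prior from the text: 1 if all cliques have size <= M."""
    return 1.0 if all(len(q) <= M for q in cliques) else 0.0

# Path graph 1-2-3: cliques {1,2}, {2,3}; separator multiset {{2}}.
cliques = [{1, 2}, {2, 3}]
seps = [{2}]
gamma_local = lambda c: 2.0 ** len(c)   # illustrative local factor
print(csf(cliques, seps, gamma_local))  # 4*4/2 = 8.0
print(max_clique_prior(cliques, M=2))   # 1.0
```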
In this Bayesian setting, focus is set on the posterior distribution η⋆ of the graph G given the available data y, which is, by Bayes' formula, obtained via (1) with γ⋆ induced by γ(G) := f(y | G) ̟(G), G ∈ Ḡ.
The problem of computing the posterior may consequently be perfectly cast into the setting of Section 3.1. The model will in general comprise additional unknown parameters collected in a vector θ, which is assumed to belong to some measurable parameter space (Θ_G, P_G) depending on the graph G. We add θ and G to the notation of the likelihood, which is assumed to be of form

f(y | θ, G) = ∏_{Q ∈ Q(G)} f(y_Q | θ_Q) / ∏_{S ∈ S(G)} f(y_S | θ_S).    (5)
Our Bayesian approach calls for a prior also on θ = {θ_Q, θ_S : Q ∈ Q(G), S ∈ S(G)}, and we will always assume that this is hyper Markov w.r.t. the underlying graph G. More specifically, we assume that the conditional distribution of θ given G has a density w.r.t. some reference measure, denoted dθ for simplicity, on (Θ_G, P_G). This density is assumed to be of form

π(θ | G; ϑ) = ∏_{Q ∈ Q(G)} π(θ_Q; ϑ_Q) / ∏_{S ∈ S(G)} π(θ_S; ϑ_S),    (6)

where ϑ = {ϑ_Q, ϑ_S : Q ∈ Q(G), S ∈ S(G)} is a set of hyperparameters and each factor π(θ_Q; ϑ_Q) (and π(θ_S; ϑ_S)) is a probability density π(θ_Q; ϑ_Q) = z(θ_Q; ϑ_Q)/I(ϑ_Q) with I(ϑ_Q) = ∫ z(θ_Q; ϑ_Q) dθ_Q being a normalising constant.
In the case where each π(θ_Q; ϑ_Q) is a conjugate prior for the corresponding likelihood factor f(y_Q | θ_Q), it holds that

f(y_Q | θ_Q) z(θ_Q; ϑ_Q) = c z(θ_Q; ϑ′_Q(y_Q)),    (7)

for some updated hyperparameter ϑ′_Q(y_Q) and some constant c > 0. If the normalising constants I(ϑ_Q) are tractable, we may marginalise out the parameter and consider directly the posterior of G given data y. Indeed, since ∫ π(θ_Q; ϑ_Q) dθ_Q = 1 for all hyperparameters, the marginalised likelihood is obtained as

f(y | G; ϑ) = ∏_{Q ∈ Q(G)} c_Q I(ϑ′_Q(y_Q))/I(ϑ_Q) / ∏_{S ∈ S(G)} c_S I(ϑ′_S(y_S))/I(ϑ_S),    (8)

where c_Q and c_S denote the constants of (7) for the corresponding factors. Thus, by Bayes' formula, the marginal posterior η⋆ of G given the available data y can be expressed by (1) with γ⋆ induced by γ(G) := f(y | G; ϑ) ̟(G), G ∈ Ḡ.
Example 3.1 (discrete log-linear models). Let V be a set of p criteria defining a contingency table. Without loss of generality, we let V = 1, p and denote the table by I = I 1 ×· · ·×I p , where each I m is a finite set. An element i ∈ I is referred to as a cell.
In this setting, I = (I_1, . . . , I_p) is a discrete-valued random vector whose distribution θ is assumed to be globally Markov w.r.t. some decomposable graph G. The vector I may, e.g., characterise a randomly selected individual w.r.t. the table I. Given G, the parameter space of the model is determined by the clique and separator marginal probability tables θ(i_Q) and θ(i_S). Let Y be a collection of n ∈ N i.i.d. observations from the model; e.g., Y is an n × p matrix where each row corresponds to an observation of I. Then also Y forms a decomposable graphical model with state space Y = I_1^n × · · · × I_p^n and probability function f(y | θ, G) given by (5) with f(y_Q | θ_Q) = ∏_{i_Q} θ(i_Q)^{n(i_Q)}, where n(i_Q) counts the number of elements of y_Q belonging to the marginal cell i_Q.
The problem of estimating the dependence structure G is complicated further by the fact that also the probabilities θ are in general unknown. When assigning a prior π(θ | G; ϑ) to the latter conditionally on the former, we follow Dawid and Lauritzen [1993] and let the prior π(θ_Q; ϑ_Q) of each θ_Q be a standard Dirichlet distribution, whose hyperparameters ϑ_Q are often referred to as pseudo counts. Under the assumption that the collection {π(θ_Q; ϑ_Q)}_{Q ∈ Q(G)} is pairwise hyper consistent, in the sense that for all (Q, Q′) ∈ Q(G)² such that Q ∩ Q′ ≠ ∅, π(θ_Q; ϑ_Q) and π(θ_{Q′}; ϑ_{Q′}) induce the same law on θ_{Q∩Q′} (which in this case is implied by a consistency condition on the pseudo counts), [Dawid and Lauritzen, 1993, Theorem 3.9] implies the existence of a unique hyper Dirichlet law of the form (6). Thus, I(ϑ_Q) is given by the multivariate beta function, and the conjugacy (7) holds with c = 1 and ϑ′_Q(y_Q) given by the pseudo counts updated with the observed cell counts. Then, putting a prior of form (4) on the graph, (8) implies that the marginal posterior of G given data y is obtained through (1) with the so induced γ⋆.

Example 3.2 (Gaussian graphical models). A p-dimensional Gaussian random vector forms a Gaussian graphical model if it is globally Markov w.r.t. some graph G = (V, E) with V = 1, p and E ⊆ V × V. In the following we assume that the model has zero mean (for simplicity) and is, given G, parameterised by its precision (inverse covariance) matrix θ belonging to the set Θ_G of matrices in M⁺_p with θ_ij = 0 whenever (i, j) ∉ E, where M⁺_p denotes the space of p × p positive definite matrices. It is well known that in this model a zero in the precision matrix, θ_ij = 0, is equivalent to conditional independence of the ith and jth variables given the rest of the variables, see Speed and Kiiveri [1986]. In addition, when G is decomposable, a model with θ ∈ Θ_G is globally Markov w.r.t. G. In the following, for any p × p matrix M and A ⊆ 1, p, we denote by M_A the |A| × |A| matrix obtained by extracting the elements (M_ij)_{(i,j) ∈ A²} from M.
Suppose that G is decomposable and that we have access to n independent observations from the model. The observations are stored in an n × p data matrix Y, whose likelihood f(y | θ, G) is, as a consequence of the global Markov property, given by (5). In order to perform a Bayesian analysis of the problem, we follow Dawid and Lauritzen [1993] and furnish, given G, θ with a hyper Wishart prior π(θ | G) of form (6), with each π(θ_Q; ϑ_Q) being a Wishart density with scale matrix v_Q, where v ∈ M⁺_p is a scale matrix and δ > ∨_{Q ∈ Q(G)} |Q| − 1 is the number of degrees of freedom, and with normalising constant expressed in terms of the multivariate gamma function Γ_p. Since all hyperparameters ϑ_Q are extracted from the same scale matrix v, the prior collection {π(θ_Q; ϑ_Q)}_{Q ∈ Q(G)} is automatically pairwise hyper consistent, and the existence of the (unique) hyper Wishart prior is guaranteed by [Dawid and Lauritzen, 1993, Theorem 3.9]. With α_Q := (δ + n + |Q| − 1)/2, we conclude that the conjugacy condition (7) holds (and similarly for the factors corresponding to separators). Consequently, assigning also a prior of form (4) to the graph, (8) implies that the marginal posterior of G given data y is, in this case, obtained through (1) with γ⋆ induced by γ(G) := f(y | G; ϑ) ̟(G). The previous two examples will be discussed further in Section 6.
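As a concrete numerical illustration of the conjugate structure in Example 3.1, the sketch below evaluates the log marginal likelihood contribution of a single clique table as a ratio of Dirichlet normalising constants before and after the pseudo-count update; the function name and interface are ours, not the paper's.

```python
from math import lgamma, exp

# Sketch: Dirichlet-multinomial marginal for one clique table, i.e. the
# ratio I(pseudo + counts) / I(pseudo) of Dirichlet normalising constants,
# computed in log space for numerical stability.
def log_marginal_clique(pseudo, counts):
    """pseudo, counts: lists indexed by the marginal cells i_Q of a clique."""
    a, n = sum(pseudo), sum(counts)
    out = lgamma(a) - lgamma(a + n)
    for p, c in zip(pseudo, counts):
        out += lgamma(p + c) - lgamma(p)
    return out

# Two-cell clique table, uniform pseudo counts 1, one observation in cell 0:
# the marginal probability of that single observation is 1/2.
print(exp(log_marginal_clique([1.0, 1.0], [1, 0])))  # ≈ 0.5
```

The full marginal likelihood (8) would combine such factors over the cliques and separators of G.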

Temporalisation of graphical models
In the light of the previous section, our goal is now to develop an efficient strategy for sampling from distributions of form (1). As mentioned in the introduction, particle MCMC methods are appealing as these allow MCMC chains with "global" moves to be defined also on large spaces. However, unlike our setting, SMC methods sample from sequences of distributions, and a key ingredient of our developments is hence to provide an auxiliary, sequential reformulation of the sampling problem under consideration. This construction, which we will refer to as temporalisation, comprises four steps described in the following, where the last step relies partly on the methodology of SMC samplers Del Moral et al. [2006].
Step I. In the setting of Section 3, the first step of the temporalisation procedure is to define a family of distributions on the spaces {G_v : v ⊆ V}. Using the function γ inducing the target (1) of interest, we define, for each v ⊆ V, the measures

γ⋆_v(dG) := γ|_{G_v}(G) |dG|  and  η⋆_v := γ⋆_v / γ⋆_v½,    (9)

with γ|_{G_v} denoting the restriction of γ to G_v and |dG| the counting measure on G_v. Note that η⋆_V coincides with η⋆, the target of interest. As usual, we will let the same symbols γ⋆_v and η⋆_v denote the probability functions of these measures.
Step II. The second step is to extend each distribution η⋆_v to a distribution η*_v on T_v := ∪_{G ∈ G_v} T (G), the space of junction tree representations of graphs in G_v. Following Green and Thomas [2013], one way of carrying through this extension is to define, for each v ⊆ V, the measure

γ*_v(dT) := (γ⋆_v(g(T)) / µ(g(T))) |dT|,    (10)

with |dT| denoting the counting measure on T_v, and to let η*_v denote the corresponding probability measure. In particular, we set γ* = γ*_V and η* = η*_V.
The following lemma establishes that each extension (10) has the correct marginal w.r.t. the graph, i.e., that η ⋆ v is the distribution of g(τ ) when τ ∼ η * v .
Proof. For any G ∈ G_v, summing (10) over the µ(G) trees in T (G) yields γ⋆_v(G); the statement for the normalised measures then follows by a similar computation, which completes the proof.
Note that by Lemma 5, for G ∈ G_v, the conditional distribution of τ given the event {g(τ) = G} under η*_v is well defined. Moreover, using (11), the right hand side of (12) can be expressed as 1/µ(G), i.e., under η*_v, conditionally on the event {g(τ) = G}, the tree τ is uniformly distributed over the set T (G) (recall that µ(G) is the cardinality of T (G)). In other words, a draw from η*_v can be generated by drawing a graph according to η⋆_v and then drawing a tree uniformly over all junction tree representations of that graph.
Step III. A sequential formulation of the problem calls for additional augmentation of the distribution of interest, and this is the third step of the temporalisation procedure. In this step we design, using the construction in Step II, a sequence η*_{U_1}, η*_{U_2}, . . . , η*_{U_p} = η⋆ of distributions "growing" towards the target η⋆ of interest by forming increasing vertex sets U_m, m ∈ 1, p, through sequential addition of vertices to a set containing initially a single vertex. However, in the context of Bayesian structure learning, the user may, by always processing the vertices in some given order, run the risk of overlooking dependence relations running counter to this specific order. It is hence desirable to allow the vertex processing order to be randomised. For this purpose, let, for all m ∈ 1, p, S_m be the space of all m-combinations of elements in 1, p; in particular, S_p contains only the full vertex set 1, p. In addition, we define, for all m ∈ 1, p, the extended state spaces X_m := {(s_m, T_m) : s_m ∈ S_m, T_m ∈ T_{s_m}} and, for some given discrete probability distribution σ_m on S_m, extended target distributions η_m ∝ γ_m with γ_m(dx_m) := γ*_{s_m}(dT_m) σ_m(s_m). Here we have chosen to write T_m instead of T_{s_m} in order to avoid double subscript notation. Note that each measure γ_m(dx_m) has a density γ_m(x_m) = γ*_{s_m}(T_m) σ_m(s_m) (by abuse of notation, we reuse the same symbol) w.r.t. |dx_m|, the counting measure on X_m. Moreover, since σ_p = δ_{1,p}, η* is the marginal of η_p with respect to the T_p component. The measures {σ_m}_{m=1}^p are supposed to satisfy the recursion

σ_{m+1} = σ_m Σ̄_m,    (13)

with Σ_m being a Markov transition kernel from S_m to 1, p such that Σ_m(s_m, j) = 0 for all j ∈ s_m, and Σ̄_m(s_m, ·) denoting the law of s_m ∪ {s*} when s* ∼ Σ_m(s_m, ·). Consequently, Σ̄_m is a Markov transition kernel from S_m to S_{m+1}. In other words, Σ̄_m transforms a given m-combination s_m into an (m + 1)-combination s_{m+1} by selecting randomly an element s* from the (non-empty) set 1, p \ s_m according to Σ_m(s_m, ·) and adding the same to s_m.
When selecting s * , several approaches are possible; s * can, e.g., be selected randomly from the set {s ∈ 1, p : min s ′ ∈sm |s − s ′ | ≤ δ} for some prespecified distance δ ∈ 1, p . Also the initial distribution σ 1 can be designed freely, e.g., as the uniform distribution over 1, p .
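The randomised processing order of Step III can be sketched as follows; the distance-based eligibility rule mirrors the example just given, and all names and parameter choices below are illustrative.

```python
import random

# Sketch of the randomised vertex-processing order: sigma_1 is uniform over
# the vertices, and each Sigma_m draws the next vertex s* uniformly among the
# unused vertices within distance delta of the current set.
def sample_order(p, delta, rng):
    s = [rng.randrange(1, p + 1)]                # sigma_1: uniform over 1..p
    while len(s) < p:
        eligible = [v for v in range(1, p + 1) if v not in s
                    and min(abs(v - u) for u in s) <= delta]
        s.append(rng.choice(eligible))           # Sigma_m(s_m, .): uniform on eligible
    return s

# For integer-labelled vertices and delta >= 1 the eligible set is never
# empty, since the used and unused vertices always meet at distance 1.
rng = random.Random(0)
order = sample_order(p=6, delta=2, rng=rng)
print(sorted(order))  # [1, 2, 3, 4, 5, 6] -- a permutation of the vertices
```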
Step IV. As mentioned above, SMC methods are, in their basic form, designed for approximating distribution flows defined over sequences of spaces of increasing dimension, and to construct such a flow is the last step of the temporalisation. For this purpose, we let {R_m}_{m=1}^{p−1} be some sequence of Markov transition kernels acting in the reversed direction, i.e., for each m, R_m : X_{m+1} × ℘(X_m) → [0, 1], and define, following Del Moral et al. [2006], for all m ∈ 1, p,

η̄_m(dx_{1:m}) := η_m(dx_m) ∏_{ℓ=1}^{m−1} R_ℓ(x_{ℓ+1}, dx_ℓ),    (14)

where X_{1:m} := ∏_{ℓ=1}^m X_ℓ. Trivially, η̄_m allows η_m as a marginal distribution with respect to the last component x_m. We will in the following use the same symbol R_ℓ to denote the transition probabilities induced by each (discrete) transition kernel R_ℓ, or, in other words, the transition density of R_ℓ w.r.t. the counting measure |dx_ℓ| on X_ℓ. Moreover, we suppose that Supp(R_ℓ(x_{ℓ+1}, ·)) ⊆ Supp(γ_ℓ) for all x_{ℓ+1} ∈ Supp(γ_{ℓ+1}). As the reversed kernels are assumed to be Markovian and known to the user, each extended target distribution η̄_m is known up to the same normalising constant γ_m½ as its marginal η_m. The algorithm that we propose is based on the observation that the distribution flow {η_m}_{m=1}^p satisfies the recursive Feynman-Kac model

η_{m+1} = η_m Q_m / η_m Q_m ½,    (15)

where we have defined the un-normalised transition kernel Q_m(x_m, dx_{m+1}) := (γ_{m+1}(x_{m+1}) R_m(x_{m+1}, x_m) / γ_m(x_m)) |dx_{m+1}|. Remark 6. In the SMC sampler framework of Del Moral et al. [2006], focus is set on sampling from a sequence of probability densities known up to normalising constants and defined on the same state space. In this context, the authors propose to transform the given distribution sequence into a sequence of distributions over state spaces of increasing dimension (given by powers of the original space) by means of an auxiliary Markovian transition kernel. In this construction, each extended distribution is of form (14), with X_m ≡ X for all m, and allows the original density of interest as a marginal with respect to the last component x_m.
Having access to such a flow of distributions over spaces of increasing dimension, standard SMC methods provide numerically stable online approximations of the marginals, the latter satisfying a Feynman-Kac recursion of form (15). In our case, we arrive at the recursion (15) from an entirely different direction, i.e., by starting off with a single distribution defined on a possibly high-dimensional space and constructing an auxiliary sequence of increasingly complex distributions used for directing an SMC particle sample towards the distribution of interest (see the next section).

Particle approximation of temporalised graphical models
In the following we discuss how to obtain a particle interpretation of the recursion (15). Assume for the moment that we have at hand a sequence {K_m}_{m=1}^{p−1} of proposal kernels such that Q_m(x_m, ·) ≪ K_m(x_m, ·) for all m ∈ 1, p − 1 and all x_m ∈ X_m. In our applications, we will let these proposal kernels correspond to the so-called Christmas tree algorithm (CTA) proposed in the companion paper Olsson et al. [2018] and overviewed in Section 5.0.1. We proceed recursively and assume that we are given a sample {(ξ^i_m, ω^i_m)}_{i=1}^N of particles, each particle ξ^i_m = (ς^i_m, τ^i_m) being a random draw in X_m (more specifically, ς^i_m is a random m-combination in 1, p and τ^i_m a random draw in T_{ς^i_m}), with associated importance weights (the ω^i_m's) approximating η_m in the sense that for all h ∈ F(X_m),

η^N_m h := (Ω^N_m)^{−1} Σ_{i=1}^N ω^i_m h(ξ^i_m) ≈ η_m h,

where Ω^N_m := Σ_{i=1}^N ω^i_m and η^N_m denotes the weighted empirical measure associated with the particle sample. In order to produce an updated particle sample approximating η_{m+1}, we plug η^N_m into the recursion (15) and sample from the resulting distribution by means of importance sampling. For this purpose we first extend the previous measure to the index component, yielding a mixture distribution on the product space 1, N × X_{m+1}, and sample from the latter by drawing i.i.d. pairs.
Each draw (I^i_{m+1}, ξ^i_{m+1}) is assigned an importance weight ω^i_{m+1} := w_m(ξ^{I^i_{m+1}}_m, ξ^i_{m+1}), where we have defined the importance weight function

w_m(x_m, x_{m+1}) := γ_{m+1}(x_{m+1}) R_m(x_{m+1}, x_m) / (γ_m(x_m) K_m(x_m, x_{m+1})).    (16)

Finally, the weighted empirical measure is returned as an approximation of η_{m+1}. In the following, let Pr({a_ℓ}_{ℓ=1}^N) denote the categorical probability distribution induced by a set {a_ℓ}_{ℓ=1}^N of positive (possibly unnormalised) numbers; thus, writing W ∼ Pr({a_ℓ}_{ℓ=1}^N) means that the variable W takes the value ℓ ∈ 1, N with probability a_ℓ / Σ_{ℓ′=1}^N a_{ℓ′}. We will always assume that the proposal kernel K_m is of form

K_m(x_m, dx_{m+1}) = Σ̄_m(s_m, s_{m+1}) K*_m⟨s_m, s_{m+1}⟩(T_m, dT_{m+1}),    (17)

where Σ̄_m is defined in (13) and for all (s_m, s_{m+1}) ∈ S_m × S_{m+1}, K*_m⟨s_m, s_{m+1}⟩ is a Markov transition kernel from T_{s_m} to T_{s_{m+1}}. Each discrete law K*_m⟨s_m, s_{m+1}⟩(T_m, ·), T_m ∈ T_{s_m}, has a probability function, which we denote by the same symbol. Note that the assumption (17) implies that for all i ∈ 1, N the selected vertex sets evolve according to Σ̄_m and, consequently, by (13), the σ_m factors cancel. Thus, the importance weight (16) simplifies to a ratio involving only the γ*-factors and the tree kernels.
Further, we have the identity (18), factorising the weight over the cliques and separators modified by the update, where △ denotes the symmetric difference of sets. The first factor in (18) can be further factorised according to the result stated in Theorem 9 of Olsson et al. [2018]. Let G_{m+1} ∈ G_{m+1} be a graph expanded from a graph G_m ∈ G_m in the sense that G_{m+1}[{1, . . . , m}] = G_m; then we can define the set S⋆ ⊆ S(G_{m+1}) consisting of the separators created by the expansion. The factorisation is then expressed in terms of U_1 = {s ∈ S(G_m) : ∃s′ ∈ S⋆ such that s ⊂ s′} and U_2 = {s ∈ S(G_{m+1}) : ∃s′ ∈ S⋆ such that s ⊂ s′}, the sets of separators in G_m and G_{m+1}, respectively, that are contained in some separator in S⋆. The function ν_G(s) denotes the number of equivalent junction trees that can be obtained by randomising a junction tree for the graph G at the separator s. For a more detailed presentation, see Olsson et al. [2018]. The sets Q(g(τ^i_{m+1})) △ Q(g(τ^{I^i_{m+1}}_m)) and S(g(τ^i_{m+1})) △ S(g(τ^{I^i_{m+1}}_m)) in the second factor might comprise only a few cliques and separators, respectively, and computing the products in the numerator and denominator of (18) will in that case be a cheap operation. This is the case for the CTA described in Section 5.0.1 below.
In summary, the identity (18) suggests that the first part of the importance weights may, in principle, be computed with a complexity that does not increase with the iteration index m, as long as the proposal kernel K*_m only modifies and extends the junction tree (and, consequently, the underlying graph) locally.
The SMC update described above is summarised in Algorithm 1.
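A generic, toy instance of the update summarised in Algorithm 1 can be sketched as follows. Here states are bit strings, γ_m has a simple product form, the proposal K_m appends a fair random bit, and the retrospective kernel R_m deterministically drops the last bit (so R_m = 1 on its support); the paper's actual proposal is the CTA on junction trees, so everything below is purely an illustration of the weight formula (16).

```python
import random

# Toy model: gamma_m(x) = prod_i g(x_i) over bit strings x of length m.
g = {0: 1.0, 1: 3.0}

def gamma(x):
    out = 1.0
    for b in x:
        out *= g[b]
    return out

def weight(x_new, x_old, k_prob=0.5, r_prob=1.0):
    """Importance weight gamma_{m+1}(x') R_m(x', x) / (gamma_m(x) K_m(x, x'))."""
    return gamma(x_new) * r_prob / (gamma(x_old) * k_prob)

def smc_step(particles, weights, rng):
    """One particle update: resample an ancestor, propose, reweight."""
    new_particles, new_weights = [], []
    for _ in particles:
        i = rng.choices(range(len(particles)), weights=weights)[0]  # resampling
        b = rng.randrange(2)                                        # K_m: append a fair bit
        x_new = particles[i] + (b,)
        new_particles.append(x_new)
        new_weights.append(weight(x_new, particles[i]))
    return new_particles, new_weights

print(weight((0, 1), (0,)))  # g(1)/0.5 = 6.0
print(weight((0, 0), (0,)))  # g(0)/0.5 = 2.0
```

In this toy case the weight depends only on the appended bit, mirroring the locality property discussed above: the cost of a weight evaluation does not grow with m.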
Remark 7 (design of retrospective dynamics). As we will see, the reversed kernels {R_m}_{m=1}^{p−1} will typically be designed on the basis of the forward proposal kernels {K_m}_{m=1}^{p−1}. It is clear that for all m ∈ 1, p − 1, the constraint that Q_m(x_m, ·) ≪ K_m(x_m, ·) for all x_m ∈ X_m is satisfied as soon as the retrospective kernel R_m is such that Supp(R_m(x_{m+1}, ·)) ⊆ Supp(K_m(·, x_{m+1})) for all x_{m+1} ∈ Supp(γ_{m+1}). Consequently, if the condition (19) holds for all m ∈ 1, p − 1, one may, e.g., construct each retrospective kernel R_m by identifying, for all x_{m+1} ∈ Supp(γ_{m+1}), a nonempty set S_m(x_{m+1}) ⊆ Supp(K_m(·, x_{m+1})) ∩ Supp(γ_m), and letting R_m(x_{m+1}, dx_m) be the uniform distribution over S_m(x_{m+1}). The existence of such a nonempty set is guaranteed by (19); indeed, for x_{m+1} ∈ Supp(γ_{m+1}), (19) ensures that Supp(K_m(·, x_{m+1})) ∩ Supp(γ_m) is nonempty, while for x_{m+1} outside Supp(γ_{m+1}), R_m(x_{m+1}, dx_m) may be defined arbitrarily. As we will see next, the property (19) is satisfied by the junction tree expanders used by us.
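The uniform retrospective kernel of Remark 7 can be sketched generically as follows; `predecessors` is an illustrative stand-in for the set S_m(x_{m+1}) ⊆ Supp(K_m(·, x_{m+1})) ∩ Supp(γ_m), which would be application-specific.

```python
import random

# Sketch of the uniform retrospective kernel: given x_{m+1}, R_m(x_{m+1}, .)
# is uniform over a nonempty set S_m(x_{m+1}) of admissible predecessors.
def backward_kernel(predecessors, rng):
    s = predecessors()       # nonempty list of admissible states x_m
    x = rng.choice(s)        # uniform draw from S_m(x_{m+1})
    prob = 1.0 / len(s)      # transition probability R_m(x_{m+1}, x_m)
    return x, prob

rng = random.Random(1)
x, prob = backward_kernel(lambda: ["a", "b", "c", "d"], rng)
print(prob)  # 0.25
```

Returning the transition probability alongside the draw is what makes the weight (16) computable, since R_m enters its numerator.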
5.0.1. The Christmas tree algorithm. In this section we follow the presentation of Olsson et al. [2018] and without loss of generality disregard the permutations of the nodes for the underlying graphs specified by X m . This implies that we consider a fixed set of ordered nodes s = [1, . . . , m] ∈ S m and by T m we mean T s .
As previously mentioned, {K_m}_{m=1}^{p−1} and {R_m}_{m=1}^{p−1} will here correspond to the kernels induced by the CTA and its reversed version, respectively. The CTA kernel takes as input a junction tree T_m ∈ T_m and expands it to a new junction tree T_{m+1} ∈ T_{m+1} according to K_m(T_m, dT_{m+1}) by adding the internal node m + 1 to the underlying graph of T_m so that g(T_{m+1})[{1, . . . , m}] = g(T_m). It requires two input parameters α, β ∈ (0, 1), jointly controlling the sparsity of the produced underlying graph. Specifically, at the initial step of the algorithm, a Bernoulli trial with parameter 1 − β is performed in order to determine whether or not the internal node m + 1 is isolated in the underlying graph of the produced tree. If m + 1 is not isolated, a high value of the parameter α controls the number of cliques in T_{m+1} that will contain m + 1. In this sense, K_m(T_m, dT_{m+1}) can be regarded as a mixture distribution with weight parameter β.
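The mixture structure of the CTA proposal can be sketched as follows; the `connect` branch is only a placeholder, since the actual tree expansion is given in Olsson et al. [2018], and all names here are illustrative.

```python
import random

# Sketch of the CTA mixture decision: with probability 1 - beta the new
# vertex m+1 is isolated in the expanded underlying graph; otherwise it is
# connected, with alpha controlling how many cliques will contain it.
# The "connect" branch is a placeholder for the CTA proper.
def cta_mixture_step(beta, rng):
    if rng.random() < 1.0 - beta:
        return "isolate"     # m+1 gets no edges in g(T_{m+1})
    return "connect"         # m+1 joins one or more cliques (CTA proper)

rng = random.Random(0)
draws = [cta_mixture_step(beta=0.8, rng=rng) for _ in range(1000)]
frac = draws.count("isolate") / 1000
print(frac)  # roughly 1 - beta = 0.2
```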

Particle Gibbs sampling.
In the following, we discuss how to sample from the extended target η_{1:p}, having the distribution η* of interest as a marginal distribution, using Markov chain Monte Carlo (MCMC) methods. A particle Gibbs (PG) sampler constructs, using SMC, a Markov kernel P^N_p leaving η_{1:p} invariant. Algorithmically, essentially the only difference between the PG kernel and the standard SMC algorithm is that the PG kernel, which is described in detail in Algorithm 2, evolves the particle cloud conditionally on a fixed reference trajectory specified a priori; this conditional SMC algorithm is constituted by Lines 1-16 in Algorithm 2. After having evolved, for p time steps, the particles of the conditional SMC algorithm, the PG kernel draws randomly a particle from the last generation (Lines 17-19), traces the genealogical history of the selected particle back to the first generation (Lines 20-22), and returns the traced path (Line 23).
As established in [Chopin and Singh, 2015, Proposition 8], P N p is η 1:p -reversible and thus leaves η 1:p invariant. Interestingly, reversibility holds true for any particle sample size N ∈ N * \ {1}. Thus, on the basis of P N p , the PG sampler generates (after possible burn-in) a Markov chain {X ℓ 1:p } ℓ∈N * according to X ℓ+1 1:p ∼ P N p (X ℓ 1:p , ·) and returns M^{-1} ∑_{ℓ=1}^{M} h(X ℓ 1:p ) as an estimate of η 1:p h for any η 1:p -integrable objective function h ∈ F(X 1:p ). Here M ∈ N * denotes the MCMC sample size. In particular, in the case where the objective function h depends on the argument T p only, we obtain the corresponding estimator of η * h, where each Z ℓ p variable is extracted, on Line 18, at iteration ℓ − 1 of Algorithm 2.
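The resulting estimator is straightforward to express in code. In this hypothetical sketch, `pg_kernel` stands in for one application of the PG kernel P N p (i.e., one conditional SMC sweep plus path selection, Algorithm 2), abstracted as a plain callable:

```python
def pg_estimate(pg_kernel, x0, M, h, burn_in=0):
    """Sketch of the PG estimator M^{-1} sum_l h(X^l_{1:p}): iterate an
    invariant kernel (any callable x -> x') from a reference trajectory x0
    and average the objective function h along the resulting chain.
    `pg_kernel` is a stand-in for P^N_p (hypothetical interface)."""
    x = x0
    total = 0.0
    for ell in range(burn_in + M):
        x = pg_kernel(x)  # one conditional SMC sweep + path selection
        if ell >= burn_in:
            total += h(x)
    return total / M
```

With a toy deterministic "kernel" x → x + 1 started at 0 and h the identity, the chain visits 1, 2, 3, 4 and the estimator returns their average.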
5.2. PG with systematic refreshment. For the graph-oriented applications of interest in the present paper, the naive implementation of the PG sampler will suffer from bad mixing, even though the distribution of interest, η * , is defined only on the marginal space T p . Thus, we modify slightly the standard PG sampler by inserting an intermediate refreshment step in between the PG iterations. More specifically, define the refreshment kernel G p as follows: given x 1:p , drawing X 1:p ∼ G p (x 1:p , dx ′ 1:p ) amounts to setting, deterministically, X p = x p and simulating X 1:p−1 according to the Markovian retrospective dynamics induced by the kernels {R m } p−1 m=1 . Note that each distribution G p (x 1:p , dx ′ 1:p ) depends exclusively on x p . Being a standard Gibbs substep for sampling from η 1:p , G p is η 1:p -reversible; see, e.g., [Cappé et al., 2005, Proposition 6.2.14]. Consequently, also the product kernel P N p G p is η 1:p -invariant. Unlike standard PG, the MCMC sampling scheme that we propose, which is summarised in Algorithm 3, generates (after possible burn-in) a Markov chain {X ℓ 1:p } ℓ∈N * according to X ℓ+1 1:p ∼ P N p G p (X ℓ 1:p , ·) and returns M^{-1} ∑_{ℓ=1}^{M} h(X ℓ 1:p ) as an estimator of η 1:p h for any η 1:p -integrable function h. In addition, as previously, in the case where the objective function h depends on the argument T p only, we obtain the estimator of η * h, where each Z ℓ p variable is extracted, on Line 2, at iteration ℓ−1 of Algorithm 3.
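The refreshment step itself is cheap: it keeps the last coordinate and re-simulates the prefix backwards. A minimal sketch, assuming each backward kernel R m is available as a callable mapping the state at index m + 1 (and an RNG) to a draw of the state at index m (this interface is our own):

```python
import random

def refresh(path, backward_kernels, rng=None):
    """Draw from G_p(x_{1:p}, .): keep the last state deterministically and
    re-simulate x_{p-1}, ..., x_1 via the retrospective dynamics induced by
    the backward kernels {R_m}. backward_kernels[m] is assumed to be a
    callable (state_at_m_plus_1, rng) -> sample_of_state_at_m."""
    rng = rng or random.Random()
    p = len(path)
    new_path = [None] * p
    new_path[p - 1] = path[p - 1]      # X_p := x_p, kept as-is
    for m in range(p - 2, -1, -1):     # backwards through the prefix
        new_path[m] = backward_kernels[m](new_path[m + 1], rng)
    return new_path
```

Note that, exactly as in the text, the output depends on the input path only through its last component.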
5.2.1. PG with systematic refreshment vs. standard PG. As established by the following theorem, the systematic refreshment step indeed improves the mixing of the algorithm. For any functions g and h in L 2 (η 1:p ) we define the scalar product ⟨g, h⟩ := η 1:p (gh). Moreover, for all η 1:p -invariant Markov kernels M on (X 1:p , X 1:p ) and all functions h ∈ L 2 (η 1:p ) satisfying the summation condition (23), we define the asymptotic variance (24), where {X ℓ 1:p } ∞ ℓ=1 is a Markov chain with initial distribution η 1:p and transition kernel M. (The assumption (23) can be shown to imply the existence of the limit (24).) In the case where the latter Markov chain satisfies a central limit theorem for the objective function h, the corresponding asymptotic variance is given by (24). As established by the following result, whose proof relies on asymptotic theory for inhomogeneous Markov chains developed in Maire et al. [2014], the improved mixing implied by systematic refreshment of the trajectories yields a decrease of asymptotic variance w.r.t. standard PG.
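In standard form, the summation condition (23) and the asymptotic variance (24) can be written as follows; this is a plausible rendering consistent with the surrounding definitions (summability of autocovariances and the limiting scaled variance), and the exact displays follow Maire et al. [2014]:

```latex
% Summation condition (23) -- absolute summability of the autocovariances --
% and asymptotic variance (24); notation as in the surrounding text.
\sum_{\ell = 0}^{\infty}
  \bigl| \langle h, \mathsf{M}^{\ell} h \rangle - (\eta_{1:p} h)^2 \bigr|
  < \infty,
\qquad
\sigma^2(\mathsf{M}, h) := \lim_{n \to \infty} \frac{1}{n}
  \operatorname{Var}\!\Bigl( \sum_{\ell = 1}^{n} h\bigl(X^{\ell}_{1:p}\bigr) \Bigr).
```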
Theorem 8. For all N ∈ N * and all functions h * ∈ L 2 (η * ) such that both P N p G p and P N p satisfy the summation condition (23) with h := 𝟙 X 1:p−1 ⊗ h * , it holds that σ 2 (P N p G p , h) ≤ σ 2 (P N p , h).

Proof. First, as established in [Chopin and Singh, 2015, Proposition 8], the standard PG kernel P N p is η 1:p -reversible. As mentioned above, the kernel G p is straightforwardly η 1:p -reversible as a standard Gibbs substep. Moreover, for all x 1:p ∈ X 1:p , G p (x 1:p , X 1:p−1 × {x p }) = 1 and G p trivially dominates the Dirac mass on the off-diagonal, in the sense that for all A ∈ X 1:p and x 1:p ∈ X 1:p , G p (x 1:p , A \ {x 1:p }) ≥ δ x 1:p (A \ {x 1:p }) = 0. The assumptions of [Maire et al., 2014, Lemma 18] are thus fulfilled, and the proof is concluded through application of the latter.

Structure learning in graphical models
In this section we investigate numerically the performance of the suggested PG algorithm in three examples of Bayesian structure learning in decomposable graphical models. The first two examples focus on discrete data. Specifically, the first one treats the classical Czech Autoworkers dataset found in, e.g., David Edwards [1985]. The second one considers simulated data generated from the discrete p = 15 node example presented in Jones et al. [2005]. The third example focuses on continuous models and presents Bayesian structure learning in a decomposable GGM of dimensionality p = 50 with a time-varying dependence structure.
Here, the proposal kernels {K m } p−1 m=1 and backward kernels {R m } p−1 m=1 are designed according to the CTA and its reversed version, respectively, as provided by expressions (2) and (3) in the companion paper Olsson et al. [2018].
For each example we investigate properties of the graph posterior and the produced Markov chains.
For the transition of m-combinations, from s m to s m+1 , a new node is selected at random from the set {s ∈ {1, . . . , p} : min s ′ ∈s m |s − s ′ | ≤ δ}, as suggested in step IV of the temporalisation procedure in Section 4. A practical situation where this selection procedure is expected to be particularly valuable is when the variables under study have a time interpretation. In such examples, it might be reasonable to believe that each variable is directly dependent only on those closest in time. Thus, by selecting δ properly, we may use it as a neighborhood parameter that steers the PG algorithm towards sampling the most probable graphs, which also most often coincide with the graphs of most interest. In the first two examples, due to the absence of a time-dependent dynamic, we set δ = p. In the third example we investigate the effect of δ on capturing the time-varying dependence structure.
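The candidate set for the new node can be computed directly from its definition; a small sketch (illustrative function name; we additionally assume that nodes already in s m are excluded):

```python
def candidate_nodes(s_m, p, delta):
    """Candidate set {s in {1,...,p} : min_{s' in s_m} |s - s'| <= delta}
    for growing the m-combination s_m to s_{m+1}, excluding nodes already
    in s_m (an assumption of ours)."""
    return sorted(
        s for s in range(1, p + 1)
        if s not in s_m and min(abs(s - t) for t in s_m) <= delta
    )
```

With δ = p every remaining node is a candidate, recovering the setting used in the first two examples.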
The estimated graph posteriors are summarized in terms of the marginal edge distribution, presented as a heatmap in which the probability of an edge (a, b) is estimated by M^{-1} ∑_{ℓ=1}^{M} 𝟙{(a, b) ∈ g(Z ℓ p )} according to (22). In order to evaluate the mixing properties of the sampler, we consider the estimated auto-correlation function for the number of edges in the sampled graph trajectory. Since in many practical situations the aim is to select one particular model that best represents the underlying dependencies, we also present the maximum a posteriori (MAP) graph for each of the examples.
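The heatmap entries are plain Monte Carlo averages over the sampled graphs; a minimal sketch, representing each sampled graph by its 0/1 adjacency matrix as nested lists (names our own):

```python
def edge_marginals(adj_samples):
    """Estimate marginal edge probabilities from sampled graphs: entry
    (a, b) is the fraction of sampled adjacency matrices containing the
    edge, i.e. the Monte Carlo average M^{-1} sum_l 1{(a,b) in g(Z^l_p)}
    behind the heatmaps."""
    M = len(adj_samples)
    n = len(adj_samples[0])
    return [[sum(adj[a][b] for adj in adj_samples) / M for b in range(n)]
            for a in range(n)]
```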
6.1. Czech Autoworkers data. This dataset, previously analyzed many times in the literature, comprises 1841 men, randomly selected from a population of Czech autoworkers, cross-classified with respect to six potential risk factors for coronary thrombosis: (Y 1 ) smoking, (Y 2 ) strenuous mental work, (Y 3 ) strenuous physical work, (Y 4 ) systolic blood pressure, (Y 5 ) ratio of beta and alpha lipoproteins, and (Y 6 ) family anamnesis of coronary heart disease. In the absence of any prior information we assume that the data are generated from the decomposable graphical model presented in Section 5. Each of the 64 cells in the contingency table is assigned a pseudo count of 1/64, which in turn induces hyperparameters in the conjugate prior Hyp-Dir(ϑ) defined by {π(θ Q ; ϑ Q )} Q∈Q(G) , where ϑ Q = {ϑ Q(i Q ) } i∈I Q and ϑ Q(i Q ) = |I V \Q |/64. This type of low-dimensional model is suitable for evaluation purposes since it is possible to specify the posterior distribution exactly. Specifically, the total number of decomposable graphs with six nodes is 18154, making a full specification of the posterior distribution tractable.
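As a sanity check on the induced hyperparameters: spreading a total pseudo count of 1 uniformly over the 2^6 = 64 cells of the full table gives each cell of a clique marginal table the count |I V\Q|/64, since |I V\Q| full-table cells collapse onto each clique cell. A hypothetical helper illustrating the arithmetic for binary variables:

```python
def clique_cell_pseudocount(p=6, clique_size=2, levels=2, total_pseudo=1.0):
    """Pseudo count landing on one cell of a clique-marginal table when a
    total pseudo count is spread uniformly over the levels**p cells of the
    full contingency table: levels**(p - |Q|) full-table cells collapse
    onto each clique cell (illustrative helper, not from the paper)."""
    cells_full = levels ** p
    cells_outside = levels ** (p - clique_size)  # |I_{V \ Q}|
    return total_pseudo * cells_outside / cells_full
```

For instance, a two-node clique in the six-node model receives 16/64 = 0.25 per cell, and the four cells of its marginal table sum back to the total pseudo count of 1.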
All the estimators are based on N = 100 particles and averaging is performed over M = 10000 PG runs according to equation (22). The heatmaps for the exact and estimated posterior distributions are displayed in Figure 1 along with a plot of the estimated auto-correlation. A visual inspection of the marginal edge probabilities in the heatmaps indicates good agreement between the distributions. From the auto-correlation plot we deduce that the PG sampler exhibits very good mixing properties for this type of small-scale problem. Table 1 summarizes the edge sets for the top five graphs under both the exact and the estimated posterior distributions along with their corresponding probabilities. It is important to note that the top five graphs are exactly the same for these two distributions and that the estimated probabilities are in good agreement with the exact ones. Our findings are also consistent with the results obtained by Massam et al. [2009] and Madigan and Raftery [1994]. Specifically, our highest posterior probability decomposable models are the same as those identified by Massam et al. [2009]; see Table 2, case α = 1.0, in that paper.
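Tables such as Table 1 can be produced by tabulating the frequencies of the sampled graphs; a sketch in which graphs are identified by their edge sets (interface and names our own):

```python
from collections import Counter

def top_graphs(adj_samples, k=5):
    """Return the k most frequent sampled graphs and their estimated
    posterior probabilities. Each graph (0/1 adjacency matrix as nested
    lists) is identified by the frozenset of its edges (a, b), a < b."""
    M = len(adj_samples)

    def edges(adj):
        n = len(adj)
        return frozenset((a, b) for a in range(n)
                         for b in range(a + 1, n) if adj[a][b])

    counts = Counter(edges(adj) for adj in adj_samples)
    return [(sorted(e), c / M) for e, c in counts.most_common(k)]
```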
Finally we remark that our results obtained in this example appear to be insensitive to the choice of CTA parameters α and β.

6.2.
Discrete log-linear model with 15 nodes. We study another discrete decomposable model, here with p = 15 nodes and the dependence structure displayed in Figure 2, presented in [Jones et al., 2005, Figure 1]. The parameters were selected to satisfy (9), thereby ensuring that the distribution Θ G specified in Example 3.1 is Markov with respect to G. Analogously to the previous example, we use the Hyp-Dir(ϑ) prior and assign to each cell in the contingency table a pseudo count of 1/2 15 . Also, due to the absence of any time interpretation of the model, δ is selected as 15. We used the CTA parameters α = 0.2 and β = 0.8, obtained as the parameter setting giving the best mixing properties among all combinations on the grid α, β ∈ {0.2, 0.5, 0.8}.
To evaluate the effect of the number of particles on performance accuracy, we sampled n = 100 data vectors and estimated both the graph posterior and the auto-correlation function using N = 20 and N = 100. By comparing the true underlying graph in Figure 2 with the estimated models in Figure 3, we observe that increasing N from 20 to 100 gives a slightly better agreement with the true adjacency matrix. This effect can be further explained by the behavior of the estimated auto-correlation function; by increasing N we observe a clear reduction of the auto-correlation. Qualitatively, we conclude that the mobility of the PG sampler is improved when increasing N .

Figure 2. The true underlying decomposable graph on p = 15 nodes along with its adjacency matrix.
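The auto-correlation diagnostics used throughout can be computed from the scalar edge-count trajectory; a minimal pure-Python sketch of the standard estimator:

```python
def acf(x, max_lag):
    """Estimated auto-correlation function of a scalar chain, e.g. the
    number of edges along the sampled graph trajectory. Uses the usual
    biased estimator normalised by the lag-0 autocovariance."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]                  # centred series
    denom = sum(v * v for v in c)              # lag-0 autocovariance * n
    return [sum(c[i] * c[i + k] for i in range(n - k)) / denom
            for k in range(max_lag + 1)]
```

By construction the estimate at lag 0 is 1, and slow decay over lags signals strong serial dependence, i.e. poor mixing.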
6.3. Decomposable GGM with temporal dependence. In this example we study a decomposable GGM where a temporal interpretation of the underlying dependence structure is sensible. The graph structure along with its adjacency matrix are displayed in Figure 4 and can be interpreted as an AR process with lag varying between 1 and 5. Regarding the parameterization, we consider the Gaussian distribution with zero mean and covariance matrix θ −1 . The precision matrix θ is a modification of the second order intra-class structure considered in Thomas and Green [2009], but with varying band width. We set σ = 1.0 and ρ = 0.9 and sampled n = 100 data vectors from this model. We use the Hyper-Wishart prior for θ as described in Example 3.2, where for each clique Q the shape parameter is set equal to p and the scale matrix v is set to the identity matrix of dimension |Q|.
The temporal interpretation of the dependence structure in the current example is, as mentioned above, particularly appealing for investigating the role of the neighborhood parameter δ. Estimation results are presented in Figure 5 and Figure 6. By comparing the two heatmaps one can see that the dependence structure can be efficiently captured by a proper choice of δ. Indeed, by selecting δ close to the maximal band size of 5 for the true graph and letting α = 0.8 and β = 0.5, reflecting a preference for connected graphs, we obtain a dependence pattern more similar to the true one. This is also reflected by the log-likelihood in the bottom row of Figure 6. On the other hand, in view of the auto-correlation (after burn-in) shown in the middle row of Figure 6, the mixing properties of the PG sampler seem to improve with increasing δ.
6.4. Comparison to Metropolis-Hastings. We use the second order intra-class dataset from Section 6.3 to compare the PG sampler to our Python implementation of the Metropolis-Hastings (MH) algorithm proposed in Thomas and Green [2009]. However, we use the same Hyper-Wishart prior for θ as in Section 6.3, instead of specifying independent priors for σ 2 and ρ as in Thomas and Green [2009]. The output is summarized in Figure 7 and Figure 8. In order to improve the mixing properties of the MH sampler we followed the suggestions of Thomas and Green [2009] and randomized the junction tree every λth iteration. Here, we show results for λ = 100 and λ = 1000 in the left and right columns of the figures, respectively. Firstly, we note that the estimated auto-correlation of the MH sampler, shown in the middle row of Figure 8, is substantially stronger than that of the PG sampler in both cases, being about 500 for δ = 50 compared to about 20000 for λ = 100. Also, from the size and log-likelihood trajectories of the MH sampler, presented in the upper and lower rows of Figure 8 respectively, reaching what appears to be the stationary distribution requires far fewer steps for the PG sampler than for the MH sampler: about 2000 for δ = 5 compared to about 350000 for λ = 100.
Both the heatmaps and the MAP graphs for the MH sampler, presented in Figure 7, confirm the suggestion of Thomas and Green [2009] that a more frequent junction tree randomization in the MH sampler has a good effect on recovering the underlying model. The main advantage of the MH sampler is its superior speed, which compensates for the inferior mixing properties seemingly inherent in the local-move approach. For the PG sampler, each sample took about 3 seconds to generate on an iMac (late 2012) with a 2.7 GHz Intel Core i5 processor. However, we expect that the speed of the PG sampler could be improved by, for example, employing more efficient data structures for the junction tree representation and by improved caching strategies. The PG sampler is part of the trilearn library, available for download at https://github.com/felixleopoldo/trilearn.