Inversion of Bayesian Networks

Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, in which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property of perfectness (for every node, the set of parents is complete) plays an important role.


Introduction
A generative model is a set of probability distributions that models the distribution of observed and latent variables. Generative models are used in many machine learning applications. One is often interested in performing inference of the latent variables given an observation, i.e. obtaining the posterior distribution. For complex generative models it is often hard to calculate the posterior distribution analytically. The field of variational Bayesian inference (Wainwright et al., 2008) studies different ways of approximating the true posterior. One approach within this field is called amortised inference (Gershman and Goodman, 2014). This approach distinguishes itself by using one set of recognition parameters that is optimised over multiple data points. This can be contrasted with "memoryless" inference algorithms, such as the message passing algorithm (Pearl, 1982; Cowell et al., 1999), which find a separate set of parameters for every data point. Both the variational autoencoder (VAE) (Kingma and Welling, 2013) and the Helmholtz machine (Dayan et al., 1995) are examples of amortised inference. In their most general form these consist of a Bayesian network that is used to model the generative distribution. A second network, called the recognition model, is used to model the posterior distribution. Both networks have the same set of nodes, namely the union of the observed and latent variables. However, in the generative network the arrows point from the latent to the observed nodes, while in the recognition network it is the other way around. The recognition network is therefore in some sense an inversion of the generative network. In many applications, one simply flips the direction of the edges of the generative network to obtain the recognition network. However, as the simple example in Figure 1 shows, this does not guarantee that the recognition model is actually able to model the true posterior distribution of the generative model. In this paper, we study the necessary and sufficient properties of the recognition network such that we do have this guarantee. We first discuss these properties in terms of d-separation, subsequently in terms of perfectness, and finally in terms of single edge operations using the Meek conjecture (Meek, 1997).
Figure 1: Pair of DAGs G, G′, where G′ is obtained by flipping the direction of the edges in G. The variables z_1, z_2 represent the latent variables and x the observed variable. The distribution p such that z_1, z_2 are Bernoulli(0.5) and x = z_1 + z_2 mod 2 can be modelled by G, but the conditional distribution p_{z_1,z_2|x} cannot be modelled by G′.
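The failure in Figure 1 can be checked by brute-force enumeration. The sketch below (plain Python, using the variable names from the caption) computes the true posterior p(z_1, z_2 | x = 0) and compares it with the best any distribution over the flipped graph G′ can do; every kernel over G′ renders z_1 and z_2 conditionally independent given x:

```python
from itertools import product

# Joint of the generative model G: z1, z2 ~ Bernoulli(0.5), x = z1 + z2 mod 2.
joint = {(z1, z2, (z1 + z2) % 2): 0.25 for z1, z2 in product((0, 1), repeat=2)}

def prob(pred):
    return sum(p for k, p in joint.items() if pred(*k))

p_x0 = prob(lambda z1, z2, x: x == 0)                              # P(x = 0)
p_post = prob(lambda z1, z2, x: (z1, z2, x) == (0, 0, 0)) / p_x0   # P(z1=0, z2=0 | x=0)
p_z1 = prob(lambda z1, z2, x: z1 == 0 and x == 0) / p_x0           # P(z1=0 | x=0)
p_z2 = prob(lambda z1, z2, x: z2 == 0 and x == 0) / p_x0           # P(z2=0 | x=0)

# Any K in K_G' makes z1 and z2 conditionally independent given x, so it could
# match the true posterior only if the two numbers below were equal.
print(p_post, p_z1 * p_z2)  # 0.5 vs 0.25
```

Given x = 0, the latent pair is either (0, 0) or (1, 1), so z_1 and z_2 are perfectly correlated, which no product of independent kernels can reproduce.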
In practice, one often puts further restrictions on the probability distributions the networks can model by, for example, letting the distribution of an individual node be Gaussian, with the mean (and variance) being a function of the values of the parent nodes. We discuss the general case of a restricted set of probability distributions, and in particular the case of Gaussian distributions, in the last part of the results section.
The question of finding a sparse G′ that can approximate the posterior distribution of the generative model well is also studied from a more practical perspective, using methods from machine learning. One can use a sparsity prior when learning the recognition model, to encourage that only the edges really necessary for modelling the posterior are added. Löwe et al. (2022), Louizos et al. (2017) and Molchanov et al. (2019) present several approaches.
Markov equivalence is a property of a pair of Bayesian networks that indicates that they encode the same set of conditional independence statements (Verma and Pearl, 1990; Flesch and Lucas, 2007). A generalisation of this, that we will call Markov inclusion, is when the set of conditional independence statements encoded in one graph is a subset of the conditional independence statements encoded in the other graph (Castelo and Kočka, 2003). We will see in Proposition 1 that the results in this paper can also be viewed as describing under which conditions one Bayesian network is Markov inclusive of another.

Example
Before giving a formal definition of the problem, we illustrate the topic of this paper with an example. Consider a generative model for diseases (flu, hayfever) and their symptoms (congestion, muscle pain). Intuitively it is clear that when someone is congested, whether or not they also have muscle pain gives extra information on how likely it is that that person has hayfever. If someone is congested and also has muscle pain, the congestion is more likely to be caused by the flu. This dependence is however not captured in the graph in Figure 3, because no information can flow from muscle pain to hayfever. By adding an edge between muscle pain and hayfever, or between flu and hayfever, this dependence can be captured. This example is intended to give an intuitive idea of the nature of the problem addressed in this paper, and to provide context for the more formal treatment below.
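The "explaining away" effect in the example can be made concrete with a small enumeration. The conditional probability tables below are made-up numbers chosen only for illustration; flu causes congestion and muscle pain, hayfever causes congestion:

```python
from itertools import product

# Hypothetical priors and CPTs (all numbers are illustrative assumptions).
p_flu, p_hay = 0.1, 0.2
def p_cong(flu, hay):  return 0.9 if (flu or hay) else 0.05   # P(congestion | flu, hayfever)
def p_pain(flu):       return 0.8 if flu else 0.1             # P(muscle pain | flu)

def posterior_hay(evidence):
    """P(hayfever = 1 | evidence), by brute-force enumeration over (flu, hayfever)."""
    num = den = 0.0
    for flu, hay in product((0, 1), repeat=2):
        w = (p_flu if flu else 1 - p_flu) * (p_hay if hay else 1 - p_hay)
        w *= p_cong(flu, hay) if evidence['cong'] else 1 - p_cong(flu, hay)
        if 'pain' in evidence:
            w *= p_pain(flu) if evidence['pain'] else 1 - p_pain(flu)
        den += w
        num += w * hay
    return num / den

print(posterior_hay({'cong': 1}))              # 0.625
print(posterior_hay({'cong': 1, 'pain': 1}))   # ~0.333: flu now explains the congestion
```

Observing muscle pain lowers the posterior probability of hayfever, exactly the dependence that a naive inversion of the generative graph cannot represent.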

Notation

Graph theory
For a comprehensive overview of the theory and terminology of probabilistic graphical models, we refer to (Lauritzen, 1996; Cowell et al., 1999; Studeny, 2005). Let G = (N, E) be a directed acyclic graph (DAG), which we always assume to be connected. We say that two vertices s, t ∈ N are joined if (s, t) ∈ E or (t, s) ∈ E. A set of vertices is called complete if all pairs are joined. The sets of parents, children, descendants, and non-descendants of a node s ∈ N are denoted pa(s), ch(s), des(s), nd(s) respectively. G is called perfect if for all s, the set pa(s) is complete. For a subset A ⊂ N, the vertex-induced subgraph of G is denoted G[A]. We let Leaves(G) = {s ∈ N : ch(s) = ∅} be the set of nodes without children, and Roots(G) = {s ∈ N : pa(s) = ∅} be the set of nodes without parents. Furthermore we let V = Leaves(G) be the set of visible nodes, which corresponds to the set of observed variables (such as the symptoms in the example), and H = N \ Leaves(G) be the set of hidden nodes, which are the variables to be inferred (such as the diseases in the example). See Figure 5. For e = (s, t) ∈ E, let e* = (t, s), E* = {e* : e ∈ E}, G* = (N, E*) the graph G with its edges reversed, and G∼ = (N, E ∪ E*) the skeleton (i.e. undirected version) of G. The moral graph of G, denoted G^M, is the skeleton of G with extra (undirected) edges between parents of the same child in G. A path in G from s to t is a sequence of nodes s = u_1, ..., u_n = t such that (u_i, u_{i+1}) ∈ E for all i ∈ {1, ..., n − 1}. A trail γ in G is a sequence of vertices that forms a path in G∼. A trail γ is said to be blocked by S ⊂ N if γ contains a vertex u such that either: (1) u ∈ S and the arrows do not meet head to head at u; or (2) u and des(u) are not in S and the arrows do meet head to head at u. Two subsets A, B ⊂ N are said to be d-separated by S if all trails from A to B are blocked by S, and we write A ⊥_d B | S.
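For concreteness, the moral graph construction just defined can be written in a few lines. This is a sketch; graphs are represented as sets of directed (parent, child) edges:

```python
def moral_graph(edges):
    """Moral graph G^M: the skeleton of the DAG plus undirected edges between
    every two parents of a common child ("marrying" the parents)."""
    parents = {}
    for s, t in edges:
        parents.setdefault(t, set()).add(s)
    undirected = {frozenset(e) for e in edges}  # skeleton of G
    for pa in parents.values():                 # marry co-parents
        undirected |= {frozenset((p, q)) for p in pa for q in pa if p != q}
    return undirected

# The collider z1 -> x <- z2 gains the moral edge z1 - z2:
print(moral_graph({('z1', 'x'), ('z2', 'x')}))
```

The moral edge between z1 and z2 is exactly the dependence between co-parents that the flipped graph in Figure 1 fails to carry.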
A topological ordering of G is an injective map O : N → ℕ that assigns to every node a number such that, if two nodes are joined, the edge points from the lower to the higher numbered node. When a topological ordering is implied, we write s < t to mean O(s) < O(t) and say "s is older than t", and similarly for ">", with s being younger. Given a topological ordering O, the set of predecessors of a node s, denoted pr_O(s), is the set of all nodes with a lower topological number, i.e. pr_O(s) = {t ∈ N : O(t) < O(s)}. Note that this set in general depends on the choice of topological ordering (see Figure 6). For alternative DAGs G′ or Ḡ we denote the above defined symbols with their respective accent, e.g. ch′(s), p̄a(s), ⊥′_d, <′, etc.
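The order-dependence of pr_O(s) is easy to see by brute force. Below, a hypothetical 4-node diamond DAG (an illustrative assumption, not the graph of Figure 6) has two topological orderings, giving two different predecessor sets for node c:

```python
from itertools import permutations

# Hypothetical diamond DAG: a -> b, a -> c, b -> d, c -> d.
edges = {('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')}
nodes = ['a', 'b', 'c', 'd']

def is_topological(order):
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[s] < pos[t] for s, t in edges)

orders = [o for o in permutations(nodes) if is_topological(o)]
# pr_O(c) depends on the ordering O: it may or may not contain b.
preds_of_c = {frozenset(o[:o.index('c')]) for o in orders}
print(sorted(map(set, preds_of_c), key=len))  # [{'a'}, {'a', 'b'}]
```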

Probability on graphs
To every node s ∈ N we associate a measurable space (X_s, 𝒳_s). The state spaces are either real finite-dimensional vector spaces or finite sets, and to each measurable space we associate a (σ-finite) base measure µ_s, which is typically the Lebesgue measure or the counting measure respectively. We then let (X, 𝒳) = (×_{s∈N} X_s, ⊗_{s∈N} 𝒳_s) and assign to this space the base measure µ = ⊗_s µ_s. In this paper, we consider probability distributions P over the space (X, 𝒳). For every s ∈ N we let X_s : X → X_s be the random variable projecting onto the individual space.
For a subset A ⊂ N we let (X_A, 𝒳_A) = (×_{s∈A} X_s, ⊗_{s∈A} 𝒳_s) and similarly X_A = (X_s)_{s∈A} and X = X_N. A typical element of X_s is denoted x_s, with x_A = (x_s)_{s∈A} and x = (x_s)_{s∈N}. We write P_A for the pushforward measure of P through X_A on (X_A, 𝒳_A), i.e. P_A(Ā) = P(X_A ∈ Ā) for Ā ∈ 𝒳_A. For A, C ⊂ N disjoint, we say that a map K : X_C × 𝒳_A → [0, 1] is a Markov kernel if K(x_C, ·) is a probability measure for every x_C ∈ X_C and K(·, Ā) is measurable for every Ā ∈ 𝒳_A. Furthermore, we say that K is a (regular) version of the conditional probability of A given C if it is a Markov kernel and P(X_A ∈ Ā, X_C ∈ C̄) = ∫_{C̄} K(x_C, Ā) dP_C(x_C) holds for all Ā ∈ 𝒳_A and C̄ ∈ 𝒳_C. It can be shown that in our setting one can always find such a Markov kernel, and that it is unique P_C-a.e. (Dudley, 2018). We therefore also denote such a Markov kernel by P_{A|C}. For disjoint subsets A, B, C ⊆ N we say that A is conditionally independent of B given C, and write A ⊥⊥ B | C, if P_{A∪B|C} = P_{A|C} ⊗ P_{B|C} holds P_C-a.e. For s ∈ N, a kernel function is a map k_s : X_s × X_{pa(s)} → [0, ∞) such that k_s(· | x_{pa(s)}) is a probability density w.r.t. µ_s for every x_{pa(s)}. A probability distribution P is said to factorise over G if it has a density p w.r.t. µ and there exist kernel functions (k_s)_{s∈N} such that p(x) = ∏_{s∈N} k_s(x_s | x_{pa(s)}). We denote the set of probability distributions on X that factorise over G by P_G. Similarly, a Markov kernel K from X_{Roots(G)} to 𝒳_{N\Roots(G)} is said to factorise over G if it has densities of the form k(x_{N\Roots(G)} | x_{Roots(G)}) = ∏_{s∈N\Roots(G)} k_s(x_s | x_{pa(s)}). We denote the set of such Markov kernels by K_G.
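As a minimal illustration of the factorisation p(x) = ∏_s k_s(x_s | x_{pa(s)}), take a hypothetical two-node chain a → b with binary states (the kernel values are made up):

```python
# Kernel functions for the chain a -> b: k_a has no parents, k_b conditions on a.
k_a = lambda xa: 0.5
k_b = lambda xb, xa: 0.9 if xb == xa else 0.1

# The joint density is the product of the kernel functions.
p = lambda xa, xb: k_a(xa) * k_b(xb, xa)

total = sum(p(xa, xb) for xa in (0, 1) for xb in (0, 1))
print(total)  # 1.0: each k_s(. | parents) is normalised, so the product is too
```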

Problem statement
Goal I Given a DAG G = (N, E), find a DAG G′ = (N, E′) such that Roots(G′) = Leaves(G) and for every P ∈ P_G there exists K ∈ K_{G′} that is a version of the conditional distribution P_{H|V}.
It turns out (Proposition 1 in the results section) that this goal is equivalent (up to edges between nodes in Leaves(G)) to the following goal: Goal II Given a DAG G = (N, E), find a DAG G′ = (N, E′) such that there exists a topological ordering of G′ in which no vertex outside Leaves(G)¹ is older than the vertices in Leaves(G), and P_{G′} ⊃ P_G.
In the remainder of the paper, we will focus on Goal II. Moreover, we sometimes impose the extra condition G′ ⊃ G*, i.e. E′ ⊃ E*. It can be argued that this is a natural condition, since it enforces that the hierarchical structure of the generative model G is preserved when finding a suitable G′. Note that this condition also guarantees that there exists a topological ordering of G′ in which the nodes in Leaves(G) are oldest.

Lemma 1. If Ḡ = (N, Ē) is obtained from G = (N, E) by adding edges, i.e. Ē ⊃ E, then P_Ḡ ⊃ P_G.

Proof. Since p̄a(s) ⊃ pa(s) for every node s, a density that can be written as ∏_s k_s(x_s | x_{pa(s)}) can also be written as ∏_s k̄_s(x_s | x_{p̄a(s)}).
Lemma 2. Let A, B, S be subsets of N. We have A ⊥⊥ B | S for all P ∈ P_G if and only if S d-separates A and B in G.
Lemma 3 (Theorem 5.14 in Cowell et al. (1999)). Let G be a DAG with a topological ordering O. For a probability distribution P on X, the following conditions are equivalent: (1) P factorises over G, i.e. P ∈ P_G; (2) for all disjoint A, B, S ⊂ N such that A ⊥_d B | S we have A ⊥⊥ B | S; (3) for all s ∈ N we have s ⊥⊥ nd(s) | pa(s); (4) for all s ∈ N we have s ⊥⊥ pr_O(s) | pa(s).

¹ Although G′ has the required structure, it can happen that not all possible topological orderings reflect this. See Figure 6 for an example.
Corollary 1. Let O, Õ be two topological orderings of G. If P satisfies property (4) of Lemma 3 w.r.t. O, then the same is true for Õ.
Proof. Note that (1)–(3) of Lemma 3 are independent of the topological ordering. Therefore we have the following implications: (for all s we have s ⊥⊥ pr_O(s) | pa(s) w.r.t. P) =⇒ P ∈ P_G (with topological ordering O) =⇒ P ∈ P_G (with topological ordering Õ) =⇒ (for all s we have s ⊥⊥ pr_Õ(s) | pa(s) w.r.t. P).
In the rest of the paper, we fix a topological ordering for every DAG; in light of the corollary, it does not matter which one for the purpose of applying Lemma 3. We therefore omit the dependence on the topological ordering when talking about the set of predecessors.

Results
Equivalence of the two goals

Proposition 1. Let G = (N, E) be a DAG, let Ḡ = (N, Ē) be obtained from G by adding edges between the nodes in Roots(G) until Roots(G) is complete, and let S be a set of distributions on X that have a density w.r.t. µ. For all P ∈ S there exists a Markov kernel K ∈ K_G that is a version of the conditional distribution of N \ Roots(G) given Roots(G) if and only if P_Ḡ ⊃ S.
Proof. ( =⇒ ) Let P ∈ S with density p, and suppose that K is a version of P_{N\Roots(G)|Roots(G)} and K ∈ K_G. We need to show P ∈ P_Ḡ. We can write p as follows: p(x) = p(x_{Roots(G)}) · p(x_{N\Roots(G)} | x_{Roots(G)}), where p(x_{N\Roots(G)} | x_{Roots(G)}) is the density corresponding to K (Dudley, 2018). From the fact that K ∈ K_G we know p(x_{N\Roots(G)} | x_{Roots(G)}) = ∏_{s∈N\Roots(G)} k_s(x_s | x_{pa(s)}). Since all the nodes in Roots(G) are joined in Ḡ, we have p(x_{Roots(G)}) = ∏_{s∈Roots(G)} k̄_s(x_s | x_{p̄a(s)}) for suitable kernel functions k̄_s. Combining the above gives p(x) = ∏_{s∈N} k̄_s(x_s | x_{p̄a(s)}), and therefore P ∈ P_Ḡ.
( ⇐= ) Now let P ∈ S again, suppose P ∈ P_Ḡ, and let x ∈ X be such that p(x_{Roots(G)}) > 0. We can write p(x) = ∏_{s∈N} k̄_s(x_s | x_{p̄a(s)}) = ∏_{s∈Roots(G)} k̄_s(x_s | x_{p̄a(s)}) · ∏_{s∈N\Roots(G)} k_s(x_s | x_{pa(s)}), where we can switch from p̄a to pa in the second factor because only edges between nodes in Roots(G) were added to obtain Ḡ. It can be shown that ∏_{s∈Roots(G)} k̄_s(x_s | x_{p̄a(s)}) = p(x_{Roots(G)}) (Cowell et al., 1999, p. 70). Dividing by p(x_{Roots(G)}) on both sides gives p(x_{N\Roots(G)} | x_{Roots(G)}) = ∏_{s∈N\Roots(G)} k_s(x_s | x_{pa(s)}). We know that there exists a Markov kernel K that is a version of the conditional distribution of N \ Roots(G) given Roots(G) and that this kernel has density p(x_{N\Roots(G)} | x_{Roots(G)}) (Dudley, 2018). The last equation shows that this density factorises, and therefore K ∈ K_G.

Conditions in terms of d-separation
Necessary and sufficient conditions for our goal can be deduced from the following theorem.

Theorem 1. Let G = (N, E), G′ = (N, E′) be DAGs. The following statements are equivalent: (1) P_{G′} ⊃ P_G; (2) for all sets A, B, S ⊂ N such that A ⊥′_d B | S, we have A ⊥_d B | S; (3) for all s ∈ N, we have s ⊥_d nd′(s) | pa′(s); (4) for all s ∈ N, we have s ⊥_d pr′(s) | pa′(s).

Proof. (1) =⇒ (2) (by contradiction) Suppose there exist A, B, S such that A ⊥′_d B | S but A and B are not d-separated by S in G. By Lemma 2 this implies that there exists a P ∈ P_G for which A is not independent of B given S. This violates (2) of Lemma 3 w.r.t. G′, and therefore P ∉ P_{G′}.
(2) =⇒ (1) Let P ∈ P_G. We need to show P ∈ P_{G′}. Let A, B, S ⊂ N be such that A ⊥′_d B | S. By (2) we have A ⊥_d B | S, and since P ∈ P_G, Lemma 2 gives A ⊥⊥ B | S. This means that P satisfies (2) of Lemma 3 w.r.t. G′, and therefore P ∈ P_{G′}.

Conditions in terms of perfectness
A sufficient condition for our goal can be deduced from the following theorem.

Theorem 2. Let G = (N, E), G′ = (N, E′) be two DAGs. If G′ contains a subgraph Ḡ′ such that Ḡ′ is perfect and its undirected version Ḡ′∼ contains the moral graph G^M, then P_{G′} ⊃ P_G.

Proof. Let P ∈ P_G. By Lemma 5.9 from Cowell et al. (1999) we know that P factorises undirectedly over the undirected graph G^M, and thus over any undirected graph H = (N, E_H) containing G^M. From Proposition 5.15 in Cowell et al. (1999) we know that P factorises (directedly) over any perfect directed graph Ḡ′ such that Ḡ′∼ = H. Therefore, when Ḡ′∼ ⊃ G^M, we have P ∈ P_Ḡ′ ⊂ P_{G′}.

From this theorem we can conclude that if we flip all the edges of G and then add edges until both G′ is perfect and G′∼ ⊃ G^M, we obtain an inverse of G that satisfies our goal. The example in Figure 7 shows, however, that the condition that G′ contains a perfect subgraph Ḡ′ with Ḡ′∼ ⊃ G^M is not a necessary condition. We do have the following necessary condition on the graph G′:

Theorem 3. Let G′ ⊃ G* and P_{G′} ⊃ P_G. Then for every s ∈ N, the induced subgraph G′[{s} ∪ des(s)] contains a perfect subgraph Ḡ′_s such that Ḡ′∼_s ⊃ G^M[{s} ∪ des(s)].

This theorem is based on the following proposition:

Proposition 2. Let | Roots(G)| = 1, G′ ⊃ G* and P_{G′} ⊃ P_G. Then G′ contains a perfect subgraph Ḡ′ such that Ḡ′ ⊃ G* and Ḡ′∼ ⊃ G^M.

Note that the proposition implies that when | Roots(G)| = 1 the conditions of Theorem 2 are both sufficient and necessary. We first prove Proposition 2 and then show how Theorem 3 can be obtained from it.
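The perfectness condition of Theorem 2 is straightforward to check mechanically; a sketch, with graphs again given as sets of (parent, child) edges:

```python
def is_perfect(edges):
    """A DAG is perfect if for every node the set of its parents is complete,
    i.e. every two parents are joined by an edge in either direction."""
    parents = {}
    for s, t in edges:
        parents.setdefault(t, set()).add(s)
    joined = {frozenset(e) for e in edges}
    return all(frozenset((p, q)) in joined
               for pa in parents.values()
               for p in pa for q in pa if p != q)

# The collider z1 -> x <- z2 is not perfect; adding the edge z1 -> z2 fixes it.
print(is_perfect({('z1', 'x'), ('z2', 'x')}))                 # False
print(is_perfect({('z1', 'x'), ('z2', 'x'), ('z1', 'z2')}))   # True
```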
Proof of Proposition 2. Below we introduce an algorithm for inverting G. We show that the end result is a perfect graph, and that all the steps in the algorithm are necessary for obtaining a graph Ḡ′ for which Ḡ′ ⊃ G* and P_Ḡ′ ⊃ P_G hold. This implies that any G′ for which G′ ⊃ G* and P_{G′} ⊃ P_G hold needs to contain a subgraph Ḡ′ that can be obtained through this algorithm, and that is therefore perfect and such that Ḡ′∼ ⊃ G^M.
The algorithm starts by creating a graph Ḡ′_0 by flipping all edges of G. We then fix a topological ordering of the nodes that is compatible with Ḡ′_0. Subsequently, all parents in G are joined. The while loop starts with the root of G, r_0, and every round adds one more vertex r_i to the set R and makes sure that the set pa′(r_i) is made complete. The idea is that at every step this set R includes one more node of G, and that the induced subgraph Ḡ′_i[R_i] is perfect at every step of the algorithm. See Figure 8 for an example course of the algorithm. Since at the end we have R_i = N, we end up with a perfect graph Ḡ′.
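One possible reading of this algorithm can be sketched in code. This is our reconstruction, not the paper's exact pseudocode: the completion loop simply visits the nodes of G from its root downwards, and `order` plays the role of the fixed topological ordering of Ḡ′_0:

```python
def invert(edges, order):
    """Sketch: flip all edges of G, marry the co-parents of G, then complete
    pa'(r) for each node r, walking from the root of G downwards (i.e. in
    reverse of `order`, a topological ordering of the flipped graph).
    Newly added edges are oriented along `order`."""
    pos = {n: i for i, n in enumerate(order)}
    orient = lambda u, v: (u, v) if pos[u] < pos[v] else (v, u)
    g = {(t, s) for s, t in edges}                    # G*: all edges flipped
    parents_G = {}
    for s, t in edges:
        parents_G.setdefault(t, set()).add(s)
    for pa in parents_G.values():                     # marry co-parents of G
        g |= {orient(p, q) for p in pa for q in pa if p != q}
    for r in reversed(order):                         # root of G first
        pa_r = {s for (s, t) in g if t == r}
        g |= {orient(p, q) for p in pa_r for q in pa_r if p != q}
    return g

# Diamond a -> b, a -> c, b -> d, c -> d (single root a); [d, b, c, a] is a
# topological ordering of the flipped graph.
G = {('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')}
G_inv = invert(G, ['d', 'b', 'c', 'a'])

# The result should be perfect and its skeleton should contain G's moral graph.
skel = {frozenset(e) for e in G_inv}
pa_inv = {}
for s, t in G_inv:
    pa_inv.setdefault(t, set()).add(s)
perfect = all(frozenset((p, q)) in skel
              for pa in pa_inv.values() for p in pa for q in pa if p != q)
moral = {frozenset(e) for e in G} | {frozenset(('b', 'c'))}   # G^M of the diamond
print(perfect, moral <= skel)   # True True
```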

End result perfect
First note that Ḡ′_0[R_0] is perfect. Every node r_i that enters R_i has all its parents joined in Ḡ′_i. After it has entered R_i, no new edges will be joined to it. Therefore, at every step, Ḡ′_i[R_i] is perfect.

All steps are necessary for P_Ḡ′ ⊃ P_G

It is necessary that parents in G are joined. At the start of the algorithm we join all nodes in Ḡ′_0 that are parents of the same node in G. For t_1, t_2 ∈ pa(s) that are not joined in G, we have that t_1 and t_2 are not d-separated by N \ {t_1, t_2} in G (the trail through the common child s is unblocked). However, for any graph Ḡ′ in which t_1 and t_2 are not joined and that has G* as a subgraph, we do have t_1 ⊥′_d t_2 | N \ {t_1, t_2}. Therefore, the only way to satisfy condition (2) of Theorem 1 is by joining t_1 and t_2 in Ḡ′.

It is necessary that parents in Ḡ′_i of r_i are joined. Let t_1, t_2 ∈ pa′(r_i) be not joined in Ḡ′_i, and assume WLOG that t_2 <′ t_1.

Case 1: There exists a path γ_2 from r_0 to t_2 in G such that γ_2 \ {t_2} ⊂ R_i. By the assumption t_2 <′ t_1, there is always a path γ_1 in G from r_0 to t_1 not containing t_2. In order to satisfy property (4) of Theorem 1, we need that the concatenation of the trails γ_1 and γ_2 is blocked by pa′(t_1). Since all nodes except t_2 are younger in Ḡ′ than t_1, it follows that t_2 must be a parent of t_1.

Case 2: There is no path γ_2 from r_0 to t_2 in G such that γ_2 \ {t_2} ⊂ R_i. Let us investigate how the edge (t_2, r_i) came about. First note that (t_2, r_i) ∉ E*, since otherwise the path (r_0, ..., r_i, t_2) would contradict the assumption of Case 2. Now one of the following must hold: 1. ∃j < i such that r_i, t_2 ∈ pa′(r_j); 2. ∃s ∈ N such that t_2, r_i ∈ pa(s).
In case of option 1, we can ask again how the edge (t_2, r_j) came about. We have again that (t_2, r_j) ∉ E*, for similar reasons as above. The same two options are left (with j taking the role of i): 1. ∃j′ < j such that r_j, t_2 ∈ pa′(r_j′); 2. ∃s ∈ N such that t_2, r_j ∈ pa(s).
Since for j = 0 option 1 is definitely no longer valid, we know there must be a j* with 0 ≤ j* ≤ i such that option 1 no longer holds for the edge (t_2, r_{j*}).
At this point, the only option is that the edge (t_2, r_{j*}) came about because t_2 and r_{j*} are both parents in G of a node s (see Figure 9). We know that s <′ t_2 <′ r_i and therefore s ∉ R_i. Furthermore, because there is a path in G from r_0 to s via R_i, we know by a similar argument as in Case 1 that s must be a parent of t_1 in Ḡ′. Now, in order to satisfy property (4) of Theorem 1, either the trail (t_1, ..., r_0, ..., r_{j*}, s, t_2) must be blocked by pa′(t_1) \ {t_2}, or t_2 ∈ pa′(t_1), or both. Since s ∈ pa′(t_1), the v-structure (r_{j*}, s, t_2) does not block this trail. Since all other nodes on the trail except t_2 are younger than t_1 in Ḡ′ and there are no other v-structures, it follows that the trail is unblocked, and therefore t_2 must be a parent of t_1 in Ḡ′.
Figure 9: Example situation of Case 2 in the proof of necessity that parents in Ḡ′_i are joined, highlighting the important edges that play a role in the proof.
Remark 1. Note that all arbitrariness of the algorithm is captured in the fixation of the topological ordering of Ḡ′_0. Given a pair of graphs G, G′ such that G′ ⊃ G*, P_{G′} ⊃ P_G and | Roots(G)| = 1, the algorithm can give us a necessary and sufficient subgraph Ḡ′ by fixing the topological ordering of Ḡ′_0 to be compatible with G′.

Remark 2. Since any perfect graph with a single leaf has a unique topological ordering, it follows from the proposition that any G′ such that G′ ⊃ G*, P_{G′} ⊃ P_G and | Roots(G)| = 1 has this same property.

Lemma 4. If G = (N, E) and Ḡ = (N, Ē) are such that P_G ⊂ P_Ḡ, then the same holds for the vertex-induced subgraphs of both graphs: P_{G[A]} ⊂ P_{Ḡ[A]} for A ⊂ N.

Proof. One can easily check that condition (3) in Theorem 1 remains satisfied when taking vertex-induced subgraphs.
Proof of Theorem 3. Consider a DAG G with | Roots(G)| ≥ 1. Note that by Lemma 4, for any s ∈ N, P_{G′} ⊃ P_G implies P_{G′[{s}∪des(s)]} ⊃ P_{G[{s}∪des(s)]}. Since s is the unique root of G[{s} ∪ des(s)], we know from Proposition 2 that this implies that G′[{s} ∪ des(s)] contains a perfect subgraph Ḡ′_s such that Ḡ′∼_s ⊃ G^M[{s} ∪ des(s)].

In practice, the inverse G′ is often obtained by simply flipping the edges of G. In this case we have the following necessary and sufficient condition for satisfying our goal:

Theorem 4. Let G′ = G*. Then P_{G′} ⊃ P_G if and only if for every s ∈ N both pa(s) and ch(s) are complete.
Proof. ( ⇐= ) If pa(s) and ch(s) are complete for all s ∈ N and G′ = G*, then G′∼ ⊃ G^M and G′ is perfect. The result now follows from Theorem 2.
( =⇒ ) We show the contrapositive. Assume first that there exists an s ∈ N such that t_1, t_2 ∈ pa(s) are not joined. Consider the distribution P ∈ P_G for which X_s = X_{t_1} + X_{t_2} mod 2 and all other nodes are independent Bernoulli(0.5). It is easy to see that P ∉ P_{G′}. Now assume that there exists an s ∈ N such that u_1, u_2 ∈ ch(s) are not joined. Consider the distribution P ∈ P_G such that X_{u_1} and X_{u_2} are equal to X_s and all other nodes (including s itself) are Bernoulli(0.5). It is again easy to see that P ∉ P_{G′}.

Conditions in terms of single edge operations
In the proof of Proposition 2, we presented an algorithm for inverting G that started by flipping all the edges of G at once and then added edges where necessary. In this section we look at obtaining an inverse of G by flipping the edges one by one, adding edges where necessary. The reversal (flipping) of an edge (s, t) is called covered when pa(t) = pa(s) ∪ {s}. Meek (1997) states the following conjecture:

Conjecture 1 (Meek conjecture). Let G = (N, E) and G′ = (N, E′) be DAGs. P_{G′} ⊃ P_G if and only if there exists a sequence of DAGs L_1, ..., L_n such that L_1 = G′, L_n = G, and L_{i+1} is obtained from L_i by one of the following operations: covered edge reversal, or edge removal.

Chickering (2002) later proved this conjecture. This result suggests the outline of an algorithm for the inversion of a Bayesian network G. This algorithm starts with G and chooses a suitable next edge of G to be inverted. Before the edge can be inverted, it first needs to be covered. This can be done by adding new edges, or by changing the direction of edges that were added before. However, all of these operations have to preserve the acyclicity of the graph.
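The covered-edge condition is a one-liner to check; a small sketch:

```python
def is_covered(edges, s, t):
    """The reversal of edge (s, t) is covered when pa(t) = pa(s) + {s};
    only then does reversing it keep the graph a DAG and leave the set of
    representable distributions unchanged."""
    pa = lambda v: {u for (u, w) in edges if w == v}
    return (s, t) in edges and pa(t) == pa(s) | {s}

# In the triangle a -> b, b -> c, a -> c: (b, c) is covered, (a, c) is not,
# since pa(c) = {a, b} while pa(a) + {a} = {a}.
E = {('a', 'b'), ('b', 'c'), ('a', 'c')}
print(is_covered(E, 'b', 'c'), is_covered(E, 'a', 'c'))  # True False
```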

Restricting the set of possible kernel functions
The results derived above address the question of what conditions G′ must satisfy such that for every P ∈ P_G, K_{G′} contains a version of the conditional distribution P_{H|V}. Here it is implied that we allow all possible kernel functions k_s in the definitions of P_G and K_{G′}. In practice, however, restrictions are often put on the space of possible kernel functions. A common choice (Kingma and Welling, 2013) is to allow only Gaussian kernel functions, of the form k_s(x_s | x_{pa(s)}) = N(x_s; f(x_{pa(s)}), σ²), with f some fixed, possibly nonlinear, function. We will now investigate which results remain valid in the restricted case. Given a subset R of kernel functions, we denote the restricted spaces of probability distributions and Markov kernels factorising over G by P_G^R and K_G^R respectively.

Before we dive into the results for general restrictions, we examine the case where R_f is the set of Gaussian kernel functions defined above. Consider the pair of graphs G, G′ in Figure 10. It is clear that this pair of graphs satisfies our original Goal I. However, when we restrict to the set of Gaussian kernel functions, we are no longer able to model the posterior distribution exactly, as we will now show. Consider the distribution in P_G^{R_f} given by X_t ∼ N(0, 1), X_s ∼ N(f(X_t), 1). If the distribution P_{t|s} were in K_{G′}^{R_f}, the joint density of X_t, X_s would have to satisfy, as a function of x_t, a proportionality of the form p(x_t, x_s) ∝ exp(a x_t² + b x_t), where only b may depend on x_s. Working out the actual joint density gives p(x_t, x_s) ∝ exp(−½ x_t² − ½ (x_s − f(x_t))²). We can conclude that P_{t|s} ∈ K_{G′}^{R_f} only if f is a linear function. From this example, we can conclude that the conditions that were sufficient for the unrestricted case are in general not sufficient in the restricted case. Now we look at the validity of our results under general restrictions. We start with the equivalence of the two goals, Proposition 1.
Recall that the proposition shows that finding a G′ such that there exists a topological ordering of G′ in which no vertex outside Leaves(G) is older than the vertices in Leaves(G), and such that P_{G′} ⊃ P_G, is both necessary and sufficient for Goal I. It is easy to see that this is still a sufficient condition (the reverse implication ( ⇐= ) in Proposition 1). However, to obtain the forward implication ( =⇒ ) we used that when all the nodes in Roots(G) are joined, any density function can be written as p(x_{Roots(G)}) = ∏_{s∈Roots(G)} k_s(x_s | x_{pa(s)}). This is no longer the case when we restrict the space of possible kernel functions: the condition is only necessary if for every P ∈ P_G^R the marginal distribution P_V factorises over a complete directed graph on the leaves of G. A slightly weaker necessary condition for Goal I still holds in general.

For Theorem 1, note that conditions (2)–(4) only relate to the graph structures of G and G′, so these conditions remain equivalent in the restricted case. The implication (2) =⇒ (1) does not hold in general, as exemplified by the Gaussian kernel functions above. The implication (1) =⇒ (2), on the other hand, does still hold, under the extra assumption that the restriction R is such that for any graph G and all A, B, S ⊂ N for which A and B are not d-separated by S, there is a P ∈ P_G^R for which A is not independent of B given S. We will sketch how this assumption is satisfied for the Gaussian kernel functions described above. Let A, B, S ⊂ N be such that A and B are not d-separated by S. This implies that there is a trail γ : A ∋ a → b ∈ B that is unblocked by S.
If we let the coefficient θ_{(s,t)} = 1 for every edge (s, t) on γ and zero for all other edges, it can be shown that for this distribution a is not independent of b given S, and therefore A is not independent of B given S. With this assumption in place we now show (1) =⇒ (2). Suppose A, B, S ⊂ N are such that A ⊥′_d B | S. This implies that for all P ∈ P_{G′}^R we have A ⊥⊥ B | S. Now suppose, by contradiction, that A and B are not d-separated by S in G. By the assumption, there must be a P ∈ P_G^R for which A is not independent of B given S, which would contradict (1). Therefore A ⊥_d B | S, which shows (1) =⇒ (2).

Theorem 2 gives only a sufficient condition, which, by the Gaussian kernel function example, is no longer sufficient in the restricted case. Theorem 3, on the other hand, gives only a necessary condition. Its proof only uses the necessity of the conditions in Theorem 1, which we showed above remain valid in the restricted case. We conclude that Theorem 3 therefore also still holds in the restricted case.
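The Gaussian counterexample can also be verified numerically: the log-posterior log p(t | s) is an exact quadratic in t precisely when f is linear, so its third finite difference vanishes for linear f and not for a nonlinear one (the specific choices of f below are illustrative assumptions):

```python
# Model: X_t ~ N(0, 1), X_s ~ N(f(X_t), 1), so up to an additive constant
# log p(t | s) = -t^2/2 - (s - f(t))^2/2.
def log_post(f, s, t):
    return -0.5 * t**2 - 0.5 * (s - f(t))**2

def third_diff(f, s, t, h=0.1):
    # Third finite difference: exactly zero on any quadratic function of t.
    g = lambda u: log_post(f, s, u)
    return g(t + 2*h) - 3*g(t + h) + 3*g(t) - g(t - h)

print(abs(third_diff(lambda t: 2 * t, 1.0, 0.3)))   # ~0: linear f, Gaussian posterior
print(abs(third_diff(lambda t: t**2, 1.0, 0.3)))    # clearly nonzero: non-Gaussian
```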
To conclude this section, we summarise the results for the restricted case. We saw that only a slightly weaker necessary condition for Goal I holds, which must hold for every subset S ⊂ H. Necessary conditions for this latter condition are then provided by Theorems 1 and 3, which remain valid in the restricted case.

Conclusion
In this paper, we derived necessary and sufficient conditions for the recognition network to be able to model the exact posterior distribution of a generative Bayesian network. In the case that the generative network has a single node without parents, the necessary and sufficient conditions coincide. For multiple nodes without parents, however, a gap between the two conditions remains.

Further study directions
A further direction of study could be to find a single necessary and sufficient condition for the general case. Another interesting question is the following: what is the smallest number of edges in an inversion G′ of G? Using the results on single edge operations, one could try to find an algorithm that computes an optimal inversion of G. It is generally believed that the recognition network needs many edges to make exact modelling of the posterior distribution possible (Welling, personal communication, 2022). Therefore, in practice the number of edges in the recognition network is reduced to make it computationally efficient. This approximation does not seem to affect the quality of the inference. Explaining this phenomenon remains an open problem that is relevant for machine learning.

Figure 4: Recognition model capturing the dependence between muscle pain and hayfever

Figure 6: Pair of DAGs G, G′ that satisfy the first requirement of Goal II, but for which there exists a topological ordering of G′ (the one on the right) that does not reflect this.

Figure 7: Pair of DAGs G, G′ that satisfy Goal II but G′ does not satisfy the condition in Theorem 2

Figure 8: Example course of the algorithm. (G) is the original graph. (Ḡ′_0) is the version with the edges of G flipped and the parents connected (red arrow). (Ḡ′_3) is the status halfway through the fourth while loop. The red edges have been added by the algorithm between i = 0 and i = 3.

Figure 10: Pair of graphs G, G′