Uniqueness of the Level Two Bayesian Network Representing a Probability Distribution

Bayesian networks are graphical probabilistic models through which we can acquire, capitalize on, and exploit knowledge. Over the last decade they have become an important tool for research and applications in artificial intelligence and many other fields. This paper presents Bayesian networks and discusses the inference problem in such models. It states the problem and proposes a method to compute probability distributions. It also uses D-separation to simplify the computation of probabilities in Bayesian networks. Given a Bayesian network over a family I of random variables, this paper presents a result on the computation of the probability distribution of a subset S of I using, separately, a computation algorithm and D-separation properties. It also shows the uniqueness of the obtained result.


Introduction
Bayesian networks are graphical models for probabilistic relationships among a set of variables. They have been used in many fields due to their simplicity and soundness. They are used to model, represent, and explain a domain, and they allow us to update our beliefs about some variables when some other variables are observed, which is known as inference in Bayesian networks.
Given a Bayesian network [1] relative to the set $X_I = (X_i)_{i \in I}$ of random variables, we are interested in computing the joint probability distribution of a nonempty subset $S$ of $I$, called the target.
The computation of the probability distribution of $X_S$ requires marginalizing out the set of variables indexed by $\bar{S} = I - S$ from the joint distribution $P_I$ corresponding to the Bayesian network.
In large Bayesian networks, the computation of probability distributions and conditional probability distributions may require summations relative to very large subsets of I.

International Journal of Mathematics and Mathematical Sciences
Consequently there is a need to order and, if possible, segment these computations into several computations that are less intensive and more amenable to parallel treatment. These segmentations are related to the graphical properties of the Bayesian networks.
This paper describes the computation of $P_S$ using a specific order described by a proposed algorithm, and the segmentations of $P_S$.
We consider discrete random variables, but the results presented here can be generalized to continuous random variables with the density of $X_i$ taken relative to a finite measure $\mu_i$; the summations are then replaced by integrations relative to those measures $\mu_i$.
The paper is organized as follows. Section 2 introduces Bayesian networks and level two Bayesian networks. We then present in Section 3 the inference problem in Bayesian networks and the proposed computation algorithm. Section 4 outlines D-separation in Bayesian networks and describes graphical partitions that allow the segmentation of the computations of probability distributions. Section 5 proves the uniqueness of the results obtained in Sections 3 and 4.

Bayesian Networks
A Bayesian network (BN) is a family of random variables $(X_i)_{i \in I}$ such that: (i) the set $I$ is finite and endowed with the structure of a directed acyclic graph (DAG) $G$, where, for each $i$, $p(i)$ denotes the set of parents of $i$ and $d(i)$ the set of descendants of $i$; (ii) for each $i$, $X_i$ is independent of $(X_j)_{j \in I - (p(i) \cup d(i))}$ conditional on $X_{p(i)}$ (for more details see, e.g., [1-5]).
We know that this is equivalent to the equality
$$P_I(x_I) = \prod_{i \in I} P_{i/p(i)}(x_i / x_{p(i)}),$$
where $P_I$ is the joint probability distribution of $X_I = (X_i)_{i \in I}$ and $P_{i/p(i)}$ is the probability distribution of $X_i$ conditional on $X_{p(i)} = (X_j)_{j \in p(i)}$. The joint probability distribution corresponding to the BN in Figure 1 can be written as

Level Two Bayesian Networks
We consider the probability distribution $P_I$ of a family $(X_i)_{i \in I}$ of random variables in a finite space $\Omega_I = \prod_{i \in I} \Omega_i$. Let $\mathcal{I}$ be a partition of $I$ and let us consider a DAG $\mathcal{G}$ on $\mathcal{I}$. We say that there is a link from $J'$ to $J''$, where $J'$ and $J''$ are atoms of the partition $\mathcal{I}$, if $(J', J'') \in \mathcal{G}$. If $J \in \mathcal{I}$, we denote by $p(J)$ the set of parents of $J$, that is, the set of $J'$ such that $(J', J) \in \mathcal{G}$.
The probability $P_I$ is defined by the level two Bayesian network (BN2) on $I$, $(I, \mathcal{I}, \mathcal{G}, (P_{J/p(J)})_{J \in \mathcal{I}})$, if for each $J \in \mathcal{I}$ we have the conditional probability $P_{J/p(J)}$, that is, the probability of $X_J$ conditioned by $X_{p(J)}$ (which, if $p(J) = \emptyset$, is the marginal probability $P_J$), so that
$$P_I(x_I) = \prod_{J \in \mathcal{I}} P_{J/p(J)}(x_J / x_{p(J)}). \tag{2.3}$$

The probability distribution $P_I$ associated with the level two BN in Figure 2 can be written as (2.4)

Close Descendants
We define the set of close descendants of a node $J$, denoted $cd(J)$, as the set of vertices consisting of the children of $J$ and the vertices located on a path between $J$ and one of its children.
In the example of Figure 2, we have
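The definition of close descendants can be sketched in Python as follows. This is a minimal illustration on a hypothetical DAG (not the network of Figure 2); the `children` dictionary and the function names are assumptions introduced for the example. A vertex is taken to lie on a path between a node and one of its children when it is a descendant of the node and one of the node's children is in turn a descendant of it.

```python
def descendants(children, v):
    """All nodes reachable from v by a directed path of length >= 1."""
    seen, stack = set(), list(children.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen


def close_descendants(children, j):
    """Children of j plus every vertex on a directed path from j to a child of j."""
    kids = set(children.get(j, ()))
    on_path = {v for v in descendants(children, j)
               if descendants(children, v) & kids}
    return kids | on_path


# Hypothetical DAG: 1 -> 2 -> 3 -> 4 -> 5 and 1 -> 4.
children = {1: [2, 4], 2: [3], 3: [4], 4: [5]}

cd_1 = close_descendants(children, 1)  # 3 is included: it lies on 1 -> 2 -> 3 -> 4
cd_2 = close_descendants(children, 2)  # only the child 3
```

In this example node 3 is not a child of node 1, yet it belongs to the close descendants of 1 because it sits on a path from 1 to its child 4; node 5 lies beyond every child of 1 and is excluded.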

Initial Subset
For each subset $S$, we denote by $S^+$ the initial subset defined by $S$, that is, the set consisting of $S$ itself and the $S'$ such that there is a path in $G$ from $S'$ to $S$. We can identify this subset with the union of all $S'$ such that $S'$ is an ancestor of $S$. For each $S$, the initial subset $S^+$ is a BN; in other words, the restriction of a BN to an initial subset is a BN:
$$P_{S^+} = \prod_{s \in S^+} P_{s/p(s)}. \tag{2.6}$$
In the example above (Figure 2), we have

Inference in Bayesian Networks
Consider the BN in Figure 1. Suppose we are interested in computing the distribution of $X_S$ with $S = I - \{3\}$; in other words, all variables are in the target $S$ except $X_3$.
By marginalizing out the variable $X_3$ from the joint probability distribution $P_I$, the target distribution $P_S$ can be written as a product that includes a factor $\alpha$, where $\alpha$ is a function that depends on $X_1$, $X_2$, $X_4$, $X_5$, and $X_7$ but has nothing to do with $P_{1,2,4,5,7}$, the joint probability distribution of $X_1$, $X_2$, $X_4$, $X_5$, $X_7$.
By doing this we lose the structure of the BN.
If we instead group the variables appropriately when marginalizing, we obtain, according to Bayes' theorem, a factorization that keeps the structure of a BN2. The variables grouped in this marginalization are the close descendants defined above.
More generally, if we have to sum out more than one variable, the variables must first be ordered. The aim of the inference is then to find the marginalization, or elimination, ordering for the arbitrary set of variables not in the target. This aim is shared by other node elimination algorithms such as "variable elimination" [6] and "bucket elimination" [7].
The main idea of all these algorithms is to find the best way to sum over a set of variables from a list of factors, one by one. An ordering of these variables is required as an input. The computation depends on the elimination ordering; different elimination orderings produce different factors.
The algorithm we proposed to solve this problem is called the "Successive Restrictions Algorithm" (SRA) [8]. SRA is a goal-oriented algorithm that tries to find an efficient marginalization (elimination) ordering for an arbitrary joint distribution.
The general idea of the successive restrictions algorithm is to manage the succession of summations over all random variables outside the target $S$ so as to keep on it a structure that is less constraining than a Bayesian network but still saves memory, namely the structure of a level two Bayesian network. This is made possible by the notion of close descendants.
The principle of the algorithm was presented in detail in [8].
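The dependence of the intermediate factors on the elimination ordering can be illustrated with a minimal Python sketch of plain sum-product variable elimination. This is not the SRA; the star-shaped network H → A, H → B, H → C, the binary domains, the numerical values, and all function names are assumptions made for the illustration. A factor is represented as a pair (tuple of variable names, table mapping value tuples to numbers).

```python
from itertools import product


def multiply(f, g, domain=(0, 1)):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    variables = tuple(dict.fromkeys(fv + gv))  # ordered union
    table = {}
    for values in product(domain, repeat=len(variables)):
        a = dict(zip(variables, values))
        table[values] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return variables, table


def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fv, ft = f
    idx = fv.index(var)
    table = {}
    for values, p in ft.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return fv[:idx] + fv[idx + 1:], table


def eliminate(factors, order):
    """Sum out the variables in the given order; also record factor sizes."""
    factors, sizes = list(factors), []
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        sizes.append(len(prod[0]))  # number of variables in the intermediate factor
        factors.append(sum_out(prod, var))
    return factors, sizes


def cpt_given_h(name, p0, p1):
    """Hypothetical table P(name = 1 | H = 0) = p0, P(name = 1 | H = 1) = p1."""
    return (("H", name), {(0, 0): 1 - p0, (0, 1): p0, (1, 0): 1 - p1, (1, 1): p1})


factors = [
    (("H",), {(0,): 0.7, (1,): 0.3}),
    cpt_given_h("A", 0.2, 0.8),
    cpt_given_h("B", 0.5, 0.1),
    cpt_given_h("C", 0.4, 0.9),
]

leaves_first, sizes_leaves_first = eliminate(factors, ["B", "C", "H"])
hub_first, sizes_hub_first = eliminate(factors, ["H", "B", "C"])
```

Both orderings leave a single factor over A with the same values, but eliminating the hub H first builds an intermediate factor over four variables, whereas eliminating the leaves first never involves more than two.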

D-Separation and Computations in a Bayesian Network
We have introduced an algorithm that makes possible the computation of the probability distribution of a subset of random variables $(X_s)_{s \in S}$ of the initial graph. It is also possible to use the SRA to compute the probability distribution $P_{A|B}$ of any set of variables $X_A$ conditional on another subset $X_B$. This algorithm tries to reach the target distribution by finding a marginalization ordering that takes into account the computational constraints of the application. It may happen that, in certain simple cases, the SRA is less powerful than the traditional methods [6, 9-13], but it has the advantage of adapting to any subset of nodes of the initial graph and of presenting, at each stage, results that are interpretable in terms of conditional probabilities and thus technically usable.
In addition to the SRA we propose, especially for large Bayesian networks, to segment the computations into several lighter computations that can be carried out independently. These segmentations are possible using D-separation.

D-Separations and Classical Results
Consider a DAG $(I, G)$. A chain is a sequence $x_0, \ldots, x_n$ of elements of $I$ such that, for all $i \geq 1$, $(x_{i-1}, x_i) \in G$ or $(x_i, x_{i-1}) \in G$; the nodes $x_1, \ldots, x_{n-1}$ are called intermediate nodes on this chain. At an intermediate node $x_i$ a chain can have three types of connection, as illustrated in Figure 4.
Let $(I, G)$ be a DAG, $S \subseteq I$, and let $a$ and $b$ be distinct nodes in $\bar{S} = I - S$. A chain between $a$ and $b$ is d-separated by $S$ if there is an intermediate node $x$ satisfying one of the two properties: (i) $x \in S$ and the connection at $x$ is serial or diverging; (ii) $x \notin S$, $x$ has no descendants in $S$, and the connection at $x$ is converging.
In other words, a chain is not d-separated by $S$ if the connection is converging at each intermediate node belonging to $S$, and serial or diverging at each intermediate node that neither belongs to $S$ nor has descendants in $S$.
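The test for a single chain can be sketched in Python as below. This is a minimal illustration under stated assumptions: the DAG is hypothetical (1 → 3 ← 2, 3 → 4), the `parents` dictionary and function names are introduced for the example, and a converging node is taken to block the chain only when neither the node nor any of its descendants belongs to S, which is the usual form of the criterion.

```python
def descendants(children, v):
    """All nodes reachable from v by a directed path of length >= 1."""
    seen, stack = set(), list(children.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen


def chain_is_d_separated(parents, chain, S):
    """True if some intermediate node of the chain blocks it given S."""
    children = {}
    for child, pas in parents.items():
        for p in pas:
            children.setdefault(p, set()).add(child)
    S = set(S)
    for prev, x, nxt in zip(chain, chain[1:], chain[2:]):
        # Converging connection: both neighbours on the chain are parents of x.
        converging = prev in parents.get(x, ()) and nxt in parents.get(x, ())
        if x in S and not converging:
            return True  # serial or diverging node inside S
        if converging and x not in S and not (descendants(children, x) & S):
            return True  # converging node with nothing observed at or below it
    return False


# Hypothetical DAG given by parent sets: 1 -> 3 <- 2 and 3 -> 4.
parents = {3: (1, 2), 4: (3,)}

blocked_by_empty = chain_is_d_separated(parents, (1, 3, 2), set())
blocked_by_3 = chain_is_d_separated(parents, (1, 3, 2), {3})
blocked_by_4 = chain_is_d_separated(parents, (1, 3, 2), {4})
serial_blocked = chain_is_d_separated(parents, (1, 3, 4), {3})
```

The chain 1, 3, 2 is blocked when nothing is observed, and becomes unblocked as soon as the converging node 3 or its descendant 4 enters S; the serial chain 1, 3, 4 behaves the other way round.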

Classic Result 1
If $A$ and $B$ are d-separated by $S$, then the variables $X_A$ and $X_B$ are independent conditional on $X_S$, as we can see in the following example:

Classic Result 2
$X_C$ is d-separated from the rest of the variables conditional on the Markov blanket of $C$. The proofs of these two results, and of other results related to D-separation, can be found in [11].

Moral and Hypermoral Graphs
Another classic graphical notion, used by some inference algorithms found in the literature, is that of the moral graph.

Moral Graph
Given a DAG $(I, G)$, its associated moral graph is the undirected graph $(I, G^m)$, where $G^m$ is the set: (i) of pairs $\{i, j\}$ such that $(i, j) \in G$ or $(j, i) \in G$; (ii) of pairs $\{i, j\}$ such that $i$ and $j$ have a child in common.
In a similar way, we define what we call the hypermoral graph as follows:

Hypermoral Graph
Given a DAG $(I, G)$, its associated hypermoral graph is the undirected graph $(I, G^{hm})$, where $G^{hm}$ is the set: (i) of pairs $\{i, j\}$ such that $(i, j) \in G$ or $(j, i) \in G$; (ii) of pairs $\{i, j\}$ such that $i$ and $j$ have a close descendant in common.
In Figure 5, there is a link between 4 and 7 in the hypermoral graph because they share 9 as a close descendant.
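The difference between the two graphs can be sketched in Python on a small hypothetical DAG (not the network of Figure 5). The `children` dictionary and the function names are assumptions introduced for the example; both graphs are taken to contain the undirected version of every edge of the DAG, plus the pairs sharing a child (moral) or a close descendant (hypermoral).

```python
from itertools import combinations


def descendants(children, v):
    """All nodes reachable from v by a directed path of length >= 1."""
    seen, stack = set(), list(children.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen


def close_descendants(children, j):
    """Children of j plus every vertex on a directed path from j to a child of j."""
    kids = set(children.get(j, ()))
    return kids | {v for v in descendants(children, j)
                   if descendants(children, v) & kids}


def linked_pairs(children, nodes, shared):
    """Undirected DAG edges plus the pairs whose `shared` sets intersect."""
    edges = {frozenset((a, c)) for a in nodes for c in children.get(a, ())}
    edges |= {frozenset((a, b)) for a, b in combinations(nodes, 2)
              if shared[a] & shared[b]}
    return edges


def moral_edges(children, nodes):
    return linked_pairs(children, nodes,
                        {v: set(children.get(v, ())) for v in nodes})


def hypermoral_edges(children, nodes):
    return linked_pairs(children, nodes,
                        {v: close_descendants(children, v) for v in nodes})


# Hypothetical DAG: 1 -> 2 -> 3 -> 4, 1 -> 4, 5 -> 3.
children = {1: [2, 4], 2: [3], 3: [4], 5: [3]}
nodes = [1, 2, 3, 4, 5]

moral = moral_edges(children, nodes)
hypermoral = hypermoral_edges(children, nodes)
extra = hypermoral - moral  # links created only by shared close descendants
```

Here nodes 1 and 5 have no child in common, but node 3 is a close descendant of both, so the pair {1, 5} appears in the hypermoral graph only. Since every child is a close descendant, the moral graph is always contained in the hypermoral graph.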

Moral and Hypermoral Partitions
The moral graph helps define the moral partition as follows, where $S^+$ denotes the initial subset defined by $S$.
(i) We call $S$-moral partition the partition of $S^+ - S$, denoted $\mathcal{P}^m_{S^+ - S}$, defined by the equivalence relation $\sim_{S^+ - S}$, where $x \sim_{S^+ - S} y$ means "there exists a chain from $x$ to $y$, in $S^+ - S$, not blocked by $S$" or, equivalently, "there exists a chain, in the moral graph $G^m$, connecting $x$ to $y$ without an intermediate node in $S$." In a similar way we define the hypermoral partition.
(ii) We call $S$-hypermoral partition the partition of $S^+ - S$, denoted $\mathcal{P}^{hm}_{S^+ - S}$, defined by the equivalence relation $\approx_{S^+ - S}$, where $x \approx_{S^+ - S} y$ means "there exists a chain, in the hypermoral graph $G^{hm}$, connecting $x$ to $y$ without an intermediate node in $S$", as illustrated in Figure 6.

Results
The following results show that the computation of the probability distribution $P_S$ can be segmented. The proof of the theorem can be found in [14].

Unique Partition
We have seen in the last two sections that applying the SRA to the computation of $P_S$ provides a BN2 structure, and that the use of D-separation properties allows the segmentation of the computation of $P_S$ and also provides a level two Bayesian network structure on $S$. In fact the two structures obtained are the same; this result is given by the following theorem.

Interpretation
This theorem indicates that the level two Bayesian network characterizing the probability distribution $P_S$, obtained by application of the SRA, is unique, independently of the choices made while running the algorithm. This unique partition consists of sets of the two types (1) and (2) mentioned above.
Proof. Let us show that the partition of the target $S$ consists of the subsets of types (1) and (2) mentioned in the theorem, in other words consists of $T(C) \cap S$, where $C \in \mathcal{P}^{hm}_{S^+ - S}$, and $\{k\}$ for all $k \in K$.
As $S^+$ is a BN containing $S$, we can, without loss of generality, limit ourselves to the case where $S^+ = I$. The application of the SRA for the computation of $P_S$ requires marginalizing out the set of variables indexed by $\bar{S} = I - S$ following a specific order. Let us show that the partition obtained by application of the SRA is the one mentioned above.
Let us prove this result by induction on the cardinality of $\bar{S}$.
Let us assume that $\mathrm{Card}(\bar{S}) = 1$. In this case $\bar{S}$ has only one element, $\bar{S} = \{i\}$. On the one hand, marginalizing out the variable $X_i$ by application of the SRA creates a new node $E_i$ that contains the close descendants of $i$ (i.e., $E_i = cd(i)$). The BN2 resulting from this marginalization is formed of the new node $E_i$ along with all other remaining nodes, in other words all the $\{k\}$ such that $k \in I - (\{i\} \cup cd(i)) = S - E_i$, as shown in Figure 8.
On the other hand, since $\mathrm{Card}(\bar{S}) = 1$, $\mathcal{P}^{hm}_{\bar{S}}$ consists of a unique equivalence class, $C = \{i\}$, so, by definition, the partition of $S$ consists of $T(C) \cap S = cd(i)$ and of the singletons $\{k\}$ for $k \in K$. This shows the result in this first case. Let us suppose now that $\mathrm{Card}(\bar{S}) = n$, and let $(i_1, \ldots, i_n)$ be a hierarchical order on $\bar{S}$. We are going to sum out following the inverse of the given hierarchical order. Let us assume that the result holds up to step $\ell$, in other words after marginalizing out $i_{n-\ell+1}$, and let us prove the result for step $\ell + 1$, in other words for marginalizing out $i_{n-\ell}$.

Figure 1: An example of a Bayesian network.

Figure 3: Level two Bayesian network on S.

Figure 4: (a) A converging connection. (b) A diverging connection. (c) A serial connection.
Given a subset $C$ of $I$, we define the following sets.
Markov blanket of $C$: $F(C)$ is made of the parents of $C$, the children of $C$, and the variables sharing a child with $C$.
Markov boundary of $C$: $M(C) = C \cup F(C)$.
Close descent of $C$: $T(C)$ is the set of all close descendants of the elements of $C$ other than those in $C$ itself.
Exterior roots of $C$: $R(C)$ is the set of the parents of the elements of $C \cup T(C)$ other than those in $C \cup T(C)$ itself.

Figure 6: (a) $\mathcal{P}^{hm}_{\bar{S}}$: the $S$-hypermoral partition. (b) $\mathcal{P}^{m}_{\bar{S}}$: the $S$-moral partition.

Theorem 4.1.
Let $(I, G, P_I)$ be a BN, and let $S$ be a subset of $I$ with initial subset $S^+$. Let $\mathcal{P}^{hm}_{S^+ - S}$ be the $S$-hypermoral partition of $S^+ - S$ and let $K$ be the set of elements of $S$ which are not close descendants of any element of $S^+ - S$, that is, $K = \{k \in S : \forall y \in S^+ - S,\ k \notin cd(y)\}$. Then,

Theorem 4.2.
The set $Q_s$ of singletons $\{k\}$, where $k \in K$ (if $K \neq \emptyset$), and of subsets $T(C) \cap S$, where $C \in \mathcal{P}^{hm}_{S^+ - S}$, constitutes a partition of $S$, as illustrated in Figure 7.
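The construction of this partition can be sketched in Python on a hypothetical DAG (not the network of Figure 7). The sketch assumes that the initial subset of S is the whole of I, so that the nodes to sum out are simply I − S; the `children` dictionary, the function names, and the example are assumptions introduced for the illustration.

```python
def descendants(children, v):
    """All nodes reachable from v by a directed path of length >= 1."""
    seen, stack = set(), list(children.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen


def close_descendants(children, j):
    """Children of j plus every vertex on a directed path from j to a child of j."""
    kids = set(children.get(j, ()))
    return kids | {v for v in descendants(children, j)
                   if descendants(children, v) & kids}


def partition_of_target(children, nodes, S):
    """Return (K, blocks): the singleton elements and the sets T(C) & S."""
    S = set(S)
    removed = set(nodes) - S  # assumption: the initial subset of S is all of I
    cd = {v: close_descendants(children, v) for v in nodes}

    def linked(a, b):
        # Hypermoral link: an edge of the DAG or a shared close descendant.
        return (b in children.get(a, ()) or a in children.get(b, ())
                or bool(cd[a] & cd[b]))

    # Hypermoral classes: connected components among the removed nodes.
    classes, unassigned = [], set(removed)
    while unassigned:
        comp, frontier = set(), [unassigned.pop()]
        while frontier:
            a = frontier.pop()
            comp.add(a)
            nbrs = {b for b in unassigned if linked(a, b)}
            unassigned -= nbrs
            frontier.extend(nbrs)
        classes.append(comp)

    blocks = [(set().union(*(cd[c] for c in C)) - C) & S for C in classes]
    K = {k for k in S if all(k not in cd[y] for y in removed)}
    return K, blocks


# Hypothetical DAG: 1 -> 2 -> 3 -> 4 -> 6, 1 -> 4, 5 -> 3, 8 -> 7 -> 6.
children = {1: [2, 4], 2: [3], 3: [4], 4: [6], 5: [3], 7: [6], 8: [7]}
nodes = range(1, 9)
S = {3, 4, 6, 7}

K, blocks = partition_of_target(children, nodes, S)
```

In this example the removed nodes split into the hypermoral classes {1, 2, 5} and {8}, which give the blocks {3, 4} and {7}, while node 6 is not a close descendant of any removed node and forms a singleton. Summing out $X_1$, $X_2$, $X_5$, and $X_8$ by hand from the factorized joint distribution gives the same three atoms.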

Figure 7: (a) $\mathcal{P}^{hm}_{\bar{S}}$: the $S$-hypermoral partition. (b) BN2 representing $P_S$.

Theorem 5.1.
The following two sets: (1) the subsets $T(C) \cap S$, where $C \in \mathcal{P}^{hm}_{S^+ - S}$; (2) the set $Q_s$ of singletons $\{k\}$, where $k \in K$ (if $K \neq \emptyset$), constitute a unique partition defining a BN2 on $S$.

Figure 8: (a) Fraction of a BN before marginalizing out $i$. (b) Fraction of a BN2, on $S$, obtained after summing out $i$.

Figure 9: (a) Fraction of a BN2 on $A$. (b) BN2 obtained after summing out $i_{n-\ell}$.