Finding the optimal Bayesian network given a constraint graph

Despite recent algorithmic improvements, learning the optimal structure of a Bayesian network from data is typically infeasible past a few dozen variables. Fortunately, domain knowledge can frequently be exploited to achieve dramatic computational savings, and in many cases domain knowledge can even make structure learning tractable. Several methods have previously been described for representing this type of structural prior knowledge, including global orderings, super-structures, and constraint rules. While super-structures and constraint rules are flexible in terms of what prior knowledge they can encode, they achieve savings in memory and computational time simply by avoiding considering invalid graphs. We introduce the concept of a "constraint graph" as an intuitive method for incorporating rich prior knowledge into the structure learning task. We describe how this graph can be used to reduce the memory cost and computational time required to find the optimal graph subject to the encoded constraints, beyond merely eliminating invalid graphs. In particular, we show that a constraint graph can break the structure learning task into independent subproblems even in the presence of cyclic prior knowledge. These subproblems are well suited to being solved in parallel on a single machine or distributed across many machines without excessive communication cost.

A simple method is to specify an ordering on the variables and require that the parents of a variable precede it in the ordering (Cooper & Herskovits, 1992). This representation leads to tractable structure learning because the parent set for each variable can be identified independently from the other variables. Unfortunately, prior knowledge is typically more ambiguous than a full topological ordering and may only exist for some of the variables. A more general approach to handling prior knowledge is to employ a "super-structure," i.e., an undirected graph that defines the superset of edges allowed in valid learned structures, forbidding all others (Perrier et al., 2008). This method has been fairly well studied and can also be used as a heuristic if defined through statistical tests instead of prior knowledge. A natural extension of the undirected super-structure is the directed super-structure (Ordyniak & Szeider, 2013), but to our knowledge the only work done on directed super-structures proved that an acyclic directed super-structure is solvable in polynomial time. An alternate, but similar, concept is to define which edges must or cannot exist as a set of rules (Campos & Ji, 2011). However, these rule-based techniques do not specify how one would exploit the constraints to reduce the computational time beyond simply skipping over invalid graphs.

We propose the idea of a "constraint graph" as a method for incorporating prior information into the BNSL task. A constraint graph is a directed graph where each node represents a set of variables in the BNSL problem and edges represent which variables are candidate parents for which other variables. The primary advantage of constraint graphs over other methods is that the structure of the constraint graph can be used to achieve savings in both memory cost and computational time beyond simply eliminating invalid structures. This is done by breaking the problem into independent subproblems even in the presence of cyclic prior knowledge. An example of such cyclic prior knowledge is identifying two groups of variables that can draw parents only from each other, similar to a bipartite graph; in this setting it can be difficult to identify the best parents for each variable without introducing a cycle into the learned structure. In addition, constraint graphs are visually more intuitive than a set of written rules while also typically being simpler than a super-structure, because constraint graphs are defined over sets of variables instead of the original variables themselves. This intuition, combined with automatic methods for identifying parallelizable subproblems, makes constraint graphs easy for practitioners to use.

Figure 1: A constraint graph grouping variables. (a) We wish to learn a Bayesian network over 11 variables. The variables are colored according to the group that they belong to, which is defined by the user. These variables can either (b) be organized into a directed super-structure or (c) grouped into a constraint graph to encode equivalent prior knowledge. Both graphs define the superset of edges which can exist, but the constraint graph uses far fewer nodes and edges to encode this knowledge. (d) Either technique can then be used to guide the BNSL task to learn the optimal Bayesian network given the constraints.
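To make the representation concrete, a constraint graph like the one in Figure 1 can be encoded as a directed graph whose nodes are sets of variables. The sketch below is our own illustration (the grouping and variable names are assumptions, not taken from the figure) and uses networkx; `candidate_parents` is a hypothetical helper, not part of any library.

```python
import networkx as nx

# Each constraint-graph node is a frozenset of BNSL variables; an edge
# (U, V) states that variables in V may draw parents only from U.
group_a = frozenset({"x1", "x2", "x3"})
group_b = frozenset({"x4", "x5", "x6", "x7"})
group_c = frozenset({"x8", "x9", "x10", "x11"})

cg = nx.DiGraph()
cg.add_edges_from([(group_a, group_b), (group_b, group_c)])

def candidate_parents(cg, variable):
    """Union of all variables allowed as parents of `variable`."""
    node = next(n for n in cg if variable in n)
    allowed = set()
    for pred in cg.predecessors(node):
        # A self loop (pred == node) permits within-node parents.
        allowed |= pred - {variable}
    return allowed
```

Here `group_a` has no incoming edges, so its variables are roots; the other groups may draw parents only from the preceding layer.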
The optimal parents for one variable are independent of the optimal parents for another variable, given that the variables do not form a cycle in the resulting Bayesian network. This acyclicity requirement is typically computationally challenging to enforce because a cycle can involve more variables than the ones being directly considered, such as a graph which is simply a directed loop over all variables. However, given an acyclic constraint graph or an acyclic directed super-structure, it is impossible to form a cycle in the resulting structure; hence, the optimal parent set for each variable can be identified independently from all other variables. A convenient property of constraint graphs, and one of their advantages relative to other methods, is that independent subproblems can be found through global parameter independence even in constraint graphs which contain cycles. We describe in Section 3.2 the exact algorithm for finding optimal parent sets for each case one can encounter in a constraint graph. Briefly, the constraint graph is first broken up into its strongly connected components (SCCs), which identify which variables can have their parent sets found independently from all other variables ("solving a component") without the possibility of forming a cycle in the resulting graph. Typically these SCCs will be single nodes from the constraint graph, but they may be comprised of multiple nodes if cyclic prior knowledge is being represented. In the case of an acyclic constraint graph, all SCCs will be single nodes, and in fact each variable can be optimized without needing to consider other variables, in line with theoretical results from Ordyniak & Szeider (2013). In addition to allowing these problems to be solved in parallel, this breakdown suggests a more efficient method of sharding the data in a distributed learning context. Specifically, one can assign an entire SCC of the constraint graph to a machine, including all columns of data corresponding to the variables in that SCC and all variables in its parent nodes.

It is possible to convert any directed super-structure into a constraint graph and vice versa, though it is far simpler to go from a constraint graph to a directed super-structure. To convert from a directed super-structure to a constraint graph, one must first identify all strongly connected components that contain more than a single variable. All variables in such a strongly connected component can be put into the same constraint-graph node, which is given a self loop. Then, one tabulates the unique parent and child sets a variable can have. All variables outside of the previously identified strongly connected components that share the same parent and child sets can be grouped together into a node in the constraint graph. Edges then connect these sets based on the parent sets specified for each node. In the situation where a node in the constraint graph can draw parents from only a subset of the variables in a node created by the identification of the strongly connected components, that node must be broken into two nodes, each with a self loop and with edges connecting the two, to allow only a subset of those variables to serve as parents of another node.

In contrast, to convert from a constraint graph to a directed super-structure, one simply draws, for each node, an edge from every variable in the current node to every variable in the node's children. We suggest that constraint graphs are the more intuitive method, both because of their simpler representation and because of the ease of extracting computational benefits from the task.

The goal of the algorithm is to identify the optimal Bayesian network defined over the set of variables without having to repeat any calculations and without having to use excessive memory. This is done by defining additional graphs, the parent graphs and the order graph. We will refer to each node in these graphs as "entries" to distinguish them from the nodes of the constraint graph and of the learned Bayesian network. A parent graph is defined for each variable and can be described as a lattice, where the entries in some layer i correspond to combinations of all other variables of size i. Each entry is connected to the entries in the previous layer that are subsets of that entry, such that (X_1, X_2) would be connected to both (X_1) and (X_2). For each entry, the score of the variable is calculated using the parents in the entry and compared to the scores held in the parent entries, recording only the best scoring value and parent set amongst them. These entries then hold the dynamically calculated best parent set and associated score, allowing constant-time lookups later on of the best parent set given a set of possible parents. The order graph is structured in the same manner as the parent graphs, except over all variables. In contrast with the parent graphs, it is the edges that store the useful information, in the form of the score associated with adding a given variable to the set of seen variables stored in the entry, along with the parent set that yields this score. Each path from the empty root node to the leaf node containing the full set of variables encodes the optimal network given a topological sort of the variables, and the shortest path encodes the optimal network. This data structure reduces the time required to find the optimal network by avoiding redundant score calculations.
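As a concrete illustration of these two data structures, the following is a minimal sketch of the unconstrained dynamic program. It assumes a decomposable `local_score(variable, parent_set)` function, such as MDL, where lower values are better; all names are our own, not from any library.

```python
from itertools import combinations

def build_parent_graph(v, candidates, local_score):
    """Lattice over subsets of `candidates`: each entry stores the best
    (score, parent_set) among all of its subsets, enabling constant-time
    lookup of the best parents given a set of allowed parents."""
    best = {frozenset(): (local_score(v, frozenset()), frozenset())}
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            S = frozenset(subset)
            entry = (local_score(v, S), S)
            for x in S:  # compare against entries one layer down
                if best[S - {x}][0] < entry[0]:
                    entry = best[S - {x}]
            best[S] = entry
    return best

def optimal_network(variables, local_score):
    """Order graph: shortest path from the empty entry to the entry
    holding every variable; each step adds one variable whose parents
    are drawn from the variables already seen."""
    variables = frozenset(variables)
    pg = {v: build_parent_graph(v, variables - {v}, local_score)
          for v in variables}
    cost, choice = {frozenset(): 0.0}, {}
    layers = [frozenset(S) for size in range(len(variables) + 1)
              for S in combinations(sorted(variables), size)]
    for seen in layers:  # processed in order of increasing size
        for v in variables - seen:
            score, parents = pg[v][seen]
            new, total = seen | {v}, cost[seen] + score
            if total < cost.get(new, float("inf")):
                cost[new], choice[new] = total, (v, parents)
    network, seen = {}, variables
    while seen:  # backtrack along the shortest path
        v, parents = choice[seen]
        network[v] = parents
        seen -= {v}
    return network
```

On n variables this still enumerates all 2^n lattice entries; the constraint-graph machinery below exists precisely to shrink these lattices.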

156
Structure learning is flexible with respect to the score function used to identify the optimal graph. There are many score functions, which typically aim to penalize the log likelihood of the data by the complexity of the model. Most of these scores have the decomposability property, such that the score for a dataset given a model is equal to the sum of the scores of each variable given its parents. While constraint graphs remain agnostic to the specific score function used, we assume that MDL is used, as it has several desirable computational benefits. For review, MDL defines the score as

MDL(B | D) = -log P(D | B) + (log N / 2) |B|,

where N is the number of samples and |B| is the number of parameters in the network. The term "minimum description length" reflects that this score can be read as the number of bits needed to encode both the model and the data given the model.
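Under this definition, a minimal sketch of the per-variable MDL term for discrete data might look as follows; representing `data` as a list of dicts and supplying an `arity` mapping are our own assumptions.

```python
import math
from collections import Counter

def mdl_local_score(v, parents, data, arity):
    """-log P(column v | parents) plus the MDL complexity penalty;
    lower is better, and summing over variables scores a network."""
    parents = tuple(parents)
    joint = Counter((tuple(row[p] for p in parents), row[v]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log likelihood: sum of n * log(n / n_parent).
    log_lik = sum(n * math.log(n / marginal[pa]) for (pa, _), n in joint.items())
    # Free parameters: (arity(v) - 1) per joint parent configuration.
    n_params = (arity[v] - 1) * math.prod(arity[p] for p in parents)
    return -log_lik + 0.5 * math.log(len(data)) * n_params
```

This plugs into the earlier sketch via, e.g., `local_score = functools.partial(mdl_local_score, data=rows, arity=arity)`.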
[Figure: The resulting order graph during the BNSL task for a cyclic constraint graph. It is significantly sparser than in the typical BNSL task because, after choosing a variable to start the topological ordering, the remaining variables must be added in the order defined by the cycle.]

Each SCC of the constraint graph can be solved independently. In many cases the SCC will be a single node of the constraint graph, because prior knowledge is typically not cyclic. In general, the SCCs of a constraint graph can be solved in any order due to the global parameter independence property.
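A sketch of this decomposition, assuming the networkx-based encoding from earlier; `solve_component` stands in for the per-case algorithms of Section 3.2 and the names are ours.

```python
import networkx as nx

def solve_by_components(cg, solve_component):
    """Solve each SCC of the constraint graph independently; global
    parameter independence means any order (or parallel dispatch) works."""
    network = {}
    for scc in nx.strongly_connected_components(cg):
        scc = set(scc)                 # constraint-graph nodes (variable sets)
        inside = set().union(*scc)     # variables whose parents we now choose
        outside = set()                # variables usable only as parents
        for node in scc:
            for pred in cg.predecessors(node):
                if pred not in scc:
                    outside |= pred
        # In a distributed setting, ship the `inside` and `outside`
        # data columns to a single worker machine.
        network.update(solve_component(scc, inside, outside))
    return network
```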

176
The algorithm for solving an SCC of a constraint graph is a straightforward modification of the dynamic programming algorithm described above. Specifically, parent graphs are created for each variable in the SCC, but defined only over the union of possible parents for that variable. Consider the case of a simple four-node cycle with no self loops: each variable's parent graph spans only the variables of the preceding node in the cycle, the resulting order graph is significantly sparser than in the unconstrained task (see the figure above), and there is no additional memory cost if the components are solved sequentially.
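A sketch of this modification, reusing `build_parent_graph` and the order-graph expansion from the earlier sketch; `candidates[v]` is the union of possible parents of v permitted by the constraint graph.

```python
from itertools import combinations

def optimal_scc_network(variables, candidates, local_score):
    """Joint order-graph search over one SCC; parent graphs are built
    only over each variable's allowed candidates, shrinking the lattices."""
    variables = frozenset(variables)
    pg = {v: build_parent_graph(v, frozenset(candidates[v]), local_score)
          for v in variables}
    cost, choice = {frozenset(): 0.0}, {}
    layers = [frozenset(S) for size in range(len(variables) + 1)
              for S in combinations(sorted(variables), size)]
    for seen in layers:
        for v in variables - seen:
            # Only already-seen candidates may serve as parents, so the
            # learned network cannot contain a cycle.
            score, parents = pg[v][seen & frozenset(candidates[v])]
            new, total = seen | {v}, cost[seen] + score
            if total < cost.get(new, float("inf")):
                cost[new], choice[new] = total, (v, parents)
    network, seen = {}, variables
    while seen:
        v, parents = choice[seen]
        network[v] = parents
        seen -= {v}
    return network
```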

The algorithm described above has five natural cases, which are described below.

One node, no parents, no self loop: The variables in this node have no candidate parents, so nothing needs to be done to find the optimal parent sets given the constraints. This naturally takes O(1) time to solve.

One node, no parents, self loop: This is equivalent to exact BNSL with no prior knowledge, restricted to the variables in the node.

One node, parents, no self loop: Because no cycle can form in the learned network, the optimal parent set for each variable in the node can be identified independently by searching over subsets of the variables in the parent nodes.

One node, parents, self loop: This case resembles exact BNSL over the variables in the node, but the memory cost grows with the size of the parent nodes as well. This is because we only need to define a parent graph for the variables in the node we are currently considering, but these parent graphs must be defined over all variables in the node plus all the variables in the parent nodes.
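The four single-node cases admit a simple dispatch, sketched below; the multiple-node case, described next, falls back to the joint algorithm. This framing and all helper names reuse our earlier sketches and are not code from the paper.

```python
def solve_single_node(node, parent_vars, has_self_loop, local_score):
    """node: frozenset of variables; parent_vars: union of variables in
    the node's constraint-graph parent nodes (empty if there are none)."""
    if not parent_vars and not has_self_loop:
        # Case 1: no candidate parents anywhere -- O(1), nothing to do.
        return {v: frozenset() for v in node}
    if not parent_vars and has_self_loop:
        # Case 2: unconstrained exact BNSL over just this node.
        return optimal_network(node, local_score)
    if parent_vars and not has_self_loop:
        # Case 3: no cycle is possible, so each variable independently
        # reads off the best entry of its parent graph over parent_vars.
        return {
            v: build_parent_graph(v, parent_vars, local_score)[frozenset(parent_vars)][1]
            for v in node
        }
    # Case 4: self loop plus parents -- joint search over the node, with
    # parent graphs spanning the node and its parent nodes.
    candidates = {v: (node | parent_vars) - {v} for v in node}
    return optimal_scc_network(node, candidates, local_score)
```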

Multiple nodes: The algorithm as presented initially is used to solve an entire component at the same time.

Table 2: Algorithm comparison on a node with a self loop and other parents. The exact algorithm and the constrained algorithm proposed here were run on an SCC comprised of a main node with a self loop and one parent node. Shown are the results of increasing the number of variables in the main node while keeping the number of variables in the parent node fixed at 5, and the results of increasing the number of variables in the parent node while keeping the number of variables in the main node constant. For both algorithms we show the number of nodes across all parent graphs (PGN), the number of nodes in the order graph (OGN), the number of edges in the order graph (OGE), and the time to compute.
A popular Bayesian network classifier is the naive Bayes classifier, which defines a single class variable as the parent of all feature variables. A natural extension to this method is to learn which features are useful, instead of assuming they all are, thereby combining feature selection with parameter learning in a manner that has some similarities to decision trees. This approach can be modeled by using a constraint graph that has all feature variables X in one node and all target variables y in its parent node, such that y → X. We empirically evaluated the performance of learning a simple Bayesian network classifier in this manner.

We then compared the exact algorithm to the constrained algorithm on an SCC comprised of a main node with a self loop and one parent node (Table 2). The data consisted of randomly generated binary values, because the running time does not depend on the presence of underlying structure in the data. We note that in all cases there are significant speed improvements and simpler graphs, and that the speed improvements are particularly encouraging when the number of variables in the main node is increased. This suggests that it is always worth the time to identify which variables can be moved from a node with a self loop to a separate node.
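Returning to the classifier described above, its constraint graph is a single edge y → X with no self loops, which is Case 3 of the dispatch: each feature independently keeps y as a parent only when doing so improves its score. A sketch with illustrative names:

```python
import networkx as nx

features = frozenset(f"x{i}" for i in range(1, 11))  # illustrative names
target = frozenset({"y"})

cg = nx.DiGraph()
cg.add_edge(target, features)  # features may draw parents only from y

# Case 3 per feature: keep y as a parent only if it lowers the score,
# i.e. selected = {v for v in features
#                  if local_score(v, ("y",)) < local_score(v, ())}
```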

Lastly, we consider constraint graphs that encode cyclic prior knowledge. We visually inspect the results from cyclic constraint graphs to ensure that they do not produce cyclic Bayesian networks even when the potential exists. Two separate constraint graphs are inspected, a two-node cycle and a four-node cycle (Fig. 4a/c). The dataset is comprised of random binary values, where the value of one variable in the cycle is copied to the other variables in the cycle to add synthetic structure. However, by jointly solving all nodes in a cycle, cycles are avoided in the learned networks while the dependencies are still captured (Fig. 4b/d). We then compare the exact algorithm without constraints to the use of an appropriate constraint graph in a similar manner as before (Table 3). This is done first for four-node cycles where we increase the number of variables per node.

Table 3: Algorithm comparison on a cyclic constraint graph. The exact algorithm and the constrained algorithm proposed here were run for four-node cycles with differing numbers of variables, cycles with different numbers of nodes but three variables per node, and differing numbers of samples for a four-node, three-variable cycle. All experiments with differing numbers of variables or nodes were run on 1000 randomly generated samples.

We expect constraint graphs to be most useful for problems with large numbers of variables and many constraints present from prior knowledge. We anticipate that the automatic manner in which parallelizable subtasks are identified in a constraint graph will be of particular interest given the recent increase in availability of distributed computing.
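For instance, the two-node cyclic experiment can be encoded and sanity-checked as follows, a sketch with illustrative names; `solve_by_components` is the helper sketched earlier.

```python
import networkx as nx

group_a = frozenset({"a1", "a2", "a3"})
group_b = frozenset({"b1", "b2", "b3"})

cg = nx.DiGraph()
cg.add_edge(group_a, group_b)
cg.add_edge(group_b, group_a)  # cyclic prior knowledge: one SCC

# network = solve_by_components(cg, solve_component)
def is_acyclic(network):
    """The learned parent sets must form a DAG despite the cyclic prior."""
    g = nx.DiGraph((p, child) for child, ps in network.items() for p in ps)
    g.add_nodes_from(network)  # include variables with no edges
    return nx.is_directed_acyclic_graph(g)
```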

Although the networks learned in this paper are discrete, the same principles can be applied to all types of Bayesian networks. Because the constraint graph represents only a restriction on the parent set on a variable-by-variable basis, the same algorithms that are used to learn linear Gaussian or hybrid networks can be seamlessly combined with the idea of a constraint graph. In addition, most of the approximation algorithms that have been developed for BNSL can be modified to take constraints into account, because the constraints simply encode a limitation on the parent set of each variable.

One could extend constraint graphs in several interesting ways. The first is to assign weights to the edges, so that a weight represents the prior probability that the variables in the parent set are parents of the variables in the child set, perhaps incorporated as pseudocounts when coupled with a Bayesian scoring function. A second way is to incorporate "hidden nodes": variables that are not directly observed in the data.