Strengthening Probabilistic Graphical Models: The Purge-and-merge Algorithm

Probabilistic graphical models (PGMs) are powerful tools for solving systems of complex relationships over a variety of probability distributions. However, while tree-structured PGMs always result in efficient and exact solutions, inference on graph (or loopy) structured PGMs is not guaranteed to discover the optimal solutions. It is in principle possible to convert loopy PGMs to an equivalent tree structure, but this is usually impractical for interesting problems due to exponential blow-up. To address this, we developed the purge-and-merge algorithm. This algorithm iteratively nudges a malleable graph structure towards a tree structure by selectively merging factors. The merging process is designed to avoid exponential blow-up by way of sparse structures from which redundancy is purged as the algorithm progresses. We set up tasks to test the algorithm on constraint-satisfaction puzzles such as Sudoku, Fill-a-pix, and Kakuro, and it outperformed other PGM-based approaches reported in the literature. While the tasks we set focussed on the binary logic of CSP, we believe the purge-and-merge algorithm could be extended to general PGM inference.


I. INTRODUCTION
We have successfully created flexible probabilistic graphical model (PGM) structures to solve constraint-satisfaction problems (CSPs) that cannot be solved with existing PGM inference techniques. This entailed the creation of an exact CSP solver that preserves all solutions.
We did not set out to explore modern constraintsatisfaction problem solving in general, but rather to incorporate constraint-satisfaction capabilities into PGMs. Central to this work is a PGM technique called purge-and-merge. It is the combination of three established probabilistic techniques: building cluster graphs [5], applying loopy belief propagation [6], and merging factors into a joint space. Together, these techniques enable purge-and-merge to allow the growth of factors via factor merging while also removing redundancies in the CSP problem space via loopy-belief propagation. We can thus solve a range of CSPs that would be too intricate for either loopy-belief propagation or factor merging. Our experimental study shows that purge-and-merge reliably solves problems too difficult for other belief-propagation approaches [2]- [5].
Purge-and-merge provides higher-order reasoning for PGMs and constraint satisfaction. This technique would therefore be of benefit to any area that incorporates both these domains, such as: • classification and re-classification problems -e.g. image de-noising [1] and scene classification [7]; • image segmentation -e.g. breaking an image up into superpixels using pixel similarity and boundary constraints; and • hybrid reasoning, -e.g. solving Sudokus visually by combining a handwriting input classifier with constraint satisfaction.
PGMs are tools that express intricate problems with multiple dependencies as graphs. PGM inference techniques such as message passing can then be used to solve these graphs. PGMs are integral to a wide range of probabilistic problems [8] such as medical diagnosis and decision making [9], object recognition in computer vision [7], as well as speech recognition and natural-language processing [10].
Constraint satisfaction in turn is classically viewed as a graph search problem that falls under the umbrella of NPcomplete problems. It originated in the artificial intelligence (AI) literature of the 1970s, with early examples in Mackworth [11] and Laurière [12]. Broadly, a CSP consists of a set of variables X = {X 1 , X 2 , . . . , X N }, where each variable must be assigned a value such that a given set of constraints (clauses) C = {C 1 , C 2 , . . . , C M } are satisfied. Typical applications of CSPs include resource management and time scheduling [13], parity checking in error-correcting codes [14], and puzzle games such as Sudoku, Killer-Sudoku, Calcudoku, Kakuro, and Fill-a-pix [15].
Many advances have been made in solving highly constrained PGMs, i.e. PGMs with a large number of prohibited outcomes specified in their factors. This includes PGMs constructed from CSP clauses. A popular approach is to transform such a PGM to another domain and then to solve it with tools specific to that domain. This includes converting PGMs to Boolean satisfiability problems (such as conjugate normal form (CNF) [16]), sentential decision diagrams (SDD) [17], and arithmetic circuits (ACs) [18]. For example, the ACE system [18] compiles a factor graph or Bayesian network into a separate AC. This AC is then used to answer queries about the underlying variables. The drawback lies in the fact that an ACE query will yield a marginal for each queried variable, but it does not yield the joint distribution over all queried variables. ACE thus acts as a heuristic for selecting a single probable CSP outcome.
In contrast, our CSP solver can return a joint solution to a CSP problem (i.e. find its joint distribution) using PGM techniques.
Of course, there are trivial ways to reformulate CSPs probabilistically and express them as PGMs [2]- [5], [19], [20]. Although most of the above citations are aimed at specific CSPs, they share the same basic approach. This amounts to (a) formulating the CSP clauses into PGM factors, (b) configuring the factors into a PGM graph structure, (c) applying belief propagation on this graph, and (d) using the most probable outcome as the solution to the CSP. Dechter [6] provides a bridge between CSPs and PGMs by proving that zero-belief conclusions made by loopy-belief propagation reduce to an algorithm for generalised arc consistency in CSPs.
There are limitations, however. Goldberger [3] highlights the difference between belief propagation (BP) with maxproduct and sum-product. They report that although maxproduct BP ensures the solution is preserved at all times, it is often hidden within a large spectrum of possibilities. This calls for additional search techniques. Meanwhile, sumproduct BP acts as a heuristic to highlight a valid solution, but can often highlight an incorrect one. Khan [4] tries to improve on the success rate of sum-product BP by combining it with Sinkhorn balancing. Although they report an improvement, the system could still not reliably solve high-difficulty Sudoku puzzles. Streicher [5] use a sparse representation for factors and promote the use of a cluster graph over the ubiquitous factor graph. However, although the cluster graphs improved the accuracy and execution time of the system, their approach is not reliable as a Sudoku solver or CSP solver in general.
The above approaches are all limited in one way or another. They are either ineffective in purging redundant search space -due to their loopy PGM structure -or they rely on an unreliable heuristic to select a probable solution. In this work, we propose techniques to sidestep these limitations and iteratively nudge the graph towards a tree-structured PGM while preserving the CSP solution.
Our proposed technique employs the purge-and-merge algorithm. Purge-and-merge starts by constructing a CSP into a cluster graph PGM with sparse factors [5]. It then purges redundancies from these factors by applying max-product belief propagation [6] and thereby propagating zero-belief conclusions. Next, it merges factors together to create cluster graphs that are closer to a tree structure. Finally, it constructs a new cluster graph from the factors. This process is repeated until a tree-structure cluster graph is produced. At this point, the exact solution to the CSP is found.
Purge-and-merge manages to reliably solve CSPs that are too difficult for the aforementioned approaches. We reason that a successful CSP approach such as purge-and-merge opens many new avenues for exploration in the PGM field. This may include hybrid models where rigid and soft constraints can be mixed. It may also be used in domains not previously suited for probabilistic approaches.
Our study is outlined as follows: • In Section II we introduce CSP factors and show how they can be structured into a PGM. We provide the design and techniques to build and solve a basic constraintsatisfaction PGM. • In Section III we investigate the limitations of PGMs as well as the trade-offs between the loopy-structured PGMs of small-factor scopes and the tree-structured PGMs of large-factor scopes. • In Section IV we provide a factor clustering and merging routine along with the purging methods found in the purge-and-merge technique. • In Section V we evaluate purge-and-merge on a number of example CSPs such as Fill-a-pix and similar puzzles, and compare it to the ACE system [18]. We found that with the purge-and-merge technique, PGMs can solve highly complex CSPs. We therefore conclude that our approach is successful as a CSP solver, and suggest further investigation into integrating constraint-satisfaction PGMs as sub-components of more general PGMs.

II. CONSTRAINT SATISFACTION USING PGMS
In this section we show how CSPs are related to PGMs. We express CSPs as factors, which can be linked in a PGM structure. We use graph colouring as an example, and expand the idea to the broader class of CSPs.
Most constraint-satisfaction problems are easily defined and verified, but they can be difficult to invert and solve. PGMs, by contrast, are probabilistic reasoning tools used to resolve large-scale problems in a computationally feasible manner. They are often useful for problems that are difficult to approach algorithmically -CSPs being one such example.

A. A GENERAL DESCRIPTION OF CSPS
Constraint-satisfaction problems are NP-complete. They are of significant importance in operational research and they are key to a variety of combinatorial, scheduling, and optimisation problems.
In general, constraint satisfaction deals with a set of variables X = {X 1 , X 2 , . . . , X N } and a set of constraints C = {C 1 , C 2 , . . . , C M }. Each variable needs to be assigned a value from the variable's finite domain dom(X n ), such that all constraints are satisfied. For example, if X n represents a die roll, then a suitable domain would be {1, 2, 3, 4, 5, 6}. Furthermore, if we define a CSP where two die rolls, X 1 and X 2 , are constrained to sum to a value of 10, then the CSP solution (X 1 , X 2 ) consists of the possible value assignments (4, 6), (5,5), and (6,4).
The CSP constraints can be visualised through a factor graph. This is a bipartite graph where the CSP variables are represented by variable nodes (circles) and the CSP clauses by factor nodes (rectangles). The edges of the graph are drawn between factor nodes and variable nodes, such that each factor connects to all the variables in its scope. The scope of a factor is the set of all random variables related to that factor. To illustrate, we present the map-colouring example in Figure 1: • Figure 1(a) shows a map with bordering regions. These regions are to be coloured using only four colours such that no two bordering regions may have the same colour. • In Figure 1(b) we represent this map as a graphcolouring problem, where the regions are represented by nodes and the borders by edges. • Figure 1(c) shows a factor graph where the factors represent the CSP clauses. Note that each of these factors has a scope of two variables. • In Figure 1(d) we show that the problem can also be expressed equivalently by combining factors differently.
Here we have multiple constraints captured by a single clause. As a result, we have fewer factors, but larger factor scopes. (This example uses the maximal cliques in (b) as factors.)

B. FACTOR REPRESENTATION
A factor graph representation will only show the clauses, the variables, and the relationship between the clauses and variables. The details of these relationships, however, are suppressed. To fully represent the underlying CSP, each factor must also express the relationships implied by the associated constraint. We do so by assigning a potential function to each clause in order to encode all valid local assignments concerning that clause. These assignments are captured by sparse probability tables. The tables list each local possibility as a potential solution, and assign a value to that possibility. For CSPs specifically, we work with binary probabilities, ascribing "1" to any (valid) possibility and "0" to any impossibility enforced by the constraint. As an example, see Figure 2 for the sparse table representing the factor scope {A, C, D, F } from Figure 1  tables as PGM factors is also referred to as flattening [6].) It is worth noting that the factor graphs presented here are then not only a visually appealing representation for CSPs, but are in fact PGMs. As such, PGM inference techniques such as loopy belief propagation [19] and loopy belief update [1], [21] can be directly applied to these factor graphs.
In order to perform belief propagation using sparse tables, it is important to introduce some basic factor operations; most importantly, see the literature on factor multiplication, division, marginalisation, reduction, damping, and normalisation [1; Defs. 4.2, 10.7, 13.12, and 4.5; Eq. 11.14; and Ch. 4].

C. PGM CONSTRUCTION
In essence, a PGM is a compact representation of a probabilistic space as the product of smaller, conditionally inde-pendent distributions. When we apply a PGM to a specific problem, we need to (a) obtain factors to represent these distributions, (b) construct a graph from them, and (c) use inference on this graph. In this section we focus on graph construction.
A cluster graph is an undirected graph where: is associated with a separation set (sepset) S i,j ⊆ C i ∩ C j , and • the graph configuration satisfies the running intersection property (RIP) [1]. RIP requires that for all pairs of clusters containing a common variable, X ∈ C i and X ∈ C j , there must be a unique path of edges, (î,), between C i and C j such that X ∈ Sî , ∀ (î,). Figure 3 provides two examples of a cluster-graph configuration for the CSP clauses in Figure 1. In (a) we have a trivial connection called a Bethe graph. This is a cluster graph with univariate sepsets, an equivalent of the factor graph in Figure 1(d). In (b) we have a cluster graph with multivariate sepsets. This graph is generated from the same factors as the Bethe graph, but using the LTRIP algorithm [5]. The result is also referred to as a junction graph [6]. Streicher [5] shows that cluster graphs with multivariate sepsets have superior inference characteristics to factor graphs, both in terms of both speed and accuracy. The same is argued by Koller [1, Sec. 11.3.5.3], where it is shown that cluster graphs are a more general case of factor graphs without the limitation of passing messages only through univariate marginal distributions. With factor graphs, correlations between variables are lost during belief propagation, which can have a negative impact on the accuracy of the posterior distributions and on the number of messages required for convergence.
The LTRIP algorithm is designed to configure factors into a valid cluster graph by following the RIP constraints. For each variable X ∈ X , LTRIP builds a subgraph out of all clusters C i , where X ∈ C i . These subgraphs are then superimposed in order to construct the sepsets of the final graph. In summary, the algorithm states that for each variable X ∈ X , do the following: • find all clusters C i such that X ∈ C i , • construct a complete graph over clusters C i , • assign edge weights wî , to represent the similarity between neighbouring clusters, 1 • connect the graph into a minimum spanning tree by using an algorithm such as the Prim-Jarník algorithm [22], and • populate the tree's edges with intermediate sepset results S X ı, = {X}. After the sepset results are populated for each variable, the sepsets Sî , of the final graph are taken as the union of the intermediate sepset results Sî , = ∪ X∈X S X ı, . An example of the LTRIP algorithm for the graph in Figure 3(b) can be seen in Figure 4. . An example of applying the LTRIP algorithm in order to achieve the cluster-graph construction from Figure 3(b). For each variable A, B, C, D, E, F and G, a minimum spanning tree is constructed from its associated clusters and is populated with univariate sepsets. The resulting cluster graph is then created by taking the superposition of these intermediate trees.

D. PGM INFERENCE
Our PGM approach extends the prior work of Streicher [5], which in turn is informed by the work of Dechter [6]. The specific design choices for our PGM implementation are as follows: 1) The factors consist of sparse tables similar to those of Figure 2. 2) Graphs construction is done using the LTRIP procedure from Streicher [5]. 3) We use inference via belief update (BU) message passing, also known as the Lauritzen-Spiegelhalter algorithm [21]. 4) We use the Kullback-Leibler divergence as a comparative metric (and deviation/error metric) between distributions. 5) Message-passing schedules are set up according to residual belief propagation [23]. Messages are prioritised according to the deviation between a new mes-sage and the preceding message at the same location within the graph. 6) Convergence is reached when the largest message deviation falls below a chosen threshold. 7) Throughout the system, we use max-normalisation and max-marginalisation, as opposed to their summation equivalents. Dechter [6] proved that zero-belief conclusions made by loopy belief propagation are correct and equal to inducing arc consistency. This is true in the case of using both sum or max operations [3], [5], [6]. This means that the basic PGM approach does not guarantee a solution to the CSP, but it does guarantee that all possible solutions are preserved.
We found convergence to be faster with the max operations than with sum operations. Furthermore, the max operations maintain a unity potential for all non-zero table entries. This is in line with the constraint-satisfaction perspective, where outcomes are either possible or impossible. Alternatively, if one is interested in a more dynamic distribution, the sum operations provide varying potentials that can be used as likelihood estimations [3].
Lastly, note that alternative message-passing techniques such as warning propagation and survey propagation [24] are available. These two approaches attempt to elevate the solution from the problem space, but cannot guarantee that the solution is retained. Our interest is in pursuing an approach where the full solution space is preserved.

III. THE LIMITATIONS OF PGMS
One of the main limitations of constructing CSP potential functions is the resources required to encode them. If the potential functions are encoded as probability tables, then at least all the non-zero potentials need to be listed. Such a list can grow exponentially with the number of factor variables. Therefore, not all CSPs are suitable to be expressed as sparse tables. A trivial example of an ill-suited problem would be a graph-colouring problem with n fully connected nodes, as in Figure 5. The full space of the problem is n n with n! entries in the probability table.
Inference on loopy graphs is non-exact; it cannot guarantee a complete reduction to the solution space of a CSP [6]. In exchange, however, loopy graphs provide a great advantage: the ability to handle problems that would have required infeasibly large probability tables if constructed into treestructured PGMs.
Consider the Sudoku puzzle. A player is presented with a 9 × 9 grid (with 91 variables) where each variable may be assigned a value of "1" to "9" and is constrained by the following 29 clauses: each row, each column, and each 3x3 non-overlapping subgrid may not contain any duplicates. Furthermore, a valid puzzle is partially filled with values such that only one solution exists. In Figure 6 we show the Sudoku-puzzle constraints as a graph-colouring problem.
Since a valid Sudoku should only have a single solution, the posterior distribution, p(X 1 , X 2 , . . . , X 81 ), can be expressed by a sparse table that covers the full variable scope  and contains only a single entry. Each marginal distribution, p(X 1 , . . . , X 9 ), p(X 10 , . . . , X 18 ), . . . , would then also hold only a single table entry. Yet, after the pre-filled values are observed and each factor is set up according to its local constraint (the no-duplicated rule), the prior distributions can result in tables that have as many as 9! entries. Therefore, a Sudoku solver would have to reduce these large initial probability tables into single-entry posteriors. Before we attempt such a solver, let us first consider two cases -(1) a loopy structure with small-factor scopes, and (2) a tree structure with large-factor scopes: 1) For model (1), we build a loopy-structured PGM directly from the prior distributions. Each factor, therefore, has the same variable scope as one of the clauses.
For an inference attempt to be successful, factors should pass information around until all sparse tables are reduced to single entries. In practice, however, the sparse tables are often reduced by very little. This is because inference on a loopy structure does not guarantee convergence to the final solution space [6]. 2) For model (2), we use the most trivial tree structure: multiply all factors together to form a structure with a single node. The resulting factor will now contain one single table entry as the solution. This approach will often (or rather usually) fail in practice, since we cannot escape exponential blow-up. In the process of multiplying factors together, the intermediate probability tables first grow exponentially large before the system settles on this single-entry solution.
Since we are confronted by the limitations of both smalland large-factor scopes, we propose a technique in the next section that mitigates these limitations.

IV. PURGE-AND-MERGE
In this section, we consider the various methods for purging factors and merging factors, and combine these methods into a technique called purge-and-merge. It concludes with a detailed outline of the technique in Algorithm 2.

A. FACTOR MERGING
Our aim in merging factors is to build tree-structured PGMs and to be able to perform exact inference. This can result in exponentially larger probability tables, so it is necessary to approach this problem carefully.
One approach is to cluster the factors into subsets that will merge to reasonably sized tables. To pre-calculate the table size of a factor product is, unfortunately, as memoryinefficient as performing the actual product operation. While we therefore cannot use exact table size as a clustering metric, we have investigated three alternative metrics. We propose: (1) variable overlap, as in the number of overlapping variables between factors, (2) an upper-bound shared entropy metric, and (3) a gravity analogy that is built on entropy metrics. These methods are experimentally tested in Section V (Figure 7), with the gravity analogy showing the most potential.

1) Variable overlap
For variable overlap, we define the attraction between f i and f j as with X i and X j the scopes of f i and f j respectively. Note that there is a symmetrical relationship in the sense that the attraction of f j towards f i can be defined as a i←j = a i,j = a j←i .

2) Upper-bound shared entropy
Upper-bound shared entropy is proposed as an alternative metric to variable overlap. The definition for the entropy of a set of variables X is with a maximum upper bound achieved at the point where the distribution over X is uniform. This upper bound is calculated asĤ We use this definition to define the attraction between clusters i and j as the upper-bound entropy of the variables they share: with symmetrical behaviour a i,j = a i←j = a j←i . Note that maximal entropy is used as a computationally convenient proxy for entropy, since calculating shared entropy directly is as expensive as applying factor product.

3) Gravity method
For the gravity method, we use gravitational pull as an analogy for attraction: The idea is to relate mass m i to how informed a factor is about its scope, and distance r i,j to concepts regarding shared entropy: Pseudo-mass: Mass equation m(f ) = m is based on how informed factor f is about its scope X . To parse this into a calculable metric we use the Kullback-Leibler divergence of the distribution of f compared to a uniform distribution over X .
Pseudo-distance: As a distance metric, we want to register two factors as close together if they have a large overlap and far apart if they have little overlap. We also do not want this metric to be influenced by a factor's size or mass.
We arrived at a metric using the entropy of the joint distribution, normalised by the entropy of the variables shared between the factors. By using upper-bound entropy in our calculations, we arrive at distance Attraction: Finally, we define the attraction of f j towards f i as analogous to acceleration Using the above metrics, we formulate a procedure for clustering our factors according to the mergeability between factors, as shown in Algorithm 1. Although the algorithm is specialized for the gravity method, it can easily be adjusted for different attraction metrics.
Via this procedure, we can cluster factors f 1 , f 2 , . . . , f n into clusters C 1 , . . . , C m , where m ≤ n. These clusters can then be incorporated into a PGM by calculating new PGM factors f 1 , . . . , f m , by simply merging each cluster f i = fj ∈Ci f j .

B. FACTOR PURGING
In this section, we show some methods for purging the probability tables of a constraint-satisfaction PGM. We use the inference techniques from Section II-D along with some additional purging techniques: a j←i := m j /r(X j , X i ) 9: // Dynamically merge clusters together 10: while any a i←j are still available do 11:î, := argmin i,j (a i←j ) 12: ifĤ(Xî ∪ X) ≤Ĥ τ then 13: Cî := Cî ∪ C

Reducing variables:
If for any factor f j , a variable X is uniquely determined to be x i , i.e. there are no non-zero potentials with X = x i in that factor, then observe X=x i throughout all factors and remove X from their scopes. This is a trivial case of "node-consistency" [25].
Reducing domains: Likewise, if any domain entry x i ∈ dom(X) is not represented by factor f j with X in its scope, i.e. having p(X=x i |f j ) = 0 for any factor, remove x i from the domain dom(X) in all factors, and remove all probability table entries from the system that allows for X=x i .
Propagating local redundancies: For any two factors, f i and f j , which have common variables, say {A, B, . . .}, any zero outcome in f i , i.e. p(A=a, B=b, . . . |f i ) = 0, should also be zero for f j , i.e. as p(A=a, B=b, . . . |f j ) = 0. The PGM inference from Section II-D is, in fact, an algorithm to enforce this relationship, as Dechter [6] proved this to be an algorithm for generalised arc consistency.
We can now combine these techniques along with our merging techniques to build a PGM-based CSP solver.

C. THE PURGE-AND-MERGE PROCEDURE
Having outlined all the building blocks needed for purgeand-merge, we can now describe the overall concept in more detail. We start our model with factors of small-variable scopes by using the CSP clauses directly. We then incrementally transition towards a model with larger-factor scopes by clustering and merging factors. More specifically, we start with a PGM of low-factor scopes, purge redundancies from this model, progress to a model of larger-factor scopes, and purge some more redundancies. We continue this process until our PGM is tree structured and thus yields an exact solution to the CSP.
This incremental-factor growth procedure dampens the exponential blow-up of the probability tables and allows the model to incrementally reduce the problem space. The full procedure is outlined in Algorithm 2. X s := X s ∪ X // add solved variables 15: if G is a tree structure then 16: return X s , F

D. ALGORITHMIC CONSISTENCY
As a final reflection on purge-and-merge, we state that all steps taken by this algorithm are correct and result in a consistent CSP solver. Constraint satisfaction falls in the problem class of NP-complete [25], and any general CSP solvers such as purge-and-merge must therefore be at least NP-complete. Dechter [6] proves that loopy-belief propagation performed on cluster graphs with flattened tables and hard constraints reduces to generalised arc consistency. They also prove that zero-belief conclusions converge within O(n · t) iterations of loopy-belief propagation and that the algorithm results in a complexity of O(r 2 t 2 log t) -where n is the number of nodes in the cluster graph and t bounds the size of each sparse table. These metrics are, however, not particularly useful for purge-and-merge, since the values n and t are expected to change as the algorithm progresses.
The merging of factors is exponential in time complexity, with some merge orders performing better than others [1].
Finding an optimal merge order is unfortunately an NPcomplete problem and is as difficult as the actual inference [1]. Some algorithms such as variable elimination can aid in this process. With variable elimination, the merging order is determined according to the marginalisation of each variable from the factors and the predicted effect it will have on the system as a whole [1]. It should be noted that factor multiplication is commutative and any merge order converges to the same solution.
Since purge-and-merge is a combination of belief propagation and factor merging, the full algorithm is bounded by factor merging and is thus also NP-complete. Purge-and-merge does, however, mitigate between loopy belief propagation and factor merging with the aim of preventing exponential blow-up in the factor-merging process.
In the next section we investigate the performance of this procedure by applying it to a large number of CSP puzzles.

V. EXPERIMENTAL STUDY OF PURGE-AND-MERGE
In this section we investigate the reliability of the purge-andmerge technique by solving a large number of constraintsatisfaction puzzles. To compare results, we include Sudoku datasets used in other constraint-satisfaction PGM reports [2]- [5], [19], [20], as well as the most difficult Sudoku puzzles currently available [26].

A. PUZZLE DATA SETS
The Sudoku community has developed a large database of the hardest 9x9 puzzles [26] known to literature. The result is a unique collaborative research effort, spanning over a decade, using the widely accepted criterion of the Sudoku explainer rating (SER). This rating is applied by solving a puzzle using an ordered set of 96 rules ranked according to complexity. From the combination of rules required to solve the puzzle, the most difficult one of these rules is used as a hardness measure. The validity of SER as a difficulty rating is discussed by Berthier [15]. They found that the SER rating highly correlates to external pure-logic ratings and can thus be used as a proxy for puzzle complexity. They do note that SER is not invariant to puzzle isomorphisms, i.e. two puzzles from the same validity-preserving group [27] can result in two different ratings.
In addition to the above set, we also compiled a database of constraint puzzles from various other sources to be used as tests. We verified each puzzle to have valid constraints using either PicoSat [28] or Google's OR-Tools [29]. All puzzles are available on GitHub [30], and were sourced as follows: • 1000 Sudoku samples from the SER-rated set of the most difficult Sudoku puzzles, curated by Champagne [26], • all 95 Sudokus from the Sterten set [31] used in Streicher [5] and in Khan [4], • all 49151 Sudokus with 17-entries from the Royle's 2010 set [32] (note that an older subset of roughly 350000 puzzles was available to Goldberg [3] and Bauke [2]), • 10000 Killer Sudokus from www.krazydad.com (labelled according to five difficulty levels). • 4597 Calcudokus of size 9×9 from www.menneske.no, • 6360 Kakuro puzzles from www.grandgames.net, • 2340 Fill-a-Pix puzzles from www.grandgames.net, and • a mixed set of fairly high difficulty, containing one of each of the above puzzle types.

B. CLUSTERING METRICS
Section IV-A listed three metrics for the purge-and-merge procedure, namely (1) variable overlap, (2) upper-bound shared entropy, and (3) the gravity method. In order to select a well-adapted clustering method, we compared these three metrics on the Champagne data set. Our approach was to allow purge-and-merge to run for 10s under the different clustering conditions, and to then report on the largest table size for that run. Under the naive variable overlap metric, none of the puzzles came to convergence. When investigating further and rerunning the first 10 puzzles without time restriction, they all ran out of physical memory. This is not surprising, as this metric does not account for the domain sizes of the variables, which can have a considerable impact on table size.
Of the remaining metrics, the gravity method had a 100% convergence rating within the 10s threshold, whereas upperbound shared entropy had a 53% rating. Compared to upperbound shared entropy, the gravity method also resulted in a smaller maximum table size in 74.7% of cases. A histogram representing the maximum table size for each run can be seen in Figure 7. Since the gravity method performed better than the other metrics, we opted to use it in all further purge-and-merge processes.

C. PURGE-AND-MERGE
All tests were executed single-threaded on an Intel ® Core TM i7-3770K, with rating 3.50GHz and 4 cores / 8 threads in total. The purge-and-merge algorithm is available on GitHub [30] as a Linux binary with a command-line interface for running any of the puzzles used in our tests. Purge-and-merge can solve all the Sudoku, Killer Sudoku, Kakuro and Fill-a-Pix puzzles we have provided. In the case of Calcudoku, 1.4% of the puzzle instances reached the machine's physical RAM limitation of 32Gb. This indicates that purge-and-merge deals better with large numbers of small factors, as with Kakuro, rather than a small or medium number of large factors, as with Calcudoku and Killer Sudoku. Runtime metrics for the various puzzles can be seen in Figure 8.
Compared to the other available PGM approaches, purgeand-merge is the only method to achieve a 100% success rate with all the Sudoku puzzles it encountered. Moreover, purgeand-merge was tested on more cases than what is reported in any of the comparable literature.
If we compare purge-and-merge to Streicher's [5] results for the Sterten [31] set, purge-and-merge is slower (see Figure 8). However, the approach by Streicher [5] is equivalent to a single "purge" step in purge-and-merge. The full purgeand-merge method obtained a success rate of 100%, whereas Streicher [5] reported a success rate of only 36.8%.
In comparison with the other Sudoku PGM literature, Streicher [5] is the only other PGM approach that ensures the full CSP solution space is preserved (i.e. no valid solutions are lost). Additionally, purge-and-merge allows the scope of the solver to increase up to the point where only the solution space is left.  [2]. Note that this applies to puzzles far easier than our expansive set, which also includes the current most difficult Sudoku set [26].
The solution space is not preserved in the PGM approaches of Khan [4], Goldberger [3] and Bauke [2]; instead, they use sum-product BP to seek out a single likely solution from the problem space. Khan [4] provides us with a comparison between these three approaches, as shown in Table 1. From this table, it is clear that these reported PGM approaches are not well suited for Sudoku puzzles of medium and higher difficulty.

VI. COMPARISON TO THE ACE SYSTEM
The ACE system [18] is a related system for solving constraint-satisfaction PGMs. ACE works by compiling Bayesian networks and other factor-graph representations into an arithmetic circuit, which can then be used to answer queries about the input variables.
ACE focuses on the marginals of the variables of the system and not on finding the joint distribution of the system. This distinction is important -purge-and-merge produces all the solutions to a CSP, whereas ACE will only report on the marginal of each variable.
To illustrate, if we take the first 10 puzzles from the Champagne data set and arbitrarily remove a known entry, purgeand-merge finds 426, 380, 917, 799, 77, 476, 454, 1754, 777 and 796 answers for each puzzle respectively. ACE, on the other hand, only reports on the domain of each unknown variable and is therefore not able to find any valid solution.
ACE approaches the problem in two stages. It first compiles a network along with its unknown variables into an arithmetic circuit. Then it uses the compiled network to answer multiple queries with respect to the unknown variables. Note that a single ACE circuit to represent all Sudoku puzzles is too large to fit in 32GB of memory, due to the large number of possible solutions ≈ 6.67 × 10 21 [33].
To compare ACE with purge-and-merge, we parsed all the Champagne puzzles into ACE-compliant structure, and then compiled each puzzle into an ACE circuit. In plotting the result in Figure 9, we discarded the loading and query times, since all evidence was already incorporated into the structure.

VII. CONCLUSION AND FUTURE WORK
In general, the factors in a PGM can be linked up in different ways, resulting in different graph topologies. If such a graph is tree-structured, inference will be exact. However, more often than not the graph structure will be loopy, which results in inexact inference [1]. Transforming a loopy graph into a tree structure, unfortunately, is not always feasible -in all but the simplest cases the resultant hyper-nodes will exponentially blow up to impractical sizes. Hence we are usually forced to work with message passing on a loopy graph structure.
The ubiquitous factor graph is the structure most frequently encountered in the literature -its popularity presumably stems from its simple construction. Previous work has shown that inference on factor graphs is often inferior to what can be obtained with more advanced graph structures such as cluster graphs [5]. Nodes in cluster graphs typically exchange information about multiple random variables, whereas a factor graph is limited to sending only messages concerned with single random variables [1, p406]. The LTRIP algorithm [5] enables the automatic construction of valid cluster graphs. Despite their greater potency, however, they might still be too limited to cope with complicated relationships [5].
In our current work, we extend the power of cluster graphs by dynamically reshaping the graph structure as the inference procedure progresses. Semantic constraints discovered by the inference procedure reduce the entropy of some factors. Factors with high mutual attraction can then be merged without necessarily suffering an exponential growth in factor size. The LTRIP algorithm reconfigures a new structure that becomes progressively more sparse over time. When the graph structure morphs into a tree structure, the process stops with an exact solution. We refer to this whole process as purge-and-merge.
Purge-and-merge is especially useful in tasks that, despite an initially huge state space, ultimately have a small number of solutions. By hiding zero-belief conclusions from memory, purge-and-merge can perform calculations on sub spaces within an exponentially large state space.
The purge-and-merge approach is not suited to tasks where the number of valid solutions would not fit into memory, as this would preclude a sufficient reduction in factor entropy. However, as the above results show, purge-and-merge enabled us to solve a wide range of problems that were previously beyond the scope of PGM-based approaches.
In comparison with ACE, we find purge-and-merge more suited to constraint-satisfaction problems with multiple solutions, as well as puzzles with a problem space too large to be compiled into a single ACE network.
Our current approach relies on the increased sparsity of the resultant graphs to gradually nudge the system towards a tree structure. In future work, we intend to control that process more actively. This should result in further gains in efficiency, and it is our hope that it will conquer the couple of Calcudokus that still elude us.