Replicable Parallel Branch and Bound Search

Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search order heuristics and lead to unexpected and highly variable parallel performance. The variability makes it hard to reason about the parallel performance of combinatorial searches. This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable, giving slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100x for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median Relative Standard Deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack.


Introduction
Branch and bound backtracking searches are a widely used class of algorithms. They are often applied to solve a range of NP-hard optimisation problems such as integer and non-linear programming problems; important applications include frequency planning in cellular networks and resource scheduling, e.g. assigning deliveries to routes [26].
A good search order can improve the performance of branch and bound dramatically by finding a good incumbent early on, and highly optimised sequential algorithms following the branch and bound paradigm often rely on very specific orders for performance.
Branch and bound algorithms are hard to parallelise for a number of reasons. Firstly, while branching creates opportunities for speculative parallelism, where multiple workers (i.e. threads or processors) search particular branches in parallel, pruning counteracts this, limiting potential parallelism. Secondly, parallel pruning requires that processors share access to the incumbent, which limits scalability. Thirdly, parallel exploration of irregularly shaped search trees generates unpredictable numbers of parallel tasks, of highly variable duration, posing challenges for task scheduling. Finally, and most importantly, parallel exploration alters the search order, potentially impacting the effectiveness of pruning.
As a result of the last point in particular, parallel branch and bound searches can exhibit unusual performance characteristics. For instance, slowdowns can arise when the sequential search finds an optimal incumbent quickly but the parallel search delays exploring the optimal branch. Alternatively, super-linear speedups are possible when the parallel search happens upon an optimal branch that the sequential search would not explore until much later. In short, the perturbation of the search order caused by adding processors makes it impossible to predict parallel performance.
These unusual performance characteristics make reproducible algorithmic research into combinatorial search difficult: was it the new heuristic that improved performance, or were we just lucky with the search ordering in this instance? As the instances we wish to tackle become larger, parallelism is becoming central to algorithmic research, and it is essential to be able to reason about parallel performance. This paper aims to develop a generic parallel branch and bound search for distributed memory architectures such as clusters. Crucially, the objective is predictable parallel performance, and the key to achieving this is careful control of the parallel search order.
The paper starts by illustrating performance anomalies with parallel branch and bound by using a Maximum Clique graph search. The paper then makes the following research contributions:
• To address search order related performance anomalies, Section 2 postulates three parallel search properties for replicable performance, as follows.
Sequential Bound: Parallel runtime is never higher than sequential (one worker) runtime.
Non-increasing Runtimes: Parallel runtime does not increase as the number of workers increases.
Repeatability: Parallel runtimes of repeated searches on the same parallel configuration have low variance.
• We define a novel formal model for general parallel branch and bound backtracking search problems (BBM) that specifies both search order and parallel reduction (Section 3). We show the generality of BBM by using it to define three different benchmarks with a range of application areas: Maximum Clique (Section 3), 0/1 Knapsack (Appendix B) and Travelling Salesperson (Appendix D).
• We define a new Generic Branch and Bound (GBB) search API that conforms to the BBM (Section 4). The generality of the GBB is shown by using it to implement Maximum Clique (Section 2), 0/1 Knapsack (Appendix C) and Travelling Salesperson (Appendix E). The Maximum Clique implementation is the first distributed-memory parallel implementation of San Segundo's bit-parallel Maximum Clique algorithm (BBMC) [52].
• To avoid the significant engineering effort required to produce a parallel implementation for each search algorithm we encapsulate the search behaviours as a pair of algorithmic skeletons, that is, as generic polymorphic computation patterns [12], providing distributed memory implementations for the skeletons (Section 5). Both skeletons share the same API yet differ in how they schedule parallel tasks. The Unordered skeleton relies on random work stealing, a tried and tested way to scale irregular task-parallel computations. In contrast, the Ordered skeleton schedules tasks in an ordered fashion, closely following the sequential search order, so as to guarantee the parallel search properties.
• We compare the sequential performance of the skeletons with a class leading hand tuned C++ search implementation, seeing slowdown factors of between 1.9 and 6.2. We then assess whether the Ordered skeleton preserves the parallel search properties using 40 instances of the three benchmark searches on a cluster with 17 hosts and 200 workers (Section 7). The Ordered skeleton preserves all three properties and produces replicable results. The key results are summarised and discussed in Section 8.

The challenges of parallel branch and bound search
We start by considering a branch and bound search application, namely finding the largest clique within a graph. The Maximum Clique problem appears as part of many applications such as in bioinformatics [16], in biochemistry [9,15,18,24], for community detection [66], for document clustering [41], in computer vision, electrical engineering and communications [8], for image comparison [53], as an intermediate step in maximum common subgraph and graph edit distance problems [34], and for controlling flying robots [48].
To illustrate the Maximum Clique problem we use the example graph in Fig. 1. In practice the graphs searched are much larger, having hundreds or thousands of vertices. A clique within a graph is a set of vertices where each vertex in the set is adjacent to every other vertex in the set. For example, in Fig. 1 the set V = {a, b, c} is a clique as all vertices are adjacent to one another. {a, b, h} is not a clique as there is no edge between b and h. In the Maximum Clique problem we wish to find a largest clique (there may be multiple of the same size) in the graph. Here we are interested in the exact solution requiring the full search space to be explored.
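The clique property described above is easy to state executably. The following is a minimal sketch; the edge-set representation and the names `Graph`, `adjacent` and `isClique` are illustrative, not part of the paper's API:

```haskell
import qualified Data.Set as Set

-- A graph as a set of undirected edges; 'adjacent' looks an edge up in
-- either direction. This representation is for illustration only.
type Vertex = Char
type Graph  = Set.Set (Vertex, Vertex)

adjacent :: Graph -> Vertex -> Vertex -> Bool
adjacent g u v = Set.member (u, v) g || Set.member (v, u) g

-- A set of vertices is a clique iff every pair of distinct vertices
-- in the set is adjacent.
isClique :: Graph -> [Vertex] -> Bool
isClique g vs = and [ adjacent g u v | u <- vs, v <- vs, u /= v ]
```

On the example of Fig. 1, `isClique` holds for {a, b, c} but fails for {a, b, h}, since b and h are not adjacent.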
One approach to solving this problem would be to enumerate the power set of vertices and check the clique property on each (ordering by largest set). While this approach can work for smaller graphs, the number of combinations grows exponentially with the number of vertices in the graph, making it computationally infeasible for large graphs.
A better approach, particularly for larger graphs, is to only generate sets of vertices that maintain the clique property. This is the essence of the branching function. In the case of clique search, given any set of vertices, the set of candidate choices is the set of vertices adjacent to all vertices in the current clique. Once there are no valid branching choices left we can record the size of the clique and backtrack.
Finally, we can go one step further with the addition of bounding. The idea of bounding is that a current best result, known as the incumbent, is maintained. For Maximum Clique this corresponds to the size of the largest clique seen so far. At each step we determine, using a bounding function, whether or not the current selection of vertices and those remaining could possibly unseat the incumbent and if it is impossible then backtracking can occur, reducing the size of the search space. For the Maximum Clique example the maximum size, given a current clique, may be estimated using a greedy colouring algorithm: clearly, if we can colour the remaining vertices using k colours (giving adjacent vertices different colours), then the current clique cannot be grown by more than k vertices.
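As an illustration of the colour bound, a naive list-based greedy colouring can be sketched as follows. This is a simplification of the bitset colouring used by the algorithms discussed below, and all names are our own; the neighbourhood function `AdjList` is an assumed representation:

```haskell
import Data.List (foldl')

type Vertex  = Int
type AdjList = Vertex -> [Vertex]   -- neighbourhood function (assumed)

-- Greedy sequential colouring of the candidate vertices: each vertex
-- joins the first colour class containing none of its neighbours, or
-- opens a new class. The number of classes used bounds how many
-- vertices can still be added to the current clique.
colourBound :: AdjList -> [Vertex] -> Int
colourBound adj = length . foldl' insert []
  where
    insert classes v =
      case break (all (`notElem` adj v)) classes of
        (pre, c:post) -> pre ++ ((v : c) : post)  -- reuse first feasible class
        (pre, [])     -> pre ++ [[v]]             -- open a new colour class

-- Backtrack if current clique size plus the colour bound cannot beat
-- the incumbent size.
canPrune :: AdjList -> Int -> Int -> [Vertex] -> Bool
canPrune adj incumbent cliqueSize cands =
  cliqueSize + colourBound adj cands <= incumbent
```

For a triangle the bound is 3 (each vertex needs its own colour), while for a three-vertex path two colours suffice, correctly signalling that the clique can grow by at most 2.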
Practical algorithms for the Maximum Clique problem were the subject of the second DIMACS implementation challenge in 1993 [22]. In 2012, Prosser [47] performed a computational study of exact maximum clique algorithms, focusing on a series of algorithms using a colour bound [58][59][60], together with bit-parallel variants [52,55] that represent adjacency lists using bitsets to gain increased performance via vectorised instructions. Since then, ongoing research has looked at variations on these algorithms, including reordering colour classes [36], reusing colourings [40], treating certain vertices specially [56], and giving stronger (but more expensive) bounding using rules based upon MaxSAT inference between colour classes [27,28,54]. (A recent broader review [64] considers both heuristic and exact algorithms.) There have been three thread-parallel implementations of these algorithms [15,35,37], the most recent of which makes use of detailed inside-search measurements to explain why parallelism works, and how to improve it. These studies have been limited to multicore systems. A fourth study [65] attempted to use MapReduce on a similar algorithm, but only presented speedup results on three of the standard DIMACS instances, all of which possess special properties which make parallelism unusually simple [37].
For simplicity this paper uses a bit-parallel variant of the MCSa1 algorithm [47], which is BBMC [52] with a simpler initial vertex ordering. Crucially, the algorithm is not straightforward: unlike the naïve and overly simplistic algorithms typically used to demonstrate skeletons, it is both close to the state of the art and a realistic reflection of modern practical algorithms.

General branch and bound search
Although we introduced branch and bound search in relation to the Maximum Clique problem, it has much wider applications. It is commonly seen for global optimisation problems [39] where some property is either maximised or minimised within a general search space. Two other examples where branch and bound search may be used are given in Sections 6.1 and 6.2.
The details and descriptions of these algorithms vary and we take a unifying view using terminology from constraint programming. In general, a constraint satisfaction or optimisation problem has a set of variables, each with a domain of values. The goal is to give each variable one of the values from its domain, whilst respecting all of a set of constraints that restrict certain combinations of assignments. In the case of optimisation problems, we seek the best legal assignment, as determined by some objective function.
Such problems may be solved by some kind of backtracking search. Branch and bound is a particular kind of backtracking search algorithm for optimisation problems, where the best solution found so far (the incumbent) is remembered, and is used to prune portions of the search space based upon an over-estimate (the bound function) of the best possible solution within an unexplored portion of the search space.
For example, when searching for a Maximum Clique (a subset of vertices, where every vertex in the set is adjacent to every other in the set) in a graph, we have a ''true or false'' variable for each vertex, with true meaning ''in the clique''. We may branch on whether or not to include any given vertex, reject any undecided vertices that are not adjacent to the vertex we just accepted, and then bound the remaining search space using the colour bound mentioned above.
In practice, selecting a good branching rule makes a huge difference. We must select a variable, and then decide the value to assign it first. There are good general principles for variable selection, but value ordering tends to be more difficult in practice.

Parallelisation and search anomalies
Search algorithms have strong dependencies: before we can evaluate a subtree, we need to know the value of the incumbent from all the preceding subtrees so we can determine if the bound can eliminate some work. Parallelism in these algorithms is speculative as it ignores the dependencies and creates tasks to explore subtrees in parallel. This approach can lead to anomalous performance, specifically:
1. When subtrees are explored in parallel some work may be wasted, since we might be exploring a subtree that would have been pruned in a sequential run by a stronger incumbent. As the parallel version is performing more work than the sequential version, its runtime may exceed that of the sequential version.
2. Conversely, it may be that a parallel task finds a strong incumbent more quickly than in the sequential execution, leading to less work being done. In this case we observe superlinear speedups.
3. An absolute slowdown, where the parallel version runs exponentially slower than a sequential run. This can happen if introducing parallelism alters the search order, leading to it taking longer for a strong incumbent to be found.
The theoretical conditions under which these three anomalies can occur are well understood [14,25,29,61]. In particular, it is possible to guarantee that absolute slowdowns will never happen, by requiring parallel search strategies to enforce certain properties [14].

Implementation challenges
The most obvious complicating factor when parallelising a branch and bound search tree is irregularity: it is extremely hard to decompose the problem up-front to do static work allocation, since some subproblems are exponentially more complicated than others.
To deal with irregular subproblems efficiently we require a form of dynamic load balancing that can re-assign problems to cores as they become idle. A common approach to dynamic load balancing in parallel search [42] (and general parallelism) is through work stealing: we start with a sequential search, but allow additional workers to ''steal'' portions of the search space and explore them in parallel. Popular off-the-shelf work stealing systems commonly employ a randomised stealing strategy, which has good theoretical properties [7].
Surprisingly, though, irregularity is not the most complex factor when parallelising these algorithms. Although non-linear speedups are called anomalies in the literature, anomalous behaviour is actually extremely common when starting with strong sequential algorithms, to the extent that if a linear speedup is reported, we should be suspicious as to why. Although such behaviour is relatively uncommon with small numbers of cores, e.g. four cores, our experience [37] is that as we start working in the 32 to 64 core range, anomalies often become the dominating factor in the results. We expect that as core counts increase, such factors will become even more important.
From an implementation perspective, anomalies cause serious complications, with inconsistent and hard-to-understand speedup results being common. Randomised work stealing schemes further complicate matters and recent research [11,37,38] has demonstrated a connection between value-ordering heuristic behaviour [20] and parallel work splitting strategies that explains anomalous behaviour. We now understand why randomised work stealing behaves so erratically in practice in these settings: it interacts poorly with carefully designed search order strategies [37]. For consistently strong results, we cannot think of parallelism independently of the underlying algorithm, and must instead use work stealing to explicitly offset the weakest value ordering heuristic behaviour. For this reason, the best results for parallel Maximum Clique algorithms currently come from handcrafted and complex work distribution mechanisms requiring extremely intrusive modifications to algorithms. It is not surprising that these implementations are currently restricted to a single multi-core machine.
To conduct replicable parallel branch and bound research it is essential to avoid these anomalies. To do so we propose that parallel branch and bound search implementations should meet the following properties. 2

Sequential Bound: Parallel runtime is never higher than sequential (one worker) runtime.
Non-increasing Runtimes: Parallel runtime does not increase as the number of workers increases.
Repeatability: Parallel runtimes of repeated searches on the same parallel configuration have low variance.
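The Repeatability property is quantified in the evaluation via the Relative Standard Deviation (RSD) of repeated runtimes. For reference, a minimal sketch of this measure (using the population standard deviation, which may differ from the exact estimator used in the paper's evaluation):

```haskell
-- Relative standard deviation (RSD) of a sample of repeated runtimes,
-- expressed as a percentage of the mean. Low RSD across repeated runs
-- on a fixed parallel configuration indicates Repeatability.
rsd :: [Double] -> Double
rsd xs = 100 * sqrt variance / mean
  where
    n        = fromIntegral (length xs)
    mean     = sum xs / n
    variance = sum [ (x - mean) ^ (2 :: Int) | x <- xs ] / n
```

For example, runtimes of 9s and 11s have a mean of 10s and an RSD of 10%, while identical runtimes give an RSD of 0%.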
Engineering a parallel implementation that ensures these properties for each search algorithm is non-trivial, and hence in Section 5 we develop generic algorithmic branch and bound skeletons, which greatly simplify the implementation of parallel searches.

A formal model of tree traversals
This section formalises parallel backtracking traversal of search trees with pruning, modelling the behaviour of a multi-threaded branch-and-bound algorithm in the reduction style of operational semantics. This formal model, for brevity referred to as BBM, admits reasoning about the effects of parallel reductions, in particular how parallelism affects the potential to prune the search space.
Reduction-based operational semantics of algorithmic skeletons has been studied previously [3] for standard stateless skeletons like pipelines and maps. BBM does not fit this stateless framework since branch and bound skeletons maintain state in the form of a globally shared incumbent. There are several theoretical analyses of parallel branch and bound search [6], often specific to a particular search algorithm. BBM is novel in encoding generic branch and bound searches as a set of parallel reduction rules.

Modelling trees and tree traversals
In practice, search trees are implicit. They are not materialised as data structures in memory but traversed in a specific order, for instance depth-first. In contrast, for the purpose of this formalisation we assume the search tree is fully materialised. This is not a restriction: the search tree is typically produced by a tree generator, and in practice the tree generator is interleaved with the tree traversal, avoiding the need to materialise the search tree in memory.

2 We are interested in parallel searches that meet or fail to meet these properties due to search order effects. We ignore resource related effects such as problem size being too small or massive oversubscription.
We formalise trees as prefix-closed sets of words. To this end, we introduce some notation. Let X be a non-empty set. By 2 X , we denote the power set of X . We denote the set of finite words over alphabet X by X * , and the empty word by ϵ. We write |w| to denote the length of a word w ∈ X * .
We denote the prefix order on X * by ⪯. By ≤ lex , we denote the lexicographic extension of the natural order ≤ on N to N * . Note that ≤ lex is an extension of the prefix order ⪯, that is, words in N * that are prefix-ordered are also ordered lexicographically.

Trees.
A tree T over alphabet X is a non-empty subset of X * such that there is a least (w. r. t. the prefix-order) element u ∈ T , and T is prefix-closed above u. Formally, T is prefix-closed above u if for all v, w ∈ X * , u ⪯ v ⪯ w and w ∈ T implies v ∈ T . When X and u are understood, we will simply call T a tree. We call the elements of T vertices. We call the least element u ∈ T the root; and we call v ∈ T a leaf if it is maximal w. r. t. the prefix order, that is, if there is no w ∈ T with v ≺ w. We call two distinct vertices w, w ′ ∈ T siblings if there are v ∈ X * and a, a ′ ∈ X such that w = va and w ′ = va ′ . We call a function g : X * → 2 X a tree generator. Given such a tree generator g, we define t g as the smallest subset of X * that contains ϵ and is closed under g in the following sense: For all u ∈ t g and all a ∈ g(u), ua ∈ t g . Clearly, t g is a tree with root ϵ, the tree generated by g.
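These definitions translate directly into executable form. The following sketch models trees rooted at ϵ (a simplification, since the definition allows an arbitrary root u); the names are illustrative:

```haskell
import qualified Data.Set as Set
import Data.List (inits, isPrefixOf)

-- A tree over alphabet N, modelled as a set of words, rooted at ϵ.
type Word' = [Int]
type Tree  = Set.Set Word'

-- Prefix-closure above the root ϵ: every prefix of every vertex is
-- itself a vertex of the tree.
prefixClosed :: Tree -> Bool
prefixClosed t = and [ Set.member p t | w <- Set.toList t, p <- inits w ]

-- A leaf is maximal w.r.t. the prefix order: no other vertex of the
-- tree strictly extends it.
isLeaf :: Tree -> Word' -> Bool
isLeaf t w = not (any (\v -> w `isPrefixOf` v && v /= w) (Set.toList t))
```

For instance, {ϵ, 0, 1, 00} is prefix-closed with leaves 00, 1 and 10... is not a vertex, while {00} alone is not a tree, since its prefixes ϵ and 0 are missing.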
Subtrees and segments. Let T be a tree. A subset S of vertices of T is a subtree of T if S is a tree. Given a vertex u ∈ T , we call the greatest (with respect to set inclusion) subtree S of T with root u the segment of T rooted at u (Fig. 2 highlights an example segment in yellow). Two segments of T are overlapping if they intersect non-trivially, in which case one is contained in the other. A set of segments covers the tree T if the prefix-closure of their union equals T . That is, if for each u ∈ T there is a segment S and v ∈ S such that u ⪯ v.
Ordered trees. Trees as defined above capture the parent-child relation (via the prefix order on words) but do not impose any order on siblings. Yet, many tree traversals rely on a specific order on siblings. To be able to express such an order, we generalise the notion of trees to ordered trees. We do so by labelling trees over the natural numbers, using the usual order of the naturals (or rather, its lexicographic extension to words) to order siblings.
Formally, an ordered tree λ over X is a function λ : dom(λ) → X * such that • dom(λ) is a tree over N, • the image of λ is a tree over X , and • λ is an order isomorphism between the two trees, both ordered by the prefix order ⪯.
Since λ is an isomorphism of the prefix order, the lengths of the words u and λ(u) coincide for all u ∈ dom(λ). In an abuse of notation, we write λ to denote both the ordered tree (i. e. the function from dom(λ) to X * ) as well as the corresponding tree over X (i. e. the image of the function λ). When X is understood, we will simply call λ an (ordered) tree. To avoid confusion, we will call the elements of λ vertices, and the elements of dom(λ) positions. Fig. 2 shows an example ordered tree, where each position corresponds to the string of red numbers from the root to that node, i. e. dom(λ) is a tree over N; the figure also depicts the corresponding vertices of the ordered tree λ. As λ is an order isomorphism, the lexicographic ordering on dom(λ) carries over to the tree λ. That is, we define for all u, v ∈ dom(λ), λ(u) ≤ lex λ(v) if and only if u ≤ lex v, and ≤ lex becomes a total ordering on λ.
We call a function g : X * → X * an ordered tree generator if all images of g are isograms, i. e. have no repeating letters. Given an ordered tree generator g, we define λ g : dom(λ g ) → X * as the function with smallest domain such that • λ g (ϵ) = ϵ, and • λ g is closed under g in the following sense: For all positions u ∈ dom(λ g ) and corresponding vertices v = λ g (u), if g(v) = a 0 a 1 . . . a n−1 and i < n then ui is a position in dom(λ g ) and λ g (ui) = va i .
By construction λ g is an order isomorphism as images of g are isograms, hence λ g is an ordered tree, the ordered tree generated by g.
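An ordered tree generator and the traversal order it induces can be sketched directly. For finite trees, the depth-first, left-to-right traversal below enumerates every vertex exactly once, in the sequential search order ≤ lex on positions; the names are illustrative:

```haskell
-- An ordered tree generator maps a vertex (a word over X) to the
-- ordered word of its child letters; by the isogram condition the
-- letters within one word are distinct.
type Gen x = [x] -> [x]

-- Depth-first, left-to-right enumeration of the vertices of the tree
-- generated by g, starting from the root ϵ. For finite trees this
-- visits each vertex once, in the order induced by <=_lex.
dfs :: Gen x -> [[x]]
dfs g = go []
  where go v = v : concatMap (\a -> go (v ++ [a])) (g v)
```

For example, the generator that emits children 0 and 1 up to depth 2 yields the seven vertices ϵ, 0, 00, 01, 1, 10, 11, in that order.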
Example: Tree generators for clique problems. Let G = ⟨V , E⟩ be an undirected graph. Given a vertex u ∈ V , we denote its set of neighbours by E(u). We define a tree generator g : V * → 2 V by g(ϵ) = V and g(v 1 . . . v m ) = E(v 1 ) ∩ · · · ∩ E(v m ), i. e. the candidates extending a clique are exactly the common neighbours of its vertices.
Clearly, g is a generator for the tree t g over the alphabet X = V , enumerating all cliques of G. However, t g enumerates cliques as strings rather than sets and hence every clique of size k will be enumerated k! times.
To avoid enumerating the same clique multiple times, we need to generate an ordered tree where siblings ''to the right'' avoid vertices that have already been chosen ''on the left''. We construct an ordered tree over the alphabet X = V × 2 V , where the first component is the latest vertex added to the current clique and the second component is a set of candidate vertices that may extend the current clique. The candidate vertices are incident to all vertices of the current clique, but do not necessarily form a clique themselves. We define the ordered tree generator h by

h(⟨u 1 , U 1 ⟩ . . . ⟨u m , U m ⟩) = ⟨v 1 , V 1 ⟩ . . . ⟨v n , V n ⟩

where U = U m if m > 0, and U = V otherwise; {v 1 , . . . , v n } = U; and V i = (U ∖ {v 1 , . . . , v i }) ∩ E(v i ). Typically, the ⟨v i , V i ⟩ are ordered such that the size of V i decreases as i increases; this order is beneficial for sequential branch and bound traversals.
Clearly, h is an ordered generator for an ordered tree enumerating all cliques of G exactly once (ignoring the second component of the alphabet). Fig. 2 shows a tree generated by h for the graph from Fig. 1.
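Following the construction described above (each child keeps only candidates adjacent to it and lying to its right, so siblings to the right avoid vertices chosen on the left), clique enumeration can be sketched as follows; the names and the neighbourhood-function representation are illustrative:

```haskell
import qualified Data.Set as Set
import Data.List (tails)

type Vertex = Int
type Graph  = Vertex -> Set.Set Vertex   -- neighbourhood function E

-- One step of the ordered generator: from a candidate list, child v
-- keeps only candidates that are adjacent to v and lie strictly to
-- its right in the list.
children :: Graph -> [Vertex] -> [(Vertex, [Vertex])]
children e cands = [ (v, filter (`Set.member` e v) vs) | (v:vs) <- tails cands ]

-- Enumerate every clique of the graph exactly once (including the
-- empty clique), mirroring the tree generated by h.
cliques :: Graph -> [Vertex] -> [[Vertex]]
cliques e cands =
  [] : concat [ map (v:) (cliques e vs) | (v, vs) <- children e cands ]
```

A triangle on three vertices has 8 cliques (every vertex subset), while a three-vertex path has 6, and each appears exactly once in the enumeration.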

Maximising tree traversals
The trees defined above materialise the search space and order traversals. What is needed for modelling branch-and-bound is an objective function to be computed during traversal that the search aims to maximise.
Let Y be a set with a total quasi-order ⊑, that is ⊑ is a reflexive and transitive, but not necessarily anti-symmetric, total binary relation on Y .
Given a tree T over X and an objective function f : X * → Y , the goal is to maximise f over T , i. e. to find some u ∈ T such that f (u) ⊒ f (v) for all v ∈ T . The objective function is required to be monotonic w. r. t. the prefix order, that is, f (u) ⊑ f (u ′ ) for all u, u ′ ∈ X * with u ⪯ u ′ .
So far, we have modelled maximising tree search. To model branch-and-bound we introduce one additional refinement: a predicate p for pruning subtrees that cannot improve the incumbent. More precisely, the pruning predicate p : Y × X * → {0, 1} is a function mapping the incumbent (i. e. the maximal value of f seen so far) and the current vertex to 1 (for prune) or 0 (for explore).
The pruning predicate must satisfy the following monotonicity and compatibility conditions, for all y, y ′ ∈ Y and u, u ′ ∈ X * :

1. p(y, u) = 1 and u ⪯ u ′ implies p(y, u ′ ) = 1.
2. p(y, u) = 1 and y ⊑ y ′ implies p(y ′ , u) = 1.
3. p(y, u) = 1 implies f (u) ⊑ y.

Condition 1 implies that all descendants u ′ of a pruned vertex u are also pruned. Condition 2 implies that a vertex pruned by incumbent y is also pruned by any stronger incumbent y ′ . Finally, Condition 3 states the correctness of pruning w. r. t. maximising the objective function: vertex u is pruned by incumbent y only if f (u) does not beat y.
How exactly pruning will interact with the tree traversal will be detailed in the next section. Note that pruning is an optimisation and must not be used to constrain the search space. That is, the result of the tree traversal must be independent of the pruning predicate. In particular, the trivial pruning predicate that always returns 0 (and hence prunes nothing) is a legal predicate.
Example: Objective function and pruning predicate for clique problems. For maximum clique, we set Y = N, and the quasi-order ⊑ is the natural order ≤. We define the objective function f : X * → Y by f (w) = |w|. That is, maximising f means finding cliques of maximum size. We define the pruning predicate p : N × X * → {0, 1} by p(l, u) = 1 if and only if m + |U m | ≤ l, where u is a word of length m whose last letter is ⟨v m , U m ⟩. That is, pruning decisions rest on the size of the current clique, m, and the size of the set of remaining candidate vertices U m ; vertices will be pruned if adding these two sizes does not exceed the current bound l. 3
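The objective function and pruning predicate for maxclique are small enough to state directly. In this illustrative sketch a vertex is represented simply by its current clique and candidate set, a simplification of the paired alphabet used above:

```haskell
-- Objective function for maxclique: f(w) = |w|, the clique size.
objective :: [v] -> Int
objective = length

-- Pruning predicate: prune when the clique size plus the number of
-- remaining candidates cannot beat the incumbent size l. Conditions
-- 1-3 hold: extending the word shrinks the candidate set, and a
-- stronger incumbent prunes at least as much.
prune :: Int -> ([v], [v]) -> Bool
prune l (clique, cands) = length clique + length cands <= l
```

For example, a clique of size 2 with one candidate left is pruned by an incumbent of size 3 (2 + 1 ≤ 3) but not by an incumbent of size 2.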

Modelling multi-threaded tree traversals
For this section, we fix an ordered tree λ over X , which we will traverse according to the order ≤ lex . We also fix an objective function f : X * → Y , and a pruning predicate p : Y × X * → {0, 1}, where Y is a set with a total quasi-order ⊑. Finally, we fix a set SEG of pairwise non-overlapping tree segments that cover the tree λ; we call each segment S ∈ SEG a task.
State. Let n ≥ 1 be the number of threads. The state of a backtracking tree traversal is an (n + 2)-tuple of the form σ = ⟨x, Tasks, θ 1 , . . . , θ n ⟩, where x ∈ X * is the incumbent, Tasks is a queue of tasks, and each θ i is either ⊥, meaning the ith thread is idle, or a pair ⟨S i , v i ⟩, where S i ∈ SEG is the ith thread's current task and v i ∈ S i the currently explored vertex of that task. We use Haskell list notation for the task queue Tasks. That is, [ ] denotes the empty queue, and S : Tasks denotes a non-empty queue with head S ∈ SEG.
The initial state is ⟨ϵ, Tasks, ⊥, . . . , ⊥⟩, where Tasks holds all tasks in SEG. The reduction rules describe how a state evolves. Rule (strengthen i ) applies if the ith thread is not idle and its current vertex v i beats the incumbent on f ; its effect is to make v i the new incumbent. Of the remaining four rules exactly one will be applicable to the ith thread (unless a final state is reached). Rules (schedule i ) and (prune i ) apply if the ith thread is idle and the task queue is non-empty. Which of the two rules applies depends on whether the root vertex v i of the head task S in the queue is to be pruned or not. If not, S becomes the ith thread's current task and v i the current vertex; otherwise task S is pruned and the ith thread remains idle.
Rules (advance i ) and (terminate i ) apply if the ith thread is not idle. Which of the two rules applies depends on whether all vertices of the current task S i beyond the current vertex v i (in the lexicographic order < lex ) are to be pruned according to predicate p. If so, the ith thread terminates the current task and becomes idle, otherwise the thread advances to the next vertex v ′ i that is not pruned.
It is easy to see that no rule is applicable if and only if all threads are idle and the task queue is empty, that is, iff a final state is reached.
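To make the rules concrete, the following is a single-threaded rendering of the traversal under stated assumptions: tasks are taken from the queue in order, (prune) skips a segment whose root is pruned, and (strengthen) is applied eagerly before advancing. The `expand` function stands in for the tree segments by returning the children of a vertex; all names are illustrative:

```haskell
-- A minimal sequential instance of the reduction rules (n = 1).
search
  :: (node -> [node])       -- children of a vertex (tree generator)
  -> (node -> Int)          -- objective function f
  -> (Int -> node -> Bool)  -- pruning predicate p
  -> [node]                 -- task queue: roots of the segments
  -> Int                    -- initial incumbent value
  -> Int                    -- value of the maximum
search expand f p = go
 where
  go []       x = x                    -- queue empty: final state
  go (v:rest) x
    | p x v     = go rest x            -- (prune): skip the whole segment
    | otherwise = go rest (walk v x)   -- (schedule): traverse the segment
  walk u y =
    let y' = max y (f u)               -- (strengthen), applied eagerly
    in foldl (\acc c -> if p acc c     -- (advance)/(terminate): skip
                        then acc       --   pruned children, recurse
                        else walk c acc) y' (expand u)
```

On a complete graph with three vertices, with nodes represented as (clique, candidates) pairs and the maxclique pruning predicate, this traversal returns the maximum clique size 3.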

Admissible reductions.
The reduction rules in Fig. 3 do not specify an ordering on the rules nor stipulate any restriction on the relative speed of execution of different threads. However, applying the rules in just any order is too liberal. In particular, not selecting rule (strengthen i ) when the incumbent could in fact be strengthened may result in missing the maximum. To avoid this, rule (strengthen i ) must be prioritised as follows.
We call a reduction σ → σ ′ inadmissible if it uses rule (advance i ) or (terminate i ) even though rule (strengthen i ) was applicable to the ith thread; otherwise the reduction is admissible. Admissible reductions prioritise rule (strengthen i ) over rules (advance i ) and (terminate i ).
By induction on the length of the reduction sequence, one can show that an incumbent x maximises the objective function f over the ordered tree λ whenever ⟨x, [ ], ⊥, . . . , ⊥⟩ is a final state reachable from the initial state ⟨ϵ, Tasks, ⊥, . . . , ⊥⟩ by a sequence of admissible reductions.
We point out that final states are generally not unique. For instance, a graph may contain several different cliques of maximum size, and a parallel maxclique search may non-deterministically return any of these maximum cliques. Therefore the reduction relation cannot be confluent.
Example: Reductions for maxclique. Consider the tree in Fig. 2 encoding the graph in Fig. 1, and let Tasks be a queue of tasks such that S i is the segment rooted at vertex i; for example the segment S 2 is determined by the set of positions {2, 20, 200, 201, 21, 22}. Clearly, the S i are pairwise non-overlapping and cover the whole tree. In Fig. 4, we consider a sample reduction with three threads (with IDs 1 to 3) following a strict round-robin thread scheduling policy, except for selecting the strengthening rule eagerly (that is, as soon as it is applicable). For convenience, we display the reduction rule used in the leftmost column and index the reduction arrow with the number of reductions.
We observe that up to reduction 11, the three threads traverse the search tree segments S 0 , S 1 and S 2 in parallel. From reduction 12 onwards, the incumbent is strong enough to enable pruning according to the heuristic, i.e. prune if the size of the current clique plus the number of candidates does not beat the size of the incumbent. Column pruned lists the positions of the search tree where traversal stopped due to pruning; column cut off lists the positions that were never reached due to pruning. The reduction illustrates that parallel traversals potentially do more work than sequential ones in the sense that fewer positions are cut off. Concretely, thread 3 traverses segment S 2 because the incumbent is too weak; a sequential traversal would have entered S 2 with the final incumbent and pruned immediately, as indicated by the dashed line in Fig. 2. The reduction also illustrates that parallelism may reduce runtime: a sequential traversal would explore first S 0 and then S 1 , whereas thread 2 locates the maximum clique in S 1 without traversing S 0 first.

Generic branch and bound search
This section uses the model in Section 3 as the basis of a Generic Branch and Bound (GBB) API for specifying search problems. The GBB API makes extensive use of higher-order functions, i.e. functions that take functions as arguments, and hence is suitable for parallel implementation in the form of skeletons (Section 5).
We introduce each of the GBB API functions, give their types and show an example of how to use them in a simple implementation of the Maximum Clique problem (Section 2). Later sections show that the API is general enough to encode other branch and bound applications (Sections 6.1 and 6.2).
We start by considering the key types and functions required to specify a general branch and bound search. The API functions and types are specified in Haskell [23] in Listing 1.

Types
The fundamental type for a search is a Node that represents a single position within a search tree (for example in Fig. 2 each box represents a node). This notion of a node differs slightly from the BBM, where a single type, X*, is used to uniquely identify a particular tree node by the branches leading to it. For an efficient implementation, rather than store an encoding of the branch through the tree, the node type uses the partial solution to encode the branch history and the candidate set to encode potential next steps in the branch. The current bound is maintained for efficiency reasons but could alternatively be calculated from the current solution as in the BBM.
The abstract types are described below, and Table 1 shows how the abstract types map to implementation specific types for Maximum Clique (Section 2), Knapsack (Section 6.1) and Travelling Salesperson (Section 6.2) searches.
Space: Represents the domain specific structure to be searched.

Solution: Represents the current (partial) solution at this node. The solution is an application specific representation of a branch within the tree and encodes the history of the search.

Candidates: Represents the set of candidates that may still be added to the solution to extend the search by a single step. This may be used to encode implementation specific details, such as permitting only adjacent vertices in a maximum clique search, or simply ensuring that no variable is chosen twice. It is not required that the type of the candidates matches the type of the search space. This enables implementation-specific optimisations such as the bitset encoding found in the BBMC algorithm (Section 7.1.1).

Bound: Represents the bound computed from the current solution. There must be an ordering on bounds, for example as provided by an instance of Haskell's Ord typeclass [19], to allow a maximising tree traversal to be performed implicitly using the type.

Node: Represents a position within the search space. For efficiency it caches the current bound, current solution and candidates for expansion.

Function usage
It is perhaps surprising that the application specific aspects of a branch and bound search can be both precisely specified and efficiently implemented with just two functions. The GBB API functions rely on the implicit ordering on the bound type, but could easily be extended to take an ordering function as an argument.

orderedGenerator: Generates the set of candidate child nodes from a node in the space. Search heuristics can be encoded by ordering the child nodes in a list. The search ordering may use these heuristics to provide simple in-order tree traversal or more elaborate heuristics such as depth based discrepancy search (Section 7.1.1).

pruningHeuristic: Returns a speculative best possible bound for the current node. If this bound cannot unseat the global maximum then early backtracking should occur, as it is impossible for child nodes to beat the current incumbent.

These functions correspond to the branching and bounding functions respectively. We chose the names orderedGenerator and pruningHeuristic to highlight their purposes: to generate the next steps in the search and to determine whether pruning should occur.
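The concrete API is given in Listing 1; for orientation, the following is a hedged sketch of the shapes these types and functions might take. The record structure and type-variable names are assumptions; only the function names and the Node components (bound, solution, candidates) come from the text.

```haskell
-- A node caches the current bound, partial solution and candidate set
-- (a sketch of the Node type described in the Types section).
data Node sol bnd cand = Node
  { bound      :: bnd   -- bound computed from the current solution
  , solution   :: sol   -- partial solution, encoding the branch history
  , candidates :: cand  -- candidates that may extend the solution
  }

-- The two application-specific functions of a branch and bound search:
-- branching (ordered child generation) and bounding (pruning heuristic).
type OrderedGenerator space sol bnd cand =
  space -> Node sol bnd cand -> [Node sol bnd cand]

type PruningHeuristic space sol bnd cand =
  space -> Node sol bnd cand -> bnd  -- speculative best possible bound
```

The implicit Ord constraint on the bound type is applied at the call sites rather than baked into these synonyms.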
Listing 2 shows instances of these GBB functions that encode a simple, IntSet based, version of the Maximum Clique search. The orderedGenerator builds a set of candidate nodes based on a greedy graph colouring algorithm (colourOrder). The colourings provide a heuristic ordering and, by storing them alongside the solution's vertices, allow effective bounding to be performed. Candidates only include vertices that are adjacent to every vertex already in the clique. The pruningHeuristic checks if the number of vertices in the current clique and potential colourings can possibly unseat the incumbent. See Section 7.1.1 for instances of the GBB API that use a more realistic bitset encoding [52,55].

General branch and bound search algorithm
The essence of a branch and bound search is a recursive function for traversing the nodes of the search space. Algorithm 1 shows the function expressed in terms of the GBB API (Listing 1) where we assume that the incumbent and associated bound are read and written by function calls rather than being explicitly passed as arguments and returned as a result. Hence the final solution is read from the global accessor function instead of the algorithm returning an explicit value. As we are dealing with maximising tree traversals, bounds are always compared using a greater than (>) function defined on the Bound type.
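The traversal of Algorithm 1 can be sketched in Haskell as follows. The IORef-based incumbent stands in for the paper's global accessor functions (currentBound, updateBest); the helper names and exact recursion structure are assumptions.

```haskell
import Control.Monad (when)
import Data.IORef

-- Node as in the GBB API: cached bound, partial solution, candidates.
data Node sol bnd cand = Node
  { bound      :: bnd
  , solution   :: sol
  , candidates :: cand
  }

-- Sequential maximising traversal in the shape of Algorithm 1; the space
-- argument is assumed already applied to the generator and heuristic.
expand :: Ord bnd
       => IORef (sol, bnd)                            -- global incumbent
       -> (Node sol bnd cand -> [Node sol bnd cand])  -- orderedGenerator
       -> (Node sol bnd cand -> bnd)                  -- pruningHeuristic
       -> Node sol bnd cand
       -> IO ()
expand best gen prune node = mapM_ step (gen node)
  where
    step child = do
      (_, curBnd) <- readIORef best
      -- a child that beats the incumbent becomes the new incumbent
      when (bound child > curBnd) $
        writeIORef best (solution child, bound child)
      (_, curBnd') <- readIORef best
      -- recurse only if the speculative bound can unseat the incumbent;
      -- otherwise backtrack early (prune)
      when (prune child > curBnd') $
        expand best gen prune child
```

The final solution is then read from the incumbent reference, matching the paper's convention of not returning an explicit value.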
Parallelism may be introduced by searching the set of candidates speculatively in parallel, as illustrated in Section 5. Parallel search branches allow early updates of the incumbent via (globally) synchronised versions of the currentBound and updateBest functions.

Implementing the GBB API
Although GBB can encode general branch and bound searches, various modifications improve both sequential and parallel efficiency.
Generally the search space is immutable and fixed at the start of the search. In a distributed environment we can avoid copying the search space each time a task is stolen by storing a read-only copy of the search space on each host. It is also possible to remove the space argument from the API functions and add accessor functions in the same manner as bound access. The implementations used in Section 7 do pass the space as a parameter.
For some applications, such as Maximum Clique, if the local bound fails to unseat the incumbent then all other candidate nodes to-the-right (assuming an ordered generator) will also fail the pruning predicate. An implementation can take advantage of this fact and break the candidate checking loop for an early backtrack. This optimisation is key in avoiding wasteful search. In the skeleton implementations used in Section 7 we allow this behaviour to be toggled via a runtime flag.
Finally, an implementation can exploit lazy evaluation within the node type to avoid redundant computation. Taking Maximum Clique as an example, we can delay the computation of the set of candidate vertices until after the pruning heuristic has been checked (as the heuristic only depends on having the bound and colour). Similarly, if we use the to-the-right pruning optimisation described above, we avoid paying the cost of generating the nodes that end up being pruned.
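The lazy-candidates point can be illustrated with a toy node whose candidates field would fail if ever forced; the field names are illustrative, not the paper's implementation.

```haskell
-- Haskell record fields are lazy by default, so the candidates thunk is
-- only evaluated if something actually demands it.
data Node = Node
  { bound      :: Int
  , solution   :: [Int]
  , candidates :: [Int]  -- potentially expensive; computed lazily
  }

-- The pruning check touches only the bound, so for a pruned node the
-- candidates thunk is never forced.
survivesPruning :: Int -> Node -> Bool
survivesPruning incumbent n = bound n > incumbent

-- A node whose candidate computation would blow up if evaluated.
demoNode :: Node
demoNode = Node
  { bound      = 3
  , solution   = [1]
  , candidates = error "expensive candidate computation forced!"
  }
```

Evaluating survivesPruning 5 demoNode returns False without ever touching the poisoned candidates field; a strict representation would have to pay for (or explicitly defer) that computation.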

Parallel skeletons for branch and bound search
Algorithmic skeletons are higher order functions that abstract over common patterns of coordination and are parameterised with specific computations [12]. For example, a parallel map function will apply a sequential function to every element of a collection in parallel. Skeletons are polymorphic, so the collection may contain elements of any type, and the function type must match the element type. The programmer's task is greatly simplified as they do not need to specify the coordination behaviour required. The skeleton model has been very influential, appearing in parallel standards such as MPI and OpenMP [10,57], and distributed skeletons such as Google's MapReduce [13] are core elements of cloud computing.
Here the focus is on designing skeletons for maximising branch and bound search on distributed memory architectures. These architectures use multiple cooperating processes with distinct memory spaces. The processes may be spread across multiple hosts.
Although it is possible to implement skeletons using a variety of parallelism models, we adopt a task parallel model here. The task parallel model is based around breaking down a problem into multiple units of computation (tasks) that work together to solve a particular problem. In a distributed setting, tasks (and their results) may be shared between processes. For search trees, parallel tasks generally take the form of sub-trees to be searched.
Two skeleton designs are given in this section. The first skeleton, Unordered, makes no guarantees on the search ordering and so may give the anomalous behaviours and the unpredictable parallel performance outlined in Section 2.2. The second skeleton, Ordered, enforces a strict search ordering and hence avoids search anomalies and gives predictable performance. The unordered skeleton is used as an example of the pitfalls of using a standard random work stealing approach and provides a baseline comparison for evaluating the performance of the Ordered skeleton (Section 7).
We start by considering the key design choices for constructing a branch and bound skeleton. Using these, we show how each of the two skeletons is constructed.

Design choices
Three main questions drive the design of branch and bound search skeletons:

1. How is work generated?
2. How is work distributed and scheduled?
3. How are the bounds propagated?
The first two choices focus on task parallel aspects of the design and are common design features for algorithmic skeletons. Bound propagation is a specific issue for branch and bound search and takes the form of a general coordination issue rather than being tied to the task parallel model.
To achieve performance in the task parallel model, tasks should be oversubscribed, that is, there should be more tasks than cores, while avoiding low task granularity where communication and synchronisation overheads may outweigh the benefits of the parallel computation. To achieve these characteristics in the skeleton designs a simple approach for work generation is used: generate parallel tasks from the root of the tree until a given depth threshold is reached. This method exploits the heuristic that tasks near the top of the tree are usually of coarser granularity than those nearer the leaves, i.e. they have more of the search space to consider. This threshold approach is commonly used in divide-and-conquer parallelism and allows a large number of tasks to be generated while avoiding low granularity tasks. The argument that tasks near the top of the tree have coarse granularity does not necessarily hold for all branch and bound searches, as varying candidate sets and pruning can truncate some searches initiated near the root of the tree: hence task granularity may be highly irregular.
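The depth-threshold work generation described above can be sketched generically; the function name, the generator type, and the treatment of leaves above the threshold are assumptions.

```haskell
-- Expand the tree to a fixed depth, collecting the frontier nodes as
-- parallel tasks in left-to-right order, preserving the heuristic order.
createTasks :: Int -> (a -> [a]) -> a -> [a]
createTasks 0 _   node = [node]  -- depth threshold reached: node is a task
createTasks d gen node =
  case gen node of
    []       -> [node]  -- a leaf above the threshold is kept as a (tiny) task
    children -> concatMap (createTasks (d - 1) gen) children
```

For a branching factor b this yields roughly b^d tasks, which is why even a depth of 1 can already produce thousands of tasks for wide trees (Section 5.5).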

Unordered skeleton
The type signature of the Unordered skeleton is:

search :: Int                      -- Depth to spawn to
       → Node Sol Bnd Candidates   -- Root node
       → …                         -- orderedGenerator, …

In the skeleton, search tasks recursively generate work, i.e. new search tasks. If the depth of a search task does not exceed the threshold it generates new tasks on the host; otherwise the task searches the subtree sequentially.
Work distribution takes the form of random work stealing with exponential back-off [7] and happens at two levels. Intranode steals occur between two workers in the same process, the next sub-tree is stolen from the workqueue of the local process. Only if the worker fails to find local work does an internode steal occur, targeting some random other process. Only one internode steal per process is performed at a time. New tasks, either created by local workers or stolen from remote processes, are added to the local workqueue and are scheduled in last-in-first-out order.
The current incumbent, i.e. best solution, is held on every host and managed by a distinguished master process. Bound propagation proceeds in two stages. Firstly, when a search task discovers a new solution it sends both the solution and bound to the master and, if no better solution has yet been found, they replace the incumbent. Secondly, the master broadcasts the new bound to all other processes, which update their local incumbent unless they have located a better solution. This is a form of eventual consistency [62] on the incumbent. Using this approach, as opposed to a fully peer-to-peer one, the new solution is sent to the master once and only bounds are broadcast. While broadcast is bandwidth intensive, broadcasting new bounds provides fast knowledge transfer between search tasks. Moreover, experience shows that a good, although not necessarily optimal, bound is often found early in the search, making bound updates rare. In many applications the bounds are range-limited, e.g. a Maximum Clique cannot be larger than the number of vertices in the graph.
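The master-side half of this protocol might be sketched as below. The name updateBest follows the text; the IORef representation and the Bool "broadcast needed" flag are assumptions, and the actual messaging is elided.

```haskell
import Data.IORef

-- Atomically replace the incumbent if the new (solution, bound) pair
-- improves on it. The returned Bool indicates whether the new bound
-- should be broadcast to the other processes.
updateBest :: Ord bnd => IORef (sol, bnd) -> (sol, bnd) -> IO Bool
updateBest incumbent (sol, bnd) =
  atomicModifyIORef' incumbent $ \old@(_, oldBnd) ->
    if bnd > oldBnd
      then ((sol, bnd), True)  -- improved: store and broadcast new bound
      else (old, False)        -- stale update: keep current incumbent
```

A worker process would apply the same improve-only rule when a broadcast bound arrives, which is what makes late or reordered updates harmless (eventual consistency).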

Ordered skeleton
The type signature of the Ordered skeleton is as follows.
The additional first parameter enables discrepancy search ordering (Section 7.1.1) to be toggled; an alternative formulation would be to pass an ordering function in explicitly. The skeleton adapts the Unordered skeleton to avoid search anomalies (Section 2.2) and give predictable performance properties as shown in Section 1.
The Sequential Bound property guarantees that parallel runtimes do not exceed the sequential runtime. To maintain this property we enforce that at least one worker executes tasks in the exact same order as the sequential search. The other workers speculatively execute other search tasks and may improve the bound earlier than in the fully sequential case, as illustrated in Fig. 4. Discovering a better incumbent early enables the sequential thread to prune more aggressively and hence explore less of the tree than the entirely sequential search would, providing speedups. While there is no guarantee that the speculative workers will improve the bound, the property will still be maintained by the sequential worker.
Requiring a sequential worker is a departure from the fully random work stealing model. Instead of all workers performing random steals, the task scheduling decisions are enforced for the sequential worker. Our system achieves sequential ordering by duplicating the task information. One set is stealable by any worker, and the other is restricted to the sequential worker. There is a chance that work will be duplicated as some worker may simultaneously attempt to start the same task as the sequential worker. To avoid duplicating work, we use a basic locking mechanism where workers first check whether a task has already started execution before starting the task themselves.
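The check-before-start guard described above can be sketched with an atomic per-task flag; the name claimTask and the IORef representation are assumptions (the paper only says a basic locking mechanism is used).

```haskell
import Data.IORef

-- Both the sequential worker and a stealing worker may hold a reference
-- to the same (duplicated) task; each atomically checks-and-sets a
-- "started" flag before executing it, so exactly one caller wins.
claimTask :: IORef Bool -> IO Bool
claimTask started =
  atomicModifyIORef' started $ \taken ->
    if taken
      then (True, False)  -- already started elsewhere: skip this copy
      else (True, True)   -- we claimed it: run the task
```

In a distributed setting the flag would live with the task's owning process and the check-and-set would be a message exchange, but the improve-exactly-once shape is the same.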
With random scheduling, adding a worker may disrupt a good parallel search order (Section 2.2), so to guarantee the non-increasing runtimes property we need to preserve the parallel search order, just as the sequential worker preserves the sequential search order. Preserving the parallel search order means that if with n workers we locate an incumbent by time t_pn, then with n + 1 workers we locate the same incumbent, or a better one, at approximately t_pn. The approximation is required as, in a distributed setting, t_pn may vary slightly due to the speed of bound propagation.
It transpires that preserving the parallel search order is also sufficient to guarantee the repeatability property as all parallel executions follow very similar search orders. The parallel search order must be globally visible for it to be preserved, and we can no longer permit random work stealing. Instead all tasks are generated on the master host and maintained in a central priority queue. In our skeleton implementation we use depth-bounded work generation to statically construct a fixed set of tasks, with set priorities, before starting the search. Alternative work generation approaches, for example dynamic generation, are possible provided all tasks are generated on the master host.
The parallel search order may have dramatic effects on search performance [11,37,38]. In our skeletons any fixed ordering will maintain the properties, although it may not guarantee good performance. The GBB API in Section 4 relies on the user choosing an ordering of nodes in the orderedGenerator function. This ordering is generally, but not necessarily, based on some domain specific heuristic. One simple scheduling decision, and our default, is to assign priorities from left-most to right-most task in the tree. The skeleton may use any priority order rather than the default left-to-right order, for example the depth-bounded discrepancy (DDS) order [63]. This discrepancy ordering is used when evaluating the Maximum Clique benchmark (Section 7.1.1).
By augmenting the Unordered skeleton with a single worker that follows the sequential ordering and a global priority ordering on tasks we arrive at the Ordered skeleton that provides reliable performance guarantees while still enabling parallelism. Table 2 compares the key design features of the two skeletons.

Skeleton comparison
A key difference is where tasks are generated and stored. The Unordered skeleton adopts a dynamic approach at the cost of not giving the same performance guarantees as the Ordered skeleton due to a lack of global ordering. Many other skeleton designs are possible. An advantage of the skeleton approach that exploits a general API is that parallel coordination alternatives may be explored and evaluated without refactoring the application code.

Limitations
For most design choices we have selected a simple alternative. More elaborate alternatives might well deliver better performance. Here we discuss some of the limitations imposed by the simple alternatives selected.
One key limitation of both skeleton designs is the use of depth bounded work generation techniques. While this technique is a well known optimisation for divide and conquer applications, the need to manually tune the depth threshold reduces the skeleton portability, as the number of tasks required to populate a system is proportional to the system size. Given the irregular structure of a branch and bound computation it is often difficult to know ahead of time how many tasks will need to be generated to avoid starvation and fully exploit the resources available. In practice we have not found this to be an issue: for many problem instances, such as the three benchmarks used in the skeleton evaluation (Section 6), even generating work to a depth of 1 can give thousands of tasks. However, for some instances, to achieve best performance one may need to split work at much lower levels [37]. An alternative would be to use dynamic work generation techniques where the parallel coordination layer manages load in the system [1]. Dynamic work generation makes it difficult to maintain a global task ordering in a distributed environment, as required by the Ordered skeleton.

A consequence of static work generation in the Ordered skeleton is that the runtime for the single worker case can be larger than that of a fully sequential search implementation. With static work generation, work is generated from nodes at a depth d ahead of time and the parent nodes are no longer considered (as they are already searched). This leads to the creation of additional tasks that a sequential implementation may never create due to pruning at the higher levels. The management and searching of these additional tasks causes the discrepancy between the single worker Ordered skeleton and a purely sequential search. While this does not affect the properties, as we phrase property 1 in terms of a single worker, it would if property 1 were instead phrased in terms of a purely sequential implementation.
The effects of this limitation could be mitigated by treating all nodes above the depth threshold as tasks and allowing cancellation of parent/child tasks. Such an approach complicates the task coordination greatly as tasks require knowledge of both their parent and child task states.
The Ordered skeleton requires additional memory and processing time on the master host to maintain the global task list and respond promptly to work stealing requests. In practice we have not found this to be a significant issue, as most tasks near the top of the search tree are long running and the steals occur at irregular intervals. On large distributed systems, and for some searches, it is possible that a single master might prove to be a scalability bottleneck.

Implementation
The Ordered and Unordered skeletons are implemented in the Haskell distributed parallel Haskell (HdpH) embedded Domain Specific Language (DSL) [30]. HdpH has been modified to use a priority queue based scheduler to enable the strict ordering on task execution. While HdpH cannot match the performance of state of the art branch and bound search implementations, it is useful for evaluating the skeletons for the following reasons.
1. HdpH supports higher-order functions, a commonly used approach for constructing skeletons.
2. The HdpH implementation is small and easy to modify, allowing ideas to be rapidly prototyped. For example, we experimented with priority-based work stealing.
3. The properties of the Ordered skeleton depend on relative runtime values, i.e. absolute runtime is not the priority.
Although our skeletons have been implemented in a functional language they may be implemented in any system with the following features: task parallelism; work stealing (random/single-source); locking; priority based work-queues/task ordering. Distributed memory skeleton implementations will also require distribution mechanisms and distributed locking.

Maximum Clique representation
To end this section we show, using the functions and types defined in Listing 2, how the search skeletons are used within an application. Here we show how the skeleton is called for the Maximum Clique benchmark (Section 2):
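The listing for this call is not reproduced in this extract, so the following is a hedged, self-contained stand-in showing the shape of such a call on a toy graph. The search function below is a sequential placeholder for the skeleton's entry point (the spawn depth and HdpH machinery are elided), and the generator and heuristic are simplified IntSet versions standing in for Listing 2's colour-based ones; all names are assumptions.

```haskell
import qualified Data.IntSet as IntSet
import           Data.IntSet (IntSet)

type Vertex = Int
data Node = Node { bound :: Int, clique :: [Vertex], cands :: IntSet }

-- branching: extend the clique by each candidate vertex, keeping only
-- candidates adjacent to it (no colour ordering in this toy version)
generator :: (Vertex -> IntSet) -> Node -> [Node]
generator adj (Node b c cs) =
  [ Node (b + 1) (v : c) (IntSet.intersection cs (adj v))
  | v <- IntSet.toList cs ]

-- bounding: optimistic size if every remaining candidate could be added
heuristic :: Node -> Int
heuristic (Node b _ cs) = b + IntSet.size cs

-- sequential stand-in for the skeleton's search entry point
search :: (Node -> [Node]) -> (Node -> Int) -> Node -> Node
search gen h root = go root root
  where
    go best n = foldl step best (gen n)
    step best child
      | bound child > bound best = go child child  -- new incumbent
      | h child     > bound best = go best child   -- still promising
      | otherwise                = best            -- prune
```

A call then looks like search (generator adj) heuristic (Node 0 [] allVertices); the skeleton versions additionally take the spawn depth and run the same generator and heuristic under HdpH.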

Other branch and bound skeletons
While algorithmic skeletons are widely used in a range of areas, from processing large datasets [13] to multicore programming [49], there has been little work on branch and bound skeletons. Two notable exceptions are MALLBA [2] and Muesli [45], which both provide distributed branch and bound implementations. Both frameworks are written in C++. Muesli uses a similar higher-order function approach to ours, while MALLBA is designed around using classes and polymorphism to override solver behaviour. In Muesli it is possible to choose between a centralised workpool approach, similar to the Ordered skeleton but using work-pushing rather than work stealing, or a distributed method. Unfortunately the centralised workpool model does not scale well compared with our approach (Section 7). MALLBA similarly uses a single, centralised, workqueue for its branch and bound implementation. The real strength of the MALLBA framework lies in its ability to encode multiple exact and inexact combinatorial skeletons, as opposed to just branch and bound.
The Muesli authors further highlight the need for reproducible runtimes and note ''the parallel algorithm behaves nondeterministically in the way the search-space tree is explored. In order to get reliable results, we have repeated each run 100 times and computed the average runtimes'' [45]. By adopting the strictly ordered approach in this paper we avoid the need for large numbers of repeated measurements to account for non-deterministic search ordering.

Model, API and skeleton generality
To show that the BBM model and GBB API are generic, and to provide additional evidence that the Ordered skeleton preserves the parallel search properties (Section 7) we consider two additional search benchmarks: 0/1 Knapsack, a binary assignment problem, and Travelling Salesperson, a permutation problem.

0/1 Knapsack
Knapsack packing is a classic optimisation problem. Given a container of some finite size and a set of items, each with some size and value, which items should be added to the container in order to maximise its value? Knapsack problems have important applications such as bin-packing and industrial decision making processes [51]. There are many variants of the knapsack problem [32], typically changing the constraints on item choice. For example we might allow an item to be chosen multiple times, or fractional parts of items to be selected. We consider the 0/1 knapsack problem where an item may only be selected once and fractional items are not allowed.
At each step a bound may be calculated using a linear relaxation of the problem [33]: instead of solving for x_i ∈ {0, 1} we solve the fractional knapsack problem, where x_i ∈ [0, 1]. The greedy fractional approach is optimal for the relaxed problem and provides an upper bound on the maximum potential value. Although it is possible to compute an upper bound on the entire computation by considering the choices at the top level [31], we do not implement this here. The primary benefit of this method is to terminate the search early when a maximal solution is found. A formalisation of the 0/1 Knapsack problem in BBM and the corresponding GBB implementation are given in Appendices B and C respectively.
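The linear-relaxation bound can be sketched as a greedy fractional packing: take items in descending value density and split the first item that no longer fits. The function name and the (value, weight) pair representation are assumptions.

```haskell
import Data.List (sortBy)
import Data.Ord  (comparing, Down (..))

-- Upper bound on the best 0/1 packing of the given capacity: pack items
-- greedily by value density, taking a fraction of the item that overflows.
fractionalBound :: Double -> [(Double, Double)] -> Double
fractionalBound capacity items = go capacity sorted
  where
    density (v, w) = v / w
    sorted = sortBy (comparing (Down . density)) items  -- best density first
    go _   []             = 0
    go cap ((v, w) : rest)
      | cap <= 0  = 0
      | w <= cap  = v + go (cap - w) rest  -- take the whole item
      | otherwise = v * (cap / w)          -- take a fraction and stop
```

Because the fractional optimum always dominates the 0/1 optimum, this value is safe to use as the pruning heuristic for the remaining capacity at a node.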

Travelling Salesperson problem
Travelling Salesperson (TSP) is another classic optimisation problem. Given a set of cities to visit and the distance between each city find the shortest tour where each city is visited once and the salesperson returns to the starting city. We consider only symmetric instances where the distance between two cities is the same travelling in both directions.
A formalisation of TSP in BBM and the corresponding GBB implementation are given in Appendices D and E respectively.

Parallel search evaluation
This section evaluates the parallel performance of the Ordered and Unordered generic skeletons. It starts by outlining the benchmark instances (Section 7.1) and experimental platform (Section 7.2). We establish a baseline for the overheads of the generic skeletons by comparing them with a state of the art C++ implementation (Section 7.3) of Maximum Clique. Finally we investigate the extent to which the Ordered skeleton preserves the runtime and repeatability properties (Section 2.3) for the three benchmarks.
The datasets supporting this evaluation are available from an open access archive [4].

Benchmark instances and configuration
This section specifies how the benchmarks are configured and the instances used. We aim for test instances with a runtime of less than an hour while avoiding short sequential runtimes that do not benefit from parallelism. These instances ensure we (a) have enough parallelism and (b) can perform repeated measurements while keeping computation times manageable.

Maximum Clique
The Maximum Clique implementation (Section 2) measured uses the bitset encoded algorithm of San Segundo et al.: BBMC [52,55]. This algorithm makes use of bit-parallel operations to improve performance in the greedy colouring step (orderedGenerator in the GBB API), and ours is the first known distributed parallel implementation of BBMC. We do not use the additional recolouring algorithm [52]. Maximum Clique is one example where prunes can propagate to-the-right (Section 4.3) and we make use of this in the implementation. The instances are given in Table 3 and come from the second DIMACS implementation challenge [22]. For many applications, search heuristics are weak and tend to perform badly near the root of the search tree [20]. To overcome this limitation, the Maximum Clique example makes use of a non left-to-right search ordering in order to make the search as diverse as possible. The new order is based on depth-bounded discrepancy search [63], with the algorithm extended to work on n-ary trees by counting the nth child as n discrepancies. An example of the discrepancy search ordering is shown in Fig. 5. This further shows the generality of the skeleton to maintain the properties even when custom search orderings are used.

Table 4
Knapsack instances.

Instance               Type                            Items
…                      Uncorrelated span(2,10)         50
knapPI_12_50_1000_23   Weakly correlated span(2,10)    50
knapPI_12_50_1000_34   Weakly correlated span(2,10)    50
knapPI_13_50_1000_10   Strongly correlated span(2,10)  50
knapPI_13_50_1000_32   Strongly correlated span(2,10)  50
knapPI_14_100_1000_88  Multiple strongly correlated    100
knapPI_14_50_1000_64   Multiple strongly correlated    50
knapPI_15_500_1000_47  Profit ceiling                  500
knapPI_15_50_1000_20   Profit ceiling                  50
knapPI_16_50_1000_62   Circle                          100
knapPI_16_50_1000_21   Circle                          50

Fig. 5. Discrepancy search priorities (lower is higher priority).
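The n-ary discrepancy ordering of Fig. 5 can be approximated by ranking task paths by their total discrepancy count, with the i-th child (0-indexed, 0 being the heuristically best child) contributing i discrepancies. The path representation and function names are assumptions.

```haskell
import Data.List (sortOn)

-- a task is identified by its path of child indices from the root
type Path = [Int]

-- the i-th child counts as i discrepancies, so a path's discrepancy
-- count is simply the sum of its child indices
discrepancies :: Path -> Int
discrepancies = sum

-- order tasks by discrepancy count, ties broken left to right
prioritise :: [Path] -> [Path]
prioritise = sortOn (\p -> (discrepancies p, p))
```

Priorities assigned this way favour paths that deviate rarely, and only slightly, from the heuristic's first choice, which is the intent of discrepancy search.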

The 0/1 Knapsack problem
The 0/1 Knapsack implementation (Section 6.1) uses ascending profit density ordering as the search heuristic and a greedy fractional knapsack implementation for calculating the upper bound. As with Maximum Clique, we take advantage of the prune to-the-right optimisation. The bound is uninitialised at the start of the search. This simple implementation does not match the performance of state of the art solvers.
Although the knapsack problem is NP-hard, many knapsack instances are easily solved on modern hardware. Methods exist for generating hard knapsack instances [44]. We make use of the subset of the pre-generated hard instances [43] shown in Table 4.

Travelling Salesperson
The final application is the Travelling Salesperson problem (Section 6.2). A simple implementation is used that assumes no ordering on the candidate cities and uses Prim's minimum spanning tree algorithm [46] to construct a lower bound. The initial bound comes from the result of a greedy nearest neighbour search.
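The greedy nearest-neighbour seeding can be sketched as follows; the distance-function representation, city indexing and function name are assumptions.

```haskell
import Data.List (delete, minimumBy)
import Data.Ord  (comparing)

-- Visit cities greedily: always move to the nearest unvisited city, then
-- return to the start. The tour length seeds the initial (upper) bound.
nearestNeighbourTour :: (Int -> Int -> Double) -> [Int] -> Double
nearestNeighbourTour _    []             = 0
nearestNeighbourTour dist (start : rest) = go start rest 0
  where
    go cur []        acc = acc + dist cur start  -- close the tour
    go cur unvisited acc =
      let next = minimumBy (comparing (dist cur)) unvisited
      in  go next (delete next unvisited) (acc + dist cur next)
```

Any feasible tour length is a valid initial bound for a minimising search; nearest neighbour is simply cheap to compute and usually far better than starting unbounded.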
Like the Knapsack application, this is a proof of concept implementation, based on simple branching and pruning functions, and does not perform as well as current state of the art solvers which go beyond simple branch and bound search.
Problem instances are drawn from two sources: the TSPLib instances [50] and random instances from the DIMACS TSP challenge instance generator [21]. A list of benchmarks is given in Table 5.

Measurement platform and protocols
The evaluation is performed on a Beowulf cluster consisting of 17 hosts each with dual 8-core Intel Xeon E5-2640v2 CPUs (2 GHz), 64 GB of RAM and running Ubuntu 14.04.3 LTS. Exclusive access to the machines is used and we ensure there is always at least one physical core per thread. Threads are assigned to cores using the default mechanisms of the GHC runtime system.
The skeleton library and applications are written in Haskell using the HdpH distributed-memory parallelism framework as outlined in Section 5.6. Specifically we use the GHC 8.0.2 Haskell compiler, and dependencies are pulled from the stackage lts-7.9 repository or fixed commits on GitHub. The complete source code for the experiments is available at: http://dx.doi.org/10.5281/zenodo.254088.
In all experiments, each HdpH node (runtime) is assigned n threads and manages n − 1 workers that execute the search. Unless otherwise specified, all results are based on the mean of ten runs. The spawnDepth is always set to one, so that a task is created for each child of the root node. This setting provided good performance for most instances, although it may not be optimal for every individual instance.

Comparison with a class-leading C++ implementation
To establish a performance baseline for the generic Haskell skeletons we compare the sequential (single worker) performance of the skeletons with a state of the art C++ implementation of the Maximum Clique benchmark [37]. Only instances with a (skeleton) sequential runtime of less than one hour are considered.
The C++ results were gathered on a newer system featuring a dual Intel Xeon E5-2697A v4, 512 GBytes of memory and Ubuntu 16.04, and were compiled using g++ 5.4. A single sequential sample is used for comparison. Table 6 compares the C++ implementation to the Ordered skeleton. To keep the skeleton execution as close to a fully sequential implementation as possible, work is generated only at the top level and is searched in decreasing degree order. As there is no communication, the HdpH node is assigned a single thread and a single worker. As expected, the Ordered skeleton is between 1.9 and 6.2 times slower than the hand crafted C++ search. A primary contributor to the slowdown is Haskell execution time, widely accepted to be a factor of between 2 and 10, though often lower for symbolic computations like these. The slowdown is due to Haskell's aggressive use of immutable heap structures, garbage collection and its lazy evaluation model. The generality of the skeletons means that they use computationally expensive techniques like higher-order functions and polymorphism. Finally, our skeleton implementations have not been extensively hand optimised, as the C++ implementation has.
The remainder of the evaluation uses speedup relative to the one worker Haskell implementation. We argue that the underlying performance in the sequential (one worker) is sufficiently good for the results to be credible.

Sequential Bound & Non-increasing Runtimes
As Sequential Bound and Non-increasing Runtimes are both runtime properties we evaluate them together. We investigate the relative speedup, or strong scaling, of the Ordered and Unordered skeletons using between 1 and 200 workers for each benchmark. If Sequential Bound holds then the speedup will be greater than or equal to 1, and if Non-increasing Runtimes holds the curves should be non-decreasing. Non-increasing Runtimes is still maintained even when a speedup curve becomes flat: we simply do not benefit from additional workers. Fig. 6 shows the speedup curves for the Maximum Clique Ordered and Unordered skeletons. Scaling curves are not given for the brock800 series and the p_hat700-3 instances as instances with a one worker baseline of greater than one hour are not considered. For all Maximum Clique instances both skeletons preserve Sequential Bound, i.e. no configuration has greater runtime than the single worker case. The skeletons achieve good parallel speedups for symbolic computations, delivering a maximum parallel efficiency of around 50%.
The Ordered skeleton maintains Non-increasing Runtimes for most instances, the exceptions being brock400_4, p_hat1000-2 and MANN_a45, shown by decreasing speedup curves. brock400_4 exhibits the largest slowdown between 128 and 200 workers, however the runtime at these scales is tiny (4.5 s), and we attribute the slowdown to a combination of (small) parallelism overheads and variability. While the final runtimes for p_hat1000-2 and MANN_a45, once parallelism stops being effective, are larger (15 s and 27 s respectively), the mean runtimes for 64, 128 and 200 workers are never more than 2.5 s apart, and this we again attribute to parallelism overheads rather than search ordering issues. Even for instances such as MANN_a45, where there is limited performance benefit in adding workers due to a large maximum clique causing increased amounts of pruning, using additional cores never increases runtime significantly.
While the mean speedups reported in Fig. 6 suggest that the Unordered skeleton also preserves Non-increasing Runtimes they disguise the huge runtime variance of the searches. Fig. 7 illustrates this by showing each individual speedup sample from the brock400_3 instance. The unpredictable speedups for the Unordered skeleton are in stark contrast to the Ordered skeleton. We attribute the high variance of the Unordered skeleton to the interaction between random scheduling and search ordering.
The speedup curves for the Knapsack and TSP applications are given in Figs. 8 and 9 respectively. Again any instances with sequential runtimes greater than an hour are excluded.
Non-increasing Runtimes is maintained by the Ordered skeleton in all Knapsack and Travelling Salesperson cases except ulysses16 which shows a slowdown when moving from 32 to 64 workers. As with the brock400_4 slowdown, the runtime at this scale is small (10 s), and deviations are likely caused by parallelism overheads rather than search ordering effects.
While it is difficult to directly compare Ordered and Unordered executions due to search ordering effects, in some instances the Unordered skeleton is more efficient than the Ordered skeleton. This is caused by a variety of factors, including reduced warm-up time (no upfront work generation is required), distributed work stealing that avoids a single-node bottleneck, and the potential for randomness to find a solution more quickly than the fixed search order of the Ordered skeleton.

Repeatability
The Repeatability property aims to ensure that multiple runs of the same configuration have similar runtimes. To give a normalised variability measure we use relative standard deviation (RSD), also known as the coefficient of variation, i.e. the ratio of the standard deviation to the mean [17]. To compare the variability of the benchmark instances using the Ordered and Unordered skeletons we plot the RSD as a cumulative distribution function (CDF) for each worker configuration. Here the key metric is how quickly the curve reaches 1, i.e. the point that covers all RSD values. A disadvantage of this type of plot is that it is not robust to outliers. These plots contain all benchmarks, including those where the sequential run timed out but a parallel run completed in less than an hour. Benchmarks with mean runtime less than 5 s are excluded, as a high RSD is expected for them. Fig. 10 shows the CDF plot for both skeletons for all Maximum Clique benchmarks run with 1, 8, 64 and 200 workers. With a single worker the maximum RSD of both skeletons is less than 3%, showing that they provide repeatable results. This is expected, as in the single worker case the Unordered skeleton behaves like the Ordered skeleton, following a fixed left-to-right search order. With multiple workers the Ordered skeleton guarantees better repeatability than the Unordered skeleton, with median RSDs given in Table 7. For the 64 worker case the long tails are caused by outliers in the data, and a low RSD is maintained in almost 90% of cases. The issues with identifying outliers are discussed in Appendix A. The cause of these outliers is unknown but, given the large discrepancy, is probably spurious behaviour on the system rather than a manifestation of search order anomalies.
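The RSD measure used throughout this evaluation is straightforward to compute; a minimal sketch in Haskell (using the population standard deviation, expressed as a percentage of the mean):

```haskell
-- Relative standard deviation (RSD), also known as the coefficient
-- of variation: the (population) standard deviation divided by the
-- mean, expressed here as a percentage.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [(x - m) ^ (2 :: Int) | x <- xs])
  where m = mean xs

rsdPercent :: [Double] -> Double
rsdPercent xs = 100 * stdDev xs / mean xs
```

For example, runtimes of 2, 4, 4, 4, 5, 5, 7 and 9 seconds have mean 5 s and standard deviation 2 s, giving an RSD of 40%.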
Figs. 11 and 12 show the CDF plots for the Knapsack and Travelling Salesperson benchmarks, and the results are very similar to those for Maximum Clique. With a single worker both Ordered and Unordered skeleton implementations deliver highly repeatable results, i.e. a maximum RSD of less than 3%. The Knapsack application has poor repeatability in the Unordered skeleton cases, with half of them suffering an RSD of over 100%. As with Maximum Clique, outlying data points make the Ordered skeleton appear to perform badly on one or two of the TSP benchmarks in the 64 and 200 worker cases, as discussed in Appendix A. Nonetheless the Ordered skeleton maintains a low RSD.

Conclusion
Branch and bound searches are an important class of algorithms for solving global optimisation and decision problems. However, they are difficult to parallelise due to their sensitivity to search order, which can cause highly variable and unpredictable parallel performance. We have illustrated these parallel search anomalies and propose that replicable search implementations should avoid them by preserving three key properties: Sequential Bound, Non-increasing Runtimes and Repeatability (Section 2). The paper develops a generic parallel branch and bound skeleton and demonstrates that it meets these properties. We defined a novel formal model for general parallel branch and bound backtracking search problems (BBM) that is parametric in the search order and models parallel reduction using small-step operational semantics (Section 3). The generality of the model was shown by specifying three benchmarks: Maximum Clique, 0/1 Knapsack and Travelling Salesperson.
We presented a Generic Branch and Bound (GBB) API as a set of higher order functions (Section 4). The GBB API conforms to the BBM and its generality was shown by using it to implement the three benchmarks. The Maximum Clique implementation is the first distributed-memory parallel implementation of San Segundo's bit parallel Maximum Clique algorithm (BBMC) [52].
We factored the elaborate parallel search behaviours into a pair of sophisticated algorithmic skeletons for distributed memory architectures (Section 5). While the Unordered skeleton does not guarantee the parallel search properties, the Ordered skeleton does. For example, to guarantee Sequential Bound, one thread is assigned to follow the sequential search order.
We have evaluated the parallel performance of the skeletons with 40 instances of the three benchmark searches (Tables 3-5) on a cluster with 17 hosts and up to 200 workers. The sequential performance of the generic skeletons, implemented in Haskell, is between 1.9 and 6.2 times slower than a state of the art Maximum Clique solver implemented in C++. The slowdown is primarily due to the relative execution speeds of Haskell and C++, but also due to the generality of the skeletons and the lack of hand optimisation (Section 7.3). We evaluated the properties using speedups relative to the sequential runtime of the generic skeletons. We find that the Ordered skeleton preserves the Sequential Bound property for all benchmark instances with non-trivial runtimes. In contrast the Unordered skeleton violates the property for 5 TSP instances (Section 7.4). The Ordered skeleton preserves the Non-increasing Runtimes property for all benchmark instances with non-trivial runtimes. In contrast the Unordered skeleton violates the property for many instances of all three benchmarks (Section 7.4 and Fig. 7). The Ordered skeleton delivers far more repeatable performance than the Unordered skeleton, with a median relative standard deviation of 1.78% vs 5.56%, 1.83% vs 87.56% and 1.96% vs 8.61% over all Maximum Clique, Knapsack and TSP instances respectively (Section 7.5). In ongoing work we are developing more general and higher performance search skeletons in parallel C++ [5].

Appendix A. Outliers in runtime measurements

Table A.8 illustrates some of the issues with outliers in the runtime measurements. The dataset is from the rand_34_22137 TSP instance with 32 workers. The Ordered skeleton runtimes have a potential outlier (in bold) with a runtime 237 s greater than any of the other runtimes. The other 9 runtimes have a range of just 15.5 s (i.e. 207.3 − 191.8). In this case it is almost certain that the outlier is not due to some search order effect, but rather to some external system factor, e.g.
network contention or some daemon process running. We have not been able to reproduce this effect in additional experiment runs. The Unordered skeleton runtimes also appear to have an outlier (in bold) with a runtime 316.1 s greater than any of the other runtimes. It is, however, harder to be certain that this is an external factor, as the variability of the other 9 measurements is far higher, i.e. 183 s (303.4 − 120.4).
As a result of the difficulties of identifying them we have not attempted to eliminate any outliers, even where there is a strong case. That is, the cumulative distribution function (CDF) plots in Figs. 10-12 show all measurements recorded. In addition we use median relative standard deviation (RSD) to eliminate the effects of any outliers when comparing the repeatability of the Ordered and Unordered skeletons (Table 7).

Appendix B. 0/1 Knapsack - BBM Formalisation
The input parameters for the 0/1 knapsack problem consist of a fixed capacity C and a set of m items I = {i_1, . . . , i_m}. For each item i_j ∈ I we define p : I → N and w : I → N to be projection functions giving the profit and weight of an item respectively. If the input is considered as a vector of items, then the problem is equivalent to constructing an output vector X where x_j ∈ {0, 1} encodes the inclusion or exclusion of each item i_j in the input vector:

maximise Σ_{j=1}^{m} p(i_j) x_j subject to Σ_{j=1}^{m} w(i_j) x_j ≤ C

The BBM model uses finite words to represent tree branches.
We can trivially extend the profit and weight functions, p and w, to functions on words (p : I* → N and w : I* → N) by summing over the components of a word. An unordered tree generator, g : I* → 2^I, extends a word with each item that has not yet been included and that still fits within the remaining capacity:

g(i_{j_1} . . . i_{j_k}) = { i ∈ I \ {i_{j_1}, . . . , i_{j_k}} | w(i_{j_1} . . . i_{j_k}) + w(i) ≤ C }

One ordering heuristic is profit density, and an ordered generator simply orders the results of the unordered generator by ascending profit density.
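The two generators can be sketched in Haskell. This is an illustration using hypothetical types, not the paper's code: a node is the word of items chosen so far, and candidate children are the remaining items that still fit within the capacity.

```haskell
import Data.List (sortBy, (\\))
import Data.Ord (comparing)

-- An item is a (profit, weight) pair (hypothetical representation).
type Item = (Double, Double)

-- Unordered generator: remaining items that still fit.
unorderedGen :: Double -> [Item] -> [Item] -> [Item]
unorderedGen cap items word =
  [ i | i <- items \\ word, sum (map snd word) + snd i <= cap ]

-- Ordered generator: the same candidates, sorted by ascending
-- profit density, as described in the text.
orderedGen :: Double -> [Item] -> [Item] -> [Item]
orderedGen cap items word =
  sortBy (comparing density) (unorderedGen cap items word)
  where density (p, w) = p / w
```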
As the aim is to maximise total profit, the objective function, f : I* → N, is simply p, the profit function on words. The ordering ⊑ on objective values is given by the natural ordering ≤.
Finally, the pruning predicate, p : N × I* → {0, 1}, is defined as follows, where fractionalKnapsack greedily solves the residual knapsack problem using continuous relaxation, yielding an optimal solution of the relaxation and hence an upper bound on the achievable profit: the search at word u is pruned when the profit of u plus the fractionalKnapsack bound on the remaining items and capacity does not exceed the incumbent bound b.
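The greedy continuous relaxation can be sketched in a few lines of Haskell. This is an illustrative sketch under assumed (profit, weight) pairs, not the paper's implementation: items are taken in decreasing profit-density order, with the last item taken fractionally if it does not fit.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

-- Upper bound on the profit obtainable from the given items
-- within the given residual capacity, via the continuous
-- (fractional) relaxation of 0/1 knapsack.
fractionalKnapsack :: Double -> [(Double, Double)] -> Double
fractionalKnapsack cap items = go cap sorted
  where
    sorted = sortBy (comparing (Down . density)) items
    density (p, w) = p / w
    go c ((p, w) : rest)
      | c <= 0    = 0
      | w <= c    = p + go (c - w) rest   -- item fits entirely
      | otherwise = p * (c / w)           -- take a fraction only
    go _ [] = 0
```

For example, with capacity 7 and items of (profit, weight) (60, 5), (50, 5) and (40, 5), the bound takes the first item whole and two fifths of the second, giving 80.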

Appendix C. 0/1 Knapsack - GBB and Skeleton Representation
A Knapsack implementation using the GBB API and the Unordered skeleton is shown in Listing 3. The implementation comes directly from BBM, generating only candidate items which do not exceed the capacity constraint and using continuous relaxation to compute an upper bound. The Unordered.search function on line 32 invokes the Unordered skeleton implementation with an empty root node. The Ordered skeleton invocation is very similar. A higher performance version supporting distributed execution is used for evaluating the skeletons in Section 7.1.2.

Appendix D. Travelling Salesperson - BBM formalisation
The TSP input consists of a set C of n cities and a metric on C, given by a symmetric non-negative distance function d : C × C → R. Tours are modelled as words over C, where the word t = c_1 c_2 . . . c_k ∈ C* represents a (partial) tour starting at c_1 and ending at c_k if all cities c_i are pairwise distinct. The tour t is complete if k = n, that is, if every city in C is visited exactly once.
We generalise the distance function d to words in C* in the obvious way:

d(c_1 c_2 . . . c_k) = Σ_{i=1}^{k−1} d(c_i, c_{i+1})

The (unordered) tree generator function, g : C* → 2^C, extends a partial tour with each city that has not been visited yet, enumerating all possible tours. Due to symmetries (rotations change the start, reflections change the direction) each tour is enumerated 2n times. This symmetry can be broken by fixing the starting city.
g(c_1 . . . c_k) = C \ {c_1, . . . , c_k}

The objective function, f : C* → R, maps complete tours to the total distance travelled (including the distance from the last city back to the starting city) and incomplete tours to infinity:

f(c_1 . . . c_k) = d(c_1 . . . c_k) + d(c_k, c_1) if k = n, and f(c_1 . . . c_k) = ∞ otherwise

Formally, the co-domain of f should be R ∪ {∞}; we ignore this detail. In practice, ∞ can be replaced by any real number larger than the total distance of the longest possible tour.
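The word-distance d and the objective f translate directly into Haskell. This is a sketch with Int city labels and an assumed distance function, not the paper's implementation:

```haskell
-- The distance function d extended to words: the sum of the
-- distances along consecutive legs of a (partial) tour.
pathLength :: (Int -> Int -> Double) -> [Int] -> Double
pathLength d cs = sum (zipWith d cs (tail cs))

-- The objective f: a complete tour (length n) maps to its total
-- distance including the closing leg back to the start; any
-- incomplete tour maps to infinity.
objective :: (Int -> Int -> Double) -> Int -> [Int] -> Double
objective d n tour
  | length tour == n, (c1 : _) <- tour =
      pathLength d tour + d (last tour) c1
  | otherwise = 1 / 0  -- incomplete tours are "infinitely" long
```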
We aim to minimise the objective function with respect to the standard order ≤ on the reals; in BBM this corresponds to maximising f with respect to the dual order ≥ on R, where ∞ is the minimal element w.r.t. ≥.
Finally, the pruning predicate, p : R × C* → {0, 1}, prunes a partial tour if the distance travelled along the tour plus the weight of a minimum spanning tree covering the remaining cities exceeds the distance of the current shortest tour:

p(b, c_1 . . . c_k) = 1 if d(c_1 . . . c_k) + weightMST(C \ {c_2, . . . , c_{k−1}}) ≥ b, and 0 otherwise

Here, weightMST(C \ {c_2, . . . , c_{k−1}}) is the weight of a minimum spanning tree covering the not-yet visited cities as well as the starting city c_1 and the most recently visited city c_k. The weight of this MST is a lower bound on the distance covered in the shortest partial tour from c_k through the not-yet visited cities in C \ {c_1, . . . , c_k} and back to the start c_1.
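The MST weight used by the bound can be computed with a simple Prim's algorithm. The sketch below uses Int city labels and an assumed distance function; it is an illustration, not the paper's implementation:

```haskell
import Data.List (delete)

-- Weight of a minimum spanning tree over a set of cities:
-- repeatedly attach the city closest to the tree built so far
-- (Prim's algorithm, O(n^3) here for simplicity).
weightMST :: (Int -> Int -> Double) -> [Int] -> Double
weightMST _ []       = 0
weightMST d (c : cs) = go [c] cs 0
  where
    go _    []   acc = acc
    go tree rest acc =
      -- cheapest edge from the tree to an unattached city
      let (w, v) = minimum [ (d u v', v') | u <- tree, v' <- rest ]
      in  go (v : tree) (delete v rest) (acc + w)
```

Since any tour through a set of cities contains a spanning tree of those cities, the MST weight never overestimates the remaining tour distance, which makes the bound admissible.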

Appendix E. Travelling Salesperson - GBB and skeleton representation
A TSP implementation using the GBB API and Unordered skeleton is shown in