Learning Interpretable Error Functions for Combinatorial Optimization Problem Modeling

In Constraint Programming, constraints are usually represented as predicates allowing or forbidding combinations of values. However, some algorithms exploit a finer representation: error functions. Their usage comes at a price, though: it makes problem modeling significantly harder. Here, we propose a method to automatically learn an error function corresponding to a constraint, given a function deciding whether assignments are valid or not. This is, to the best of our knowledge, the first attempt to automatically learn error functions for hard constraints. Our method uses a variant of neural networks we name Interpretable Compositional Networks, which, unlike regular artificial neural networks, yields interpretable results. Experiments on five different constraints show that our system can learn functions that scale to high dimensions, and can learn fairly good functions over incomplete spaces.


Introduction
Twenty years separate Freuder's papers [1997] and [2018], both about the grand challenges Constraint Programming (CP) must tackle "to be pioneer of a new usability science and to go on to engineering usability" [Freuder, 2007]. To address the lack of a "Model and Run" approach in CP [Puget, 2004; Wallace, 2003], several languages have been developed since the late 2000s, such as ESSENCE [Frisch et al., 2008], XCSP [Boussemart et al., 2016] or MiniZinc [Nethercote et al., 2007]. However, they require users to have deep expertise in global constraints and to know how well these constraints, and their associated mechanisms such as propagators, suit the solver. We are still far from the original Holy Grail of CP: "the user states the problem, the computer solves it" [Freuder, 1997].
This paper makes a contribution to automatic CP problem modeling. We focus on Error Function Satisfaction and Optimization Problems, defined in the next section. Compared to classical Constraint Satisfaction and Constrained Optimization Problems, they rely on a finer structure of the problem: the cost function network, which is, for this work, an ordered structure over invalid assignments that constraint solvers can exploit efficiently to improve the search.
In this paper, we propose a method to learn error functions automatically, a direction that, to the best of our knowledge, has not been explored in Constraint Programming. We focus here on the "easy-to-use" aspects of Constraint Programming.
To sum up, the main contributions of this paper are:
1. to give the first formal definition of Error Function Satisfaction Problems and Error Function Optimization Problems,
2. to introduce Interpretable Compositional Networks, a variant of neural networks yielding interpretable results,
3. to propose an Interpretable Compositional Network architecture to learn error functions, and
4. to provide a proof of concept by learning Interpretable Compositional Network models of error functions with a genetic algorithm, showing that most learned models give scalable functions and remain fairly effective when trained on incomplete sets.

This is not necessarily the case: it depends on the cost structure. For instance, the classical cost structure S_t/f = ⟨{true, false}, ∧, true, false⟩ makes the cost function network equivalent to a classical constraint network, thus dealing with hard constraints.
Here, we consider particular cost functions that represent hard constraints only, by considering the additive cost structure S_+ = ⟨R, +, 0, ∞⟩. The additive cost structure produces useful cost function networks capturing problems such as Maximum Probability Explanation (MPE) in Bayesian networks and Maximum A Posteriori (MAP) problems in Markov random fields [Hurley et al., 2016].
In this paper, an error function is a cost function defined in a cost function network with the additive cost structure S_+. Intuitively, error functions are preferences over invalid assignments. Let f_c be an error function representing a constraint c, and x_c be an assignment of variables in the scope of c.

The goal of this paper is not to study the advantages of such cost function networks over regular constraint networks. Without formally defining EFSP and EFOP problems, some studies illustrate that solvers (in particular, metaheuristics) can exploit this structure efficiently, leading to state-of-the-art experimental results, both in sequential [Codognet and Diaz, 2001] and parallel solving [Caniou et al., 2015]. In addition, our Experiment 3 shows that models of the classic AllDifferent constraint with either hand-crafted or learned error functions clearly outperform a model based on a regular constraint network in terms of runtime.
Let x be a variable assignment, and denote by x_c the projection of x over the variables in the scope of a constraint c. We can now define the EFSP and EFOP problems.
Problem: ERROR FUNCTION SATISFACTION PROBLEM
Input: A cost function network ⟨V, D, F, S_+⟩.
Question: Does a variable assignment x exist such that ∀ f_c ∈ F, f_c(x_c) = 0 holds?

Problem: ERROR FUNCTION OPTIMIZATION PROBLEM
Input: A cost function network ⟨V, D, F, S_+⟩ and an objective function o.
Question: Find a variable assignment x maximizing or minimizing the value of o(x) such that ∀ f_c ∈ F, f_c(x_c) = 0 holds.
With the system we propose in this paper, users provide the usual constraint network ⟨V, D, C⟩, and it computes the equivalent cost function network ⟨V, D, F, S_+⟩. The learned error functions composing the set F are independent of the number of variables in the constraints' scopes, and are expressed in an interpretable way: users can understand these functions and easily modify them at will. This way, users can have the power of EFSP and EFOP with the same modeling effort as for CSP and COP.

Related works
This work belongs to one of the three directions identified by Freuder [2007]: Automation, i.e., "automating efficient and effective modeling and solving". To the best of our knowledge, few efforts have been made on the modeling side.
Another of these three directions, slightly related to ours, is Acquisition, described by Freuder as "acquiring a complete and correct representation of real problems". Remarkable efforts on this topic have been made by Bessiere's research team, for instance with constraint learning by induction from positive and negative examples [Bessiere et al., 2005], with interactive queries asked to users, and with constraint network learning also through interactive queries [Bessiere et al., 2013].
Model Seeker [Beldiceanu and Simonis, 2012] is a passive learning system taking positive examples only, which are certainly easier for users to provide. It transforms examples into data adapted to the Global Constraint Catalog [Beldiceanu et al., 2007], then generates and simplifies candidates by eliminating dominated ones. Model Seeker is particularly efficient at finding a good inner structure of the target constraint network. Teso [2019] gives a good survey of Constraint Learning, with this interesting remark: "A major bottleneck of [constraint-based problem modeling] is that obtaining a formal constraint theory is non-obvious: designing an appropriate, working constraint satisfaction or optimization problem requires both domain and modeling expertise. For this reason, in many cases a modeling expert is hired and has to interact with domain expert to acquire informal requirements and turn them into a valid constraint theory. This process can be expensive and time-consuming." We can consider that Constraint Acquisition, or Constraint Learning, focuses on modeling expertise and puts domain expertise in the background: users would not be able to understand and modify a learned model without the help of a modeling expert. The goal of these systems is mainly to simplify the interaction between domain and modeling experts.
Our work takes the opposite direction: we focus on domain expertise and put modeling expertise in the background. With our system, users always keep control over the constraints' representation, which can be modified at will to fit needs stemming from their domain expertise. Constraint Implementation Learning is what best describes this research topic.

Method design
The main result of this paper is a method to automatically learn an error function representing a constraint, in order to ease the modeling of EFSP/EFOP problems. We are tackling a regression problem, since the goal is to find a function that outputs a target value. Before diving into the description of our method, we need to introduce some essential notions.

Definitions
We propose a method to automatically learn an error function from the concept of a constraint. As described in Bessiere et al. [2017], the concept of a constraint is a Boolean function that, given an assignment x, outputs true if x satisfies the constraint, and false otherwise. Concepts are the predicate representation of constraints referred to at the beginning of Section 2.
Our method learns error functions in a supervised fashion, searching for a function computing the Hamming cost of each assignment. The Hamming cost of an assignment x is the minimum number of variables in x to reassign to get a solution, i.e., a variable assignment satisfying the considered constraint. If x is a solution, then its Hamming cost is 0.
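To fix ideas, the exact Hamming cost can be computed by brute force over a small complete space. The following is a minimal Python sketch (our system itself is implemented in C++); the function names and the AllDifferent concept used for illustration are ours:

```python
from itertools import product

def hamming_cost(x, concept, domain):
    """Exact Hamming cost of assignment x: the minimum number of variables
    to reassign so that the concept holds. Brute force, so only practical
    on small constraint assignment spaces."""
    if concept(list(x)):
        return 0
    best = len(x)
    # Enumerate all assignments and keep the distance to the closest solution.
    for y in product(domain, repeat=len(x)):
        if concept(list(y)):
            best = min(best, sum(1 for a, b in zip(x, y) if a != b))
    return best

# AllDifferent concept on 3 variables over domain {1, 2, 3}:
all_diff = lambda x: len(set(x)) == len(x)
print(hamming_cost([1, 1, 1], all_diff, [1, 2, 3]))  # 2: two variables must change
```

Since a solution has cost 0 by definition, the function short-circuits before enumerating anything.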
Given the number of variables of a constraint and their domain, the constraint assignment space is the set of couples ( x, b) where x is an assignment and b the Boolean output of the concept applied on x. Such constraint assignment spaces can be generated from concepts. These spaces are said to be complete if and only if they contain all possible assignments, i.e., all combinations of possible values of variables in the scope of the constraint. Otherwise, spaces are said to be incomplete.
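Generating such a space from a concept is straightforward; here is a minimal sketch (the helper name and sampling scheme for the incomplete case are ours, and simpler than the Latin hypercube sampling used in our experiments):

```python
import random
from itertools import product

def assignment_space(concept, n_vars, domain, complete=True, k=100, seed=0):
    """Build a constraint assignment space: couples (x, b) where b is the
    concept's Boolean output on x. A complete space enumerates every
    combination of values; an incomplete one samples k assignments."""
    if complete:
        return [(list(x), concept(list(x))) for x in product(domain, repeat=n_vars)]
    rng = random.Random(seed)
    return [(x, concept(x))
            for x in ([rng.choice(domain) for _ in range(n_vars)] for _ in range(k))]

# Complete space of AllDifferent over 3 variables with domain {1, 2, 3}:
space = assignment_space(lambda x: len(set(x)) == len(x), 3, [1, 2, 3])
print(len(space), sum(b for _, b in space))  # 27 assignments, 6 solutions
```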
In this work, we consider an error function to be a (non-linear) combination of elementary operations. Complete spaces are intuitively good training sets, since it is easy to compute the exact Hamming cost of their elements. We also consider assignments from incomplete spaces, whose Hamming costs are approximated with respect to a subset of solutions in the constraint assignment space, in case the exact Hamming cost function is unknown.

Main result
To learn an error function as a non-linear combination of elementary operations, we propose a network inspired by Compositional Pattern-Producing Networks (CPPN). CPPNs [Stanley, 2007] are themselves a variant of artificial neural networks. While neurons in regular neural networks usually contain sigmoid-like functions only (such as ReLU, i.e., Rectified Linear Unit), CPPN neurons can contain many other kinds of functions: sigmoids, Gaussians, trigonometric functions, and linear functions, among others. CPPNs are often used to generate 2D or 3D images by applying the function modeled by the CPPN to each pixel individually, instead of considering all pixels at once. This simple trick allows the learned CPPN model to produce images of any resolution.
We propose our variant by taking two principles from CPPNs: having neurons containing one operation among many possible ones, and handling inputs in a size-independent fashion. Due to their interpretable nature, we named our variant Interpretable Compositional Networks (ICN). In this paper, our ICNs are composed of four layers, each with a specific purpose and each composed of neurons applying one unique operation. All neurons of a layer are linked to all neurons of the next layer. The weight on each link is purely binary: its value is either 0 or 1. This restriction is crucial to obtain interpretable functions. A weight of 1 between neurons n_1 and n_2 means that neuron n_2 of layer l + 1 takes as input the output of neuron n_1 of layer l; a weight of 0 means that n_2 discards the output of n_1.
Here is our method workflow in 4 points: 1. Users provide a regular constraint network ⟨V, D, C⟩, where C is a set of concepts representing constraints.
2. For each constraint concept c, we generate its ICN input space X, which is either a complete or incomplete constraint assignment space. Those input spaces are our training sets. If the space is complete, then the Hamming cost of each assignment can be pre-computed before learning our ICN model. Otherwise, the incomplete space is composed of randomly drawn assignments and only an approximation of their Hamming cost can be pre-computed.
3. We learn the weights of our ICN model in a supervised fashion, with the following loss function:

loss(ICN) = ( Σ_{x ∈ X} |ICN(x) − Hamming(x)| ) / |X| + R(ICN)    (1)

where X is the constraint assignment space, ICN(x) the output of the ICN model given x ∈ X as input, Hamming(x) the pre-computed Hamming cost of x (only approximated if X is incomplete), and R(ICN) a regularization term between 0 and 0.9 favoring short ICNs, i.e., with as few elementary operations as possible, such that R(ICN) = 0.9 × (number of selected elementary operations) / (maximal number of elementary operations). 4. We use hold-out test sets of assignments of larger dimensions to evaluate the quality of our learned error functions.
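Under a mean-absolute-error reading of the description above, the loss and its regularization can be sketched as follows (the function names and the toy example are ours, not part of the system):

```python
def regularization(selected_ops, max_ops=31):
    # R(ICN) in [0, 0.9]: penalizes long combinations of elementary operations.
    return 0.9 * selected_ops / max_ops

def loss(icn, X, hamming, selected_ops, max_ops=31):
    """Mean absolute deviation between the ICN output and the (possibly
    approximated) Hamming cost, plus the length regularization."""
    mad = sum(abs(icn(x) - hamming[tuple(x)]) for x in X) / len(X)
    return mad + regularization(selected_ops, max_ops)

# Toy check: an ICN perfectly matching the costs of two assignments,
# while using 3 of the 31 elementary operations.
costs = {(1, 1): 1, (1, 2): 0}
perfect = lambda x: costs[tuple(x)]
print(round(loss(perfect, [[1, 1], [1, 2]], costs, 3), 4))  # 0.0871
```

With a perfect fit, only the regularization term remains: 0.9 × 3/31 ≈ 0.0871.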
Notice we also use a hold-out validation set to fix the values of our hyperparameters, as described in Section 4.3. Figure 1 is a schematic representation of our network. It takes as input an assignment of n variables, i.e., a vector of n integers. The first layer, called the transformation layer, is composed of 18 transformation operations, each applied element-wise to each element of the input vector. Such operations are, for instance, the maximum of the i-th and (i+1)-th elements of the input vector, or the number of j-th elements of the vector smaller than the i-th element such that j > i holds. This layer is composed of both linear and non-linear operations. If an operation is selected (i.e., it has an outgoing weight equal to 1), it outputs a vector of n integers.
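For illustration, the two transformation operations mentioned above could look as follows in Python; the boundary handling of the last element, which has no successor, is our assumption, as it is not specified here:

```python
def max_with_next(x):
    # Maximum of the i-th and (i+1)-th elements; the last element is
    # kept as-is (our assumption for the boundary case).
    return [max(x[i], x[i + 1]) if i + 1 < len(x) else x[i]
            for i in range(len(x))]

def count_smaller_after(x):
    # Number of j-th elements smaller than the i-th element, with j > i.
    return [sum(1 for j in range(i + 1, len(x)) if x[j] < x[i])
            for i in range(len(x))]

print(max_with_next([3, 1, 2]))        # [3, 2, 2]
print(count_smaller_after([3, 1, 2]))  # [2, 0, 0]
```

Both take a vector of n integers and output a vector of n integers, as the transformation layer requires.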
If k transformation operations are selected, then the next layer, the arithmetic layer, gets k vectors of n integers as input. Its goal is to apply a simple arithmetic operation component-wise across the i-th elements of the k vectors, combining the previous transformations into a unique vector of n integers. We have considered only 2 arithmetic operations so far: addition and multiplication.
The output of the arithmetic layer is given to the aggregation layer. This layer crunches the whole vector into a unique integer. At the moment, the aggregation layer is composed of 2 operations: Sum, computing the sum of the input values, and Count_>0, counting the number of input values strictly greater than 0.
Finally, the computed scalar is transmitted to the comparison layer with 9 operations. Examples of these operations are the identity, or the absolute value of the input minus a given parameter. This layer compares its input with an external parameter value, or the number of variables of the problem, or the domain size, among others.
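Putting the four layers together, a feed-forward pass can be sketched as follows. This is a minimal stand-in, not the actual implementation: the operation inventory below is a small illustrative subset of the real 18 + 2 + 2 + 9 operations, and the dictionary-based encoding is ours.

```python
import math

# Illustrative subsets of each layer's operations.
TRANSFORMATION = {
    "identity": lambda x: list(x),
    "count_equal_after": lambda x: [sum(1 for j in range(i + 1, len(x))
                                        if x[j] == x[i]) for i in range(len(x))],
}
ARITHMETIC = {"add": lambda vs: [sum(c) for c in zip(*vs)],
              "mul": lambda vs: [math.prod(c) for c in zip(*vs)]}
AGGREGATION = {"sum": sum,
               "count_gt0": lambda v: sum(1 for e in v if e > 0)}
COMPARISON = {"identity": lambda s, p: s,
              "abs_minus_param": lambda s, p: abs(s - p)}

def icn_forward(x, t_names, a_name, g_name, c_name, param=0):
    """Binary weights reduce to selecting operations: possibly several in
    the transformation layer, exactly one in each of the other three."""
    vectors = [TRANSFORMATION[t](x) for t in t_names]  # k vectors of n ints
    combined = ARITHMETIC[a_name](vectors)             # one vector of n ints
    scalar = AGGREGATION[g_name](combined)             # one integer
    return COMPARISON[c_name](scalar, param)           # final error value

# Error of [1, 2, 1, 1] for AllDifferent: two variables must be reassigned.
print(icn_forward([1, 2, 1, 1], ["count_equal_after"], "add", "count_gt0", "identity"))  # 2
```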
All elementary operations in our model are generic: we do not choose them to fit one or several particular constraints. Due to the page limit, we cannot give a comprehensive list of the 18 transformation and 9 comparison operations here. Although an in-depth study of the elementary operations' properties would be interesting, it is out of the scope of this paper, whose goal is to show that learning interpretable error functions via a generic ICN is possible. Just as neural networks do not always use ReLU as their activation function, there is no reason to restrict ICNs to their current 31 elementary operations or even to a 4-layer architecture. Such elements can be changed by users to best fit their needs.
To keep models of error functions simple, the operations of the arithmetic, aggregation, and comparison layers are mutually exclusive, meaning that exactly one operation is selected in each of these layers. However, many operations from the transformation layer can be selected to compose the error function. Combined with the choice of binary weights, this yields a very comprehensible combination of elementary operations modeling an error function, making it readable and intelligible by a human being. For instance, the most frequently learned error function for AllDifferent is Count_>0 applied to |{j | x[j] = x[i] and j > i}| computed for each index i, and the one for LinearSum is the Euclidean division of Σ_{i=1}^{n} x[i] − p by the maximal domain size, with the parameter p equal to the right-hand side constant of the linear equation. Thus, once the model of an error function is learned, users can either run the network in a feed-forward fashion to compute the error function, or reimplement it directly in a programming language. Users can use our system to find error functions automatically, but they can also use it as a decision support system proposing promising error functions that they may modify and adapt by hand.
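Hard-coded in a programming language, these two learned error functions could read as follows (our actual reimplementations are in C++; taking the absolute value before the Euclidean division is our reading, so that the error stays non-negative):

```python
def all_different_error(x):
    # Count_>0 of |{j | x[j] = x[i] and j > i}|: number of positions i
    # whose value reappears later in the assignment.
    return sum(1 for i in range(len(x))
               if any(x[j] == x[i] for j in range(i + 1, len(x))))

def linear_sum_error(x, p, max_domain_size):
    # Euclidean division of sum(x) - p by the maximal domain size;
    # the absolute value keeps the error non-negative (our assumption).
    return abs(sum(x) - p) // max_domain_size

print(all_different_error([1, 2, 1, 1]))  # 2
print(linear_sum_error([5, 5, 5], 3, 5))  # (15 - 3) // 5 = 2
```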

Learning with Genetic Algorithms
Like any neural network, learning an error function through an ICN boils down to learning the values of its weights. Many of our elementary operations are discrete, hence not differentiable, so we cannot use a back-propagation algorithm to learn the ICN's weights. This is why we use a genetic algorithm for this task.
Since our weights are binary, we represent individuals of our genetic algorithm by a binary vector, each bit corresponding to one operation in the four layers and indicating whether the operation is selected to be part of the error function.
We randomly generate an initial population of 160 individuals, checking and repairing those that do not satisfy the mutual exclusivity constraint of the comparison layer. Then, we run the genetic algorithm for at most 800 generations before outputting its best individual according to our fitness function.
Our genetic algorithm is rather simple: the fitness function is the loss function of our supervised learning depicted by Equation 1. Selection is made by a tournament between 2 individuals. Variation is done by a one-point crossover operation and a one-flip mutation operation, both crafted to always produce new individuals verifying the mutual exclusivity constraint of the comparison layer. The crossover rate is fixed at 0.4, and exactly one bit is mutated for each selected individual, with a mutation rate of 1. Replacement is done by an elitist merge, keeping the best 17% of individuals from the old generation in the new one, and a deterministic tournament truncates the new population to 160 individuals. The algorithm stops before reaching 800 generations if no improvement has been made in the last 50 generations. We use the EVOLVING OBJECTS framework [Keijzer et al., 2002] to code our genetic algorithm.
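The loop above can be sketched as follows. This is a simplified stand-in for the EVOLVING OBJECTS implementation: we approximate the deterministic replacement tournament by straight truncation, use a toy fitness instead of Equation 1, and all names and default sizes in the example call are illustrative.

```python
import random

def repair(ind, excl_start, excl_end, rng):
    """Keep exactly one bit set in the mutually exclusive segment
    (here: the comparison layer bits)."""
    seg = list(range(excl_start, excl_end))
    on = [i for i in seg if ind[i]]
    keep = rng.choice(on) if on else rng.choice(seg)
    for i in seg:
        ind[i] = 1 if i == keep else 0
    return ind

def evolve(fitness, n_bits, excl, pop_size=160, max_gen=800, steady=50,
           cx_rate=0.4, elite=0.17, seed=0):
    rng = random.Random(seed)
    pop = [repair([rng.randint(0, 1) for _ in range(n_bits)], *excl, rng)
           for _ in range(pop_size)]
    best, best_fit, last_improve = None, float("inf"), 0
    for gen in range(max_gen):
        scored = sorted(pop, key=fitness)
        if fitness(scored[0]) < best_fit:
            best, best_fit, last_improve = scored[0][:], fitness(scored[0]), gen
        if gen - last_improve >= steady:      # early stop on steady generations
            break
        offspring = []
        while len(offspring) < pop_size:
            # Tournament selection between 2 individuals, for each parent.
            p1 = min(rng.sample(pop, 2), key=fitness)
            p2 = min(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < cx_rate:        # one-point crossover
                pt = rng.randrange(1, n_bits)
                child = p1[:pt] + p2[pt:]
            i = rng.randrange(n_bits)         # one-flip mutation (rate 1)
            child[i] = 1 - child[i]
            offspring.append(repair(child, *excl, rng))
        # Elitist merge: keep the best `elite` share of the old generation,
        # then truncate back to pop_size.
        k = int(elite * pop_size)
        pop = sorted(scored[:k] + offspring, key=fitness)[:pop_size]
    return best, best_fit

# Toy fitness: prefer individuals with few selected bits (shorter ICNs).
best, fit = evolve(sum, n_bits=12, excl=(10, 12), pop_size=20, max_gen=60)
print(fit)  # at least 1: exactly one comparison-layer bit is always selected
```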
Our hyperparameters, i.e., the population size, the maximal number of generations, the number of steady generations before early stopping, the crossover, mutation and replacement rates, and the size of tournaments, have been chosen using ParamILS [Hutter et al., 2009], trained for one week on one CPU over a large range of values for each hyperparameter. We use the same training instances as for Experiment 1, and new, larger instances as a hold-out validation set. These instances have been chosen because they are larger than our training instances, and each of them contains about 4∼5% of solutions, which is significantly less than the 10∼20% of solutions in the training instances.

Experiments
To show the versatility of our method, we tested it on five very different constraints: AllDifferent, Ordered, LinearSum, NoOverlap1D, and Minimum. According to XCSP specifications [Boussemart et al., 2016] 1 , those global constraints belong to four different families: Comparison (AllDifferent and Ordered), Counting/Summing (LinearSum), Packing/Scheduling (NoOverlap1D) and Connection (Minimum). Again according to XCSP specifications, these five constraints are among the twenty most popular and common constraints. We give a brief description of these five constraints below:
• AllDifferent ensures that variables must all be assigned different values.
• LinearSum ensures that the equation x_1 + x_2 + … + x_n = p holds, with the parameter p a given integer.
• Minimum ensures that the minimum value of an assignment verifies a given numerical condition. In this paper, we choose to consider that the minimum value must be greater than or equal to a given parameter p.
• NoOverlap1D considers variables as tasks, each starting at a certain time (its value) and having a given length p (its parameter). The constraint ensures that no tasks overlap, i.e., for all indices i, j ∈ {1, …, n} with i ≠ j and n the number of variables, we have x_i + p ≤ x_j or x_j + p ≤ x_i. To keep our code simple, we have considered in our system that all tasks have the same length p.
• Ordered ensures that an assignment of n variables (x_1, …, x_n) must be ordered, given a total order. In this paper, we choose the total order ≤. Thus, for all indices i, j ∈ {1, …, n}, i < j implies x_i ≤ x_j.
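As an illustration, the concepts of these five constraints can each be written as a Boolean function in a few lines of Python (the function names are ours):

```python
def all_different(x):
    return len(set(x)) == len(x)

def linear_sum(x, p):
    return sum(x) == p

def minimum(x, p):
    return min(x) >= p

def no_overlap_1d(x, p):
    # All tasks share length p: start times must be at least p apart.
    return all(abs(x[i] - x[j]) >= p
               for i in range(len(x)) for j in range(i + 1, len(x)))

def ordered(x):
    return all(x[i] <= x[i + 1] for i in range(len(x) - 1))

print(all_different([1, 2, 3]), ordered([1, 1, 2]), no_overlap_1d([0, 2, 4], 2))
# True True True
```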

Experimental protocols
We conducted three experiments, two of which require sampling. These samplings were done using Latin hypercube sampling, to ensure good diversity among drawn assignments. When we need to sample the same number k of solutions and non-solutions, we draw assignments until we get k solutions and k non-solutions. Since learning is stochastic, all learning and testing have been done 100 times. We did not re-run batches of experiments to keep the ones with the best results, as should always be the case with such experimental protocols.
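A discrete variant of Latin hypercube sampling over integer domains can be sketched as follows; this minimal stand-in (names and stratification scheme are ours, not the sampler used in our experiments) spreads each variable's values evenly across its domain:

```python
import random

def latin_hypercube_sample(k, n_vars, domain, seed=0):
    """Draw k assignments such that, in each dimension, values are spread
    evenly across the domain: stratify each column over the domain, then
    shuffle it independently."""
    rng = random.Random(seed)
    d = len(domain)
    columns = []
    for _ in range(n_vars):
        col = [domain[i % d] for i in range(k)]  # stratified column
        rng.shuffle(col)                         # independent permutation
        columns.append(col)
    return [[columns[v][s] for v in range(n_vars)] for s in range(k)]

sample = latin_hypercube_sample(6, 3, [1, 2, 3])
print(len(sample))  # 6 assignments; each value appears twice per dimension
```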
All experiments have been done on a computer with a Core i9 9900 CPU and 32 GB of RAM, running on Ubuntu 20.04. Programs have been compiled with GCC with the -O3 optimization option. Our entire system, its C++ source code, experimental setups, and the results files are accessible on GitHub 2 .

Experiment 1: scaling
The goal of this experiment is to show that learned error functions scale to high-dimensional constraints, indicating that learned error functions are independent of the size of the constraint scope.
For this experiment, error functions are learned upon a small, complete constraint assignment space, composed of about 500∼600 assignments and containing about 10∼20% of solutions. For each constraint, we run 100 error function learnings over the pre-computed complete constraint assignment space. Then, we compute the test error of these learned error functions over a sampled test set. Sampled test sets contain 10,000 solutions and 10,000 non-solutions, with 100 variables on domains of size 100, belonging to a constraint assignment space of size 100^100 = 10^200, thus vastly larger than the training spaces containing 500∼600 assignments.
We show normalized mean training and test errors: first, we compute the mean error over all assignments composing the training or test set. Then, we divide it by the number of variables composing the assignments. Indeed, a mean error of 5 on assignments with 100 variables is significantly different from the same error with 10 variables: the first indicates one error every 20 variables, the second one error every two variables.

2 https://github.com/richoux/LearningErrorFunctions/tree/1.1

Experiment 2: learning over incomplete spaces
If, for any reason, it is not possible to build a complete constraint assignment space, a robust system must be able to learn effective error functions from large, incomplete spaces where the exact Hamming cost of assignments is unknown.
In this experiment, we built pre-sampled training spaces by sampling 10,000 solutions and 10,000 non-solutions from large constraint assignment spaces of size between 10^12 and 10^13, with solution rates from 0.15% down to 2·10^−7 %. Then, we approximate the Hamming cost of each non-solution by computing its Hamming distance to the closest solution among the 10,000 sampled ones, and learn error functions on these 20,000 assignments and their estimated Hamming costs. As in Experiment 1, we run 100 error function learnings over these pre-sampled incomplete spaces, so that each learning relies on the same training set. Finally, we evaluate the learned error functions over the same test sets as in Experiment 1.
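The approximation step above can be sketched in a few lines (the function name is ours):

```python
def approx_hamming_cost(x, sampled_solutions):
    """Hamming distance between x and the closest known solution: an
    over-approximation of the exact Hamming cost, since only a subset
    of the solutions is known."""
    return min(sum(1 for a, b in zip(x, s) if a != b)
               for s in sampled_solutions)

solutions = [[1, 2, 3], [3, 1, 2]]
print(approx_hamming_cost([1, 2, 2], solutions))  # 1: closest is [1, 2, 3]
```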

Experiment 3: learned error functions to solve problems
The goal of this experiment is to assess that learned error functions can effectively be used to solve toy problems. Here, we use a local search solver to solve Sudoku.
Sudoku is a puzzle where all numbers in the same row, the same column, and the same sub-square must be different. Thus, it can be modeled as a satisfaction problem using the AllDifferent constraint only. We run 100 resolutions of random 9 × 9 and 16 × 16 Sudoku grids, with a timeout of 10 seconds. If no solution has been found within 10 seconds, we consider the run unsolved.
We consider the mean and median run-times to compare different representations of the AllDifferent constraint. We have two baselines: 1. a pure CSP model where constraints are predicates, and 2. an EFSP model with an efficient hand-crafted error function representing AllDifferent. We compare those with two models using error functions learned with our system to represent AllDifferent: a. our EFSP model using the most frequently learned error function from the previous experiments, run through our neural network in a feed-forward fashion, and b. our EFSP model with the same error function but directly hard-coded in C++. The solver and its parameters remain the same: the only thing modified across these four models is the expression of the AllDifferent constraint.

Results
Experiments 1 & 2
Table 1 shows the training errors of Experiment 1, where error functions have been learned 100 times for each constraint. The first column contains the normalized mean training error of the most frequently learned error function among the 100 runs, with its frequency in parentheses. The next columns give the median, the mean, and the standard deviation. Table 2 contains the normalized mean test errors of the error functions learned in Experiments 1 and 2, with their median, mean, and standard deviation. The normalized mean test error of the most frequently learned error function for each constraint in each experiment has been isolated in the first column of numbers, for comparison. Comparing Table 1 and the first half of Table 2 leads us to conclude that our system is able to learn error functions that scale for most constraints, namely AllDifferent, LinearSum and Minimum. Their median training errors in Table 1 are perfect or almost perfect, as are their median test errors on vastly larger constraint assignment spaces.
Results are not as good for NoOverlap1D and Ordered, which are clearly the most intrinsically combinatorial constraints among our five. One could think that our system is overfitting on its training set, but the results of Experiment 2 lead us to another conclusion.
To see this, let us look at Experiment 2's results by comparing the first and second halves of Table 2. Error functions learned over incomplete training spaces are as good as the ones learned over small complete spaces for LinearSum and Minimum. We observe significant improvements of the median and the mean for NoOverlap1D (36.07% and 55.76%) and Ordered (52.83% and 49.05%). This is not because error functions from Experiment 1 were overfitting, but because the spaces from Experiment 1 were too small for these highly combinatorial constraints, containing too few different combinations and Hamming cost patterns.

Experiment 3
The goal of this experiment is not to achieve state-of-the-art run-times for solving Sudoku, but to compare the average run-times of the same solver on the four nearly identical Sudoku models presented in Section 5.1. For the model with a hand-crafted error function, we implemented the primal-graph-based violation error of AllDifferent from Petit et al. [2001]. This function simply outputs the number of couples with identical values within a given assignment. To run this experiment, we used the GHOST framework, which includes a local search algorithm able to handle both CSP and EFSP models.

Table 3: Mean run-times in milliseconds over 100 runs to solve Sudoku with 4 different representations of the AllDifferent constraint. Rows in gray mean that some runs hit the 10-second timeout.

Table 3 shows that models with error functions clearly outperformed the model with the constraint represented as a predicate. Over 100 runs, no error function-based model hit the 10s timeout, but 4 runs of the regular constraint network model timed out on the 9 × 9 grid, and all of them on the 16 × 16 grid. Moreover, the learned error function hard-coded in C++ is nearly as efficient as the hand-crafted one (also coded in C++). The difference in runtimes between the hard-coded learned error function and the one computed through the ICN gives us an idea of the overhead of running such a function through the ICN.

Discussions
As Freuder [2007] wrote: "This research program is not easy because 'ease of use' is not a science." However, we believe our result is a step toward the 'ease of use' of Constraint Programming, and in particular of EFSP and EFOP. Our system is excellent at learning error functions of simple constraints over complete spaces. For intrinsically combinatorial constraints, learning over large, incomplete spaces should be favored. One of the most significant results in this paper is that our system outputs interpretable results, unlike regular artificial neural networks: error functions output by our system are intelligible. This allows our system to have two operating modes: 1. a fully automatic system, where error functions are learned and called within our system, being completely transparent to users, who only need to provide a concept function for each constraint in addition to the regular sets of variables V and domains D, and 2. a decision support system, where users can look at a set of proposed error functions, and pick and modify the one they prefer.
The current limitation of our system is that it struggles to learn high-quality error functions for very combinatorial constraints, such as Ordered and, in particular, NoOverlap1D. By combining the results of Experiments 1 and 2, we can conclude that: 1. our system is not overfitting, but needs more diverse and expressive operations to learn high-quality error functions for such constraints, and 2. the Hamming cost is certainly not the best choice to represent their assignment error.
An extension of our work would be to use reinforcement learning rather than supervised learning based on the Hamming cost. Learning via reinforcement would allow finding error functions that are more adapted to the chosen solver.