Explainable AI Insights for Symbolic Computation: A case study on selecting the variable ordering for cylindrical algebraic decomposition

In recent years there has been increased use of machine learning (ML) techniques within mathematics, including symbolic computation where it may be applied safely to optimise or select algorithms. This paper explores whether using explainable AI (XAI) techniques on such ML models can offer new insight for symbolic computation, inspiring new implementations within computer algebra systems that do not directly call upon AI tools. We present a case study on the use of ML to select the variable ordering for cylindrical algebraic decomposition. It has already been demonstrated that ML can make the choice well, but here we show how the SHAP tool for explainability can be used to inform new heuristics of a size and complexity similar to the human-designed heuristics currently used in symbolic computation.


Machine learning and mathematics
Machine Learning (ML) refers to statistical techniques that learn rules from data. They allow computer systems to improve their performance on a task without any change to their explicit programming. ML underpins the prominent AI advances of recent years, driven forward by the growth in available data and computing power, new ML architectures, and hardware designed specifically for them.
There are increasing attempts to use ML in mathematics. For example, Lample and Charton [37] trained a transformer to integrate expressions and find analytical solutions to differential equations; while in a survey article He [28] details a variety of attempts to use ML to predict properties from mathematical structures such as groups and graphs. We note a recent special issue of the Journal of Symbolic Computation dedicated to Algebraic Geometry and Machine Learning [27] including ML approaches to e.g. find the real discriminant locus [2]. However, it has been observed that mathematical and logical reasoning is an area of particular difficulty for ML, especially for the recent natural language tools such as ChatGPT.
Symbolic computation is not an obvious domain in which to apply ML. The Computer Algebra Systems (CASs) which implement symbolic computation algorithms are mainly designed to produce exact answers, rarely using numeric computation and preferring symbolic solutions. Lample and Charton [37] reported that their transformer was able to integrate far more examples correctly within a time limit than various CASs. However, this analysis conflated two very different cases of failure: when a solution could not be found within a time limit and when an incorrect solution was found. While the CASs may not solve problems quickly, they should never produce a wrong answer. Thus there is little appetite amongst CAS developers to use ML to directly produce output.

Safe use of machine learning in computer algebra systems
This does not mean that ML has nothing to offer Symbolic Computation. CAS algorithms often come with choices that have no effect on the mathematical correctness of the end result but can have a big impact on the resources required. Further, they can affect how the end result is presented (i.e. two very different, but mathematically equivalent, expressions). Consider for example the algorithm of Buchberger [9] to produce the Gröbner Basis for an ideal. It does not specify the order in which S-pairs are studied, the order in which the corresponding S-polynomial is reduced by the generating set, the monomial ordering to be used, nor the underlying variable ordering. Any decision for these choices allows the production of a correct Gröbner Basis but each decision affects the size of the basis produced, the polynomials within, and the time taken to compute them.
Sometimes there are well-documented strategies on how CASs make such choices (e.g. sugar selection for S-pairs [25]) but in many cases, they rely on human-made heuristics or "magic constants" [10] which are often not scientifically validated or even documented. Such choices are good candidates for ML: the underlying relationships are complex but not the main object of study. The earliest examples of ML for CAS optimisation known to the authors are: in 2014 Huang et al. [31] used a support vector machine to choose the variable ordering for cylindrical algebraic decomposition; in 2015 Kuipers et al. [36] used a Monte-Carlo tree search to find the representation of polynomials that are most efficient to evaluate; and in 2016 Simpson et al. [53] used ML classifiers to pick from various algorithms that compute the resultant, Huang et al. [30] used ML to decide whether to precondition input for cylindrical algebraic decomposition, and Kobayashi et al. [34] used ML to decide the order of sub-formulae solving for real quantifier elimination.
In all of these examples, the ML solution outperformed the previous human-made heuristics for the choices, at least on the dataset used. The algorithms of symbolic computation often have exponential worst-case complexity, but their average-case complexity is less widely studied. It is possible that these ML optimisations offer a route to avoid the worst case, at least for many applied examples. Thus there is great value in their continued study and exposition. However, the motivation of the present paper is whether ML can offer anything more than these efficiency gains. A recent article in Nature [16] suggested that ML can help pure mathematicians with the development of new theorems. Can ML offer such insight to the developers of symbolic computation algorithms?

Explainable artificial intelligence
Explainable AI (XAI) refers to ML techniques which offer an explanation as to how an AI decision was made. It has become a very active area of research, see for example DARPA's XAI programme [26]. The role of XAI is usually to allow for more effective use of the AI tools, and the building of user trust in the tools. But in our case, we hypothesise that XAI could reveal workings of the ML model that are useful to inform further development of the underlying symbolic computation.
We were motivated by the work of Peifer et al. [48] developing an ML solution to the problem of selecting S-pairs in Buchberger's algorithm. They trained an "agent" (ML model) to make the decision using reinforcement learning based on the number of polynomial additions required to process the choice. This agent outperformed existing strategies such as [25]. We draw attention to [48, Section 5.1], which tried to explain the agent's strategy (albeit without employing XAI tools). It identified some simple components, such as a preference for pairs whose S-polynomials are monomials and a preference for pairs whose S-polynomials are of low degree. Firstly, these components suggested decisions made on the basis of the S-polynomials rather than the S-pairs used by all the human-made heuristics. This opens a whole new category of heuristics to consider, perhaps leading to a greater understanding of the problem. Secondly, experiments using these strategies alone showed they outperformed the existing heuristics (but not the full ML agent). Thus the analysis suggests new "human-level" heuristics, i.e. heuristics of a similar size to those that are human-made and that, once discovered, can be implemented without any AI tools.
The present paper seeks to identify if XAI tools can offer an automated way to make such analyses, to gain new understanding of these choices in symbolic computation algorithms.

Contributions and plan of the paper
In this paper, we return to the problem of selecting the variable ordering for cylindrical algebraic decomposition, the first CAS optimisation by ML that was considered in [31]. We introduce the necessary background information on this problem in Section 2. We apply the popular SHAP methodology for XAI to an existing ML pipeline developed for the problem. We introduce SHAP in Section 3 and report on the results of its application on this problem in Section 4. In Section 5 we process the SHAP data to suggest some XAI-recommended human-level heuristics for the problem and finally, in Section 6 we report on the performance of these.
We find that one of these is equal to the current state-of-the-art human-developed heuristic for the problem [50] and that a human-level combination of these XAI-selected features allows us to outperform prior work. Thus our contributions are (a) a new state-of-the-art human-level heuristic and, perhaps more importantly, (b) evidence for a new methodology to employ in CAS development that uses AI in algorithm design but not implementation.

Cylindrical algebraic decomposition
Cylindrical Algebraic Decomposition (CAD) is an important symbolic computation algorithm, proposed by Collins [13] in 1975. Given a set of multivariate polynomials and a variable ordering, CAD will decompose the corresponding real space into connected regions, called cells, such that within each the sign of each polynomial is invariant (i.e. positive, negative or zero). Each cell is semi-algebraic, meaning it may be defined by a set of polynomial constraints; and the cells are arranged in cylinders with respect to the variable ordering, meaning the projections of any two cells onto a lower dimensional space with respect to the ordering are either equal or disjoint.
CAD has found applications in various fields, ranging from robotics [41] through economics [45] to biology [51]. It is most commonly used for, and was proposed by Collins [13] as, a sub-method to perform real Quantifier Elimination (QE): i.e. to transform a formula in first-order logic whose atoms are polynomial constraints into an equivalent one without any quantifiers. The decomposition into cells allows for studying the problem at a finite number of points, while the cylindrical structure allows for easy projection and negation of the cells to construct the quantifier-free solution.
However, CAD has a worst-case theoretical complexity that grows doubly exponentially with the number of variables [15]; something regularly encountered in practice and thus vastly reducing the scope of use cases for the algorithm. There have thus been many lines of research to improve CAD, such as improved projection operators [43], partial CAD [14], equational constraints [19], symbolic-numeric lifting [54], alternative computation schemes via triangular decomposition [11] and incremental processing of polynomials [35], and entirely new repackagings of the CAD theory [7], [33].

[Figure 1 caption: Using ordering x ≻ y, we obtain a CAD with 57 cells (the 18 shaded areas, 27 lines and 12 points). Using the ordering y ≻ x generates only 3 cells (2 areas and one line).]

CAD variable ordering
CAD, including all the optimised and extended forms listed above, requires there to be a declared ordering on the variables. This controls both the order of operations in the algorithms and the form of the output (being used in the definition of the cylindrical structure). Depending on the application, there may be a free or constrained choice for the ordering. For example, when using CAD for real QE the variables must be ordered as they are quantified: but variables in blocks of the same quantifier (and free variables) can be swapped with each other.
This choice of variable ordering has been shown to have a huge impact, both practically [17] and in terms of theoretical complexity: Brown and Davenport [8] showed there exists a family of polynomial sets such that the complexity of CAD can be constant or doubly exponential depending on the chosen variable ordering. The importance of the variable ordering can be appreciated in the simple two-variable example of Figure 1, where one of the orderings generates nineteen times as many cells as the other.

Prior human-designed heuristics for choosing the ordering
Once the importance of variable ordering was established, researchers began looking for strategies to choose a good variable ordering. The first attempts consisted of designing human-made heuristics. In 2004 Brown [6] documented a heuristic based on three simple hand-picked features of the set of polynomials, for the software QEPCAD. We call this heuristic Brown.
There have been other heuristics produced which can offer greater accuracy but at greater expense, performing increasing numbers of steps in the CAD algorithm: Dolzmann et al. [17] concluded it best to perform the projection stage of CAD and compare sums of the total degree of the polynomials produced (sotd); Bradford et al. [5] considered the initial decomposition of the real line; and Wilson et al. [55] the open cells in the decomposition. It is perhaps surprising that the original heuristic of Brown is able to perform competitively or even beat these others despite having access to less information [32]. This suggests the necessary information to identify a good ordering may be available from the input alone.
Most recently del Río and England [50] designed a new simple heuristic gmods based on properties of the input polynomials, selected by studying which features of the input polynomials have the greatest impact on the complexity analysis of CAD.

Prior AI-designed heuristics for choosing the ordering
In the last decade machine learning models have also been trained to select the variable ordering for CAD. The first attempt was made by Huang et al. [31] in 2014 who used a support vector machine for the task. Later England and Florescu [20] experimented with a wider range of models, methods for feature engineering [21] and improved metrics for training [22], culminating in a machine learning pipeline available to use for the task [23]. Separately Chen et al. [12] experimented with deep learning for variable selection.
In all cases, these ML models outperform existing heuristics. However, CAS developers may still be reluctant to include a method trained using ML in their software, possibly because they create a dependence on external code, or more likely due to concerns about whether over-fitting on the training data will lead to poor generalisation. Another issue is the lack of explanation these AI models are able to give about their choices.
We will experiment later using the tools for an AI selection of the CAD variable ordering presented in [23]. In that paper, the authors present a pipeline to train four ML models on the task. When specialised to three-variable polynomial sets, this utilises the ideas described in [21] to represent such polynomials with 81 features (which evaluate to floating point values). It is these numerical feature vectors that are presented to the ML models.
In this paper, we propose an approach that lies somewhere in between those of this subsection and the previous one. Rather than handpicking some features as in Section 2.3, we will start with the algorithmically produced features of Section 2.4, but then use XAI techniques to extract the most important features from the AI and use these to design heuristics that can easily be understood by humans and embedded into any CAD implementation without the use of AI software.

Explaining Machine Learning Models
XAI is often motivated as a means to build trust in a tool among its users. But there are certainly other reasons we may require an explanation of an ML model. Perhaps the model is making a poor decision and the data scientist wishes to understand why, so as to improve it; or perhaps the explanation tells us more about the underlying trends of the data, leading to new research directions. The latter is the motivation for explainability in this work. Within the XAI field there is a growing emphasis on designing models that are intrinsically explainable; however, the most common approach in use today is post-hoc explainability, whereby a model is analysed after its creation and an explanation is generated. The review article by Du et al. [18] offers a good overview of XAI.
Since we work with the existing ML pipeline [23] already designed for the problem, we do not consider editing the ML itself for explainability but instead adopt the more common post-hoc analysis. That pipeline works with several ML models, and so we desire a model-agnostic method for explainability. The following is the current state-of-the-art process for this.

SHAP
Molnar [44] gives an overview of many explainability methods along with their advantages and disadvantages. We choose SHAP (SHapley Additive exPlanations) [40], a model-agnostic method, for our post-hoc explainability analysis to identify the features of most use to the ML models. SHAP was introduced in 2017 by Lundberg and Lee [40] as a unification of six existing explanation methods already present in the literature: LIME, DeepLIFT, Layer-Wise Relevance Propagation, classic Shapley regression values, Shapley sampling values, and Quantitative Input Influence. Because SHAP unifies these various methods, it can be used for both global and local explanation of models: i.e. it can offer explanations for performance on an individual problem or an entire dataset.
The name and theory behind SHAP are based on the classical concept of Shapley values from game theory, which quantify the contribution of each player to a game [52]. SHAP makes use of an explainer model, an approximation of the original model. The idea is to determine the contribution of each player to the score of the game by playing the game with every combination of players, i.e. over the power set of the players. Take an example game with three players: player 1, player 2, and player 3. Each player plays the game alone, then each pair of players plays together, and finally all three play together. To obtain player 1's contribution we take a weighted average of the differences between the final scores of the games where player 1 was present and those where player 1 was absent. In our application of SHAP, the features of our polynomial sets are the players and the ML models are the games: SHAP removes and adds each feature to the model to see what effect that feature has.
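The weighted-difference computation described above can be made concrete. Below is a minimal, self-contained sketch (not part of the SHAP library) that computes exact Shapley values for a toy three-player game; the game `v` and the player labels are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution averaged
    over all coalitions, weighted by how often that coalition precedes the
    player in a random ordering of the players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(coalition) | {p}) - value(set(coalition)))
        phi[p] = total
    return phi

# Toy "game": player 1 earns 10 alone, players 2 and 3 earn 5 each,
# and players 2 and 3 together earn a bonus of 5.
def v(coalition):
    score = 0
    if 1 in coalition:
        score += 10
    if 2 in coalition:
        score += 5
    if 3 in coalition:
        score += 5
    if {2, 3} <= coalition:
        score += 5
    return score

phi = shapley_values([1, 2, 3], v)
# phi[1] = 10; players 2 and 3 split their joint bonus: phi[2] = phi[3] = 7.5.
# Efficiency property: the values sum to v({1, 2, 3}) = 25.
```

In the SHAP setting, each polynomial-system feature plays the role of a player and the model's output on an instance plays the role of the game's score.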
Since we are working over the power set of all the features in the model, such a calculation quickly becomes very expensive: exponential in the number of features of the model [40]. However, the open source SHAP package [39] uses many estimators and powerful shortcuts to make this extensive set of power-set calculations feasible for larger models. For example, Kernel SHAP, used to analyse all the models other than the decision tree in this work, uses a special weighted linear regression to compute a feature's importance, as outlined in the README of the GitHub page for SHAP. For the decision tree model the tree explainer is used: a fast method to estimate SHAP values for tree models.

Figure 2 shows a SHAP waterfall plot, one of SHAP's visualisations for understanding a model's decision on a single instance. The ML model studied here predicts whether a person's income is less than or greater than $50k annually [38]. Negative values imply probabilities of less than 0.5 that a person earns over $50k. The blue bars are the features that contribute to a prediction of less than $50k, while the red bars are the features that contribute to a prediction of more than $50k. The features are ordered from top to bottom by the size of their effect on the final model output. In a waterfall plot, E[f(X)] is the base value from which each decision starts: the average of the output across all instances, which would be the model's best guess for this dataset if there were no features. The grey values next to each feature are the values of that feature for the instance in question.

Multi-class SHAP
The ML pipeline we experiment with is designed for multi-class classification, i.e. classification with more than two output classes. SHAP may also be applied here: it simply creates a binary model for each output class, and the binary SHAP model whose output for an instance is closest to one gives the prediction.
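As a sketch of this one-vs-rest selection (the scores below are illustrative, echoing the kind of per-ordering outputs discussed in Section 4):

```python
# One-vs-rest selection: each ordering has its own binary classifier,
# and the ordering whose binary score is highest becomes the
# multi-class prediction; that classifier's SHAP values then explain it.
def predict_ordering(binary_scores):
    # binary_scores: {ordering_label: score of that ordering's binary model}
    return max(binary_scores, key=binary_scores.get)

# Illustrative scores for the six orderings of three variables.
binary_scores = {1: 0.02, 2: 0.01, 3: 0.03, 4: 0.02, 5: 0.71, 6: 0.34}
chosen = predict_ordering(binary_scores)  # Ordering 5 is selected
```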

SHAP Applied to ML for CAD Variable Ordering
We work with the feature engineering approach to the problem developed by Florescu and England [21]. This algorithmically generates features to represent systems of polynomials to ML models. It was inspired by the earlier hand-picked features based on variable degree and sparsity [6], [31] and generalised these to the widest set of such features that could be generated in a similar way. When applied to three-variable polynomial systems this framework creates 81 distinct features, which the pipeline in [23] uses as the input to train four ML models (implemented in Python with the scikit-learn package [47]): Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). The output that each model predicts is a class (1–6) representing one of the six possible variable orderings of three variables. In these papers, the experiments showed that the ML models all outperform the best performing human-made heuristic of the time, that of Brown [6].
The work described in this section uses SHAP to gain insight into how the ML models from [23] achieve this performance. We seek to understand whether the ML models find the most important features to be those selected by humans, supplemented with additional information, or whether they instead rely on some previously overlooked features. Any prominent features previously unconsidered will be studied later in the paper. This section first looks at the decision-making process for a single test instance to gain a local understanding of a particular ML model (the MLP). The process from local decisions to a global view is then explained, and the top five features for each model across the entire dataset are identified and discussed.

Methodology
The methodology used in applying SHAP to ML algorithms for picking the CAD variable ordering is described next: first the dataset used to train the algorithms, then the specific manner in which SHAP was implemented, and finally details of the features used as input to the ML algorithms.

Dataset
We started by recreating the experiments of Florescu and England [23]. These used a dataset of random problems generated with the same statistical properties as the industrial problems hosted in the QF_NRA category of the SMT-LIB [1]. We identified that the dataset used in [23] had a majority of problems with class 5 or 6 as the output class. In this paper, we applied the additional step of balancing the dataset so that the classes are more equal in size. We take advantage of the fact that the nature of a polynomial set does not change with the names of its variables. This allows for a random permutation of the variable names (x_1, x_2, x_3), and a corresponding change of the best ordering for the output, to obtain a balanced version of the dataset. Table 1 records which output class number corresponds to which ordering of the variables.

Ordering Name  Ordering
Ordering 1     x_1 ≻ x_2 ≻ x_3
Ordering 2     x_1 ≻ x_3 ≻ x_2
Ordering 3     x_2 ≻ x_1 ≻ x_3
Ordering 4     x_2 ≻ x_3 ≻ x_1
Ordering 5     x_3 ≻ x_1 ≻ x_2
Ordering 6     x_3 ≻ x_2 ≻ x_1

We note that while this renaming makes no difference to the meaning of the dataset, training an ML model on an unbalanced dataset can lead to the model simply choosing the class that occurs most often, and being rewarded with higher accuracy for doing so. In general, we might expect this to lead to overfitting, where the model may perform far worse on data from a different source. In our experiments the Decision Tree was found to have particularly suffered from this, making far poorer choices when trained with the balanced dataset; however, the other three models performed similarly well with balanced and unbalanced data and, in particular, still outperformed the human-designed heuristics in the unbalanced case.
From the point of view of this paper, balancing is particularly important for explainability. Our initial SHAP analysis of the models trained on the unbalanced dataset clearly reflected a simplistic preference for the majority classes and gave us little insight into the important features of this application.

Table 2 shows the very significant differences in SHAP run-time between the models. This is due both to how long a single ML model takes to train, and to the approximations that the SHAP library is able to make use of. We note that SHAP is a procedure we run once to gain insight, and so we are not greatly concerned with its one-off cost. Due to the long SHAP running times, 500 training points are used for each SHAP calculation, chosen randomly from the 2947 total training instances used to train the full models described above, and the SHAP calculations are performed on just 100 test points. This gives sufficient insight into the model decisions without requiring more than 12 hours of calculation time. We refer back to Section 3.1 for information on how SHAP makes its calculations.
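The balancing step can be sketched as follows. The representation used here (a problem as a list of polynomials, each a list of monomial degree tuples, labelled by its best variable ordering written as a permutation of the variable indices) and the helper `relabel` are our own simplifications for illustration, not the pipeline's actual data format.

```python
def relabel(problem, label, perm):
    """Apply a renaming of variables to a problem and its label.

    perm[i] = j means variable i is renamed to variable j.  Renaming
    cannot change the geometry of the problem, so the best ordering
    (the label) is simply renamed in the same way.
    """
    new_problem = [
        # The new degree in variable j is the old degree in perm^{-1}(j).
        [tuple(degrees[perm.index(j)] for j in range(3)) for degrees in poly]
        for poly in problem
    ]
    new_label = tuple(perm[i] for i in label)
    return new_problem, new_label

# Example: the system {x_1^2 x_3 + x_2}, best ordering x_1 > x_2 > x_3,
# under the renaming x_1 -> x_3, x_2 -> x_1, x_3 -> x_2.
problem = [[(2, 0, 1), (0, 1, 0)]]
perm = (2, 0, 1)
new_problem, new_label = relabel(problem, (0, 1, 2), perm)
# new_problem == [[(0, 1, 2), (1, 0, 0)]]   i.e. {x_3^2 x_2 + x_1}
# new_label   == (2, 0, 1)                  i.e. x_3 > x_1 > x_2
```

Applying random permutations of this kind to instances of the over-represented classes evens out the class distribution without changing the underlying mathematics.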

Features
The derivation of the features used in the dataset is explained in [21]. In brief, instead of handpicking a set of features we algorithmically generate many of them. This is done by acting upon a list of lists of lists, with the innermost lists capturing the degrees of the variables in a monomial; the monomials are gathered together next within polynomials, and the polynomials finally within the whole system. The actions performed on these are simple (cheap) operations such as taking the maximum, sum or average of the numbers. By considering all possible combinations, this method generates 81 distinct features for three-variable systems, from which the ML models can choose those they find most useful. We define in Table 3 some of the terminology used in creating these features.
For example, sum(max(v_1(S))) is one of the features that multiple models place importance upon. Its direct translation into plain English is "the sum across all polynomials of the maximum of the degrees of its monomials for the variable x_1", which simplifies to "the sum across all polynomials of their degrees in the variable x_1". Similarly, the meaning of avg(sg(max(v_1(S)))) is "the average across all polynomials of the sign of the maximum of the degrees of its monomials for the variable x_1", which simplifies to "the proportion of polynomials that contain the variable x_1".
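A simplified sketch of this style of feature generation is below. It composes max/sum/average reductions (with an optional sign step) over the degree structure of a system, but it is only a plausible reading of the construction: it is not the exact derivation of [21] and does not reproduce the precise set of 81 features.

```python
from statistics import mean

OPS = {"max": max, "sum": sum, "avg": mean}
sg = lambda x: (x > 0) - (x < 0)  # sign of a number: -1, 0 or 1

def v(i, system):
    """Degrees of variable i (0-indexed) in each monomial of each polynomial."""
    return [[mono[i] for mono in poly] for poly in system]

def features(system, n_vars=3):
    """Compose reductions: inner op within each polynomial, optional sign,
    outer op across the polynomials of the system."""
    feats = {}
    for i in range(n_vars):
        per_poly = v(i, system)
        for in_name, in_op in OPS.items():
            inner = [in_op(p) for p in per_poly]
            for out_name, out_op in OPS.items():
                feats[f"{out_name}({in_name}(v_{i+1}(S)))"] = out_op(inner)
                feats[f"{out_name}(sg({in_name}(v_{i+1}(S))))"] = out_op(
                    [sg(x) for x in inner]
                )
    return feats

# S = {x_1^2 x_2 + x_3, x_1 x_3^2}: degree tuples (d_1, d_2, d_3) per monomial.
S = [[(2, 1, 0), (0, 0, 1)], [(1, 0, 2)]]
f = features(S)
# sum(max(v_1(S))): the x_1-degrees of the two polynomials are 2 and 1, so 3.
# avg(sg(max(v_1(S)))): both polynomials contain x_1, so the proportion is 1.
```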

SHAP Results
In this section, we interpret the results from SHAP when applied to our problem using the methodology above. First, we explain how it can be used to give an explanation for the classification of a single problem instance: a local SHAP analysis. Then we aggregate across all instances to give an overview of what SHAP tells us about the most relevant features for each model over the dataset: a global SHAP analysis.

Local SHAP Analysis
For each ML method, for each of the six binary classification models, and for each instance, SHAP can produce a waterfall plot (see Section 3.2) to show the effect each feature has on the final output. As an example, let us look in detail at the MLP model for a particular test instance. An explanation for the binary classifier of whether or not the instance belongs to class 5 (selects Ordering 5) is given via the waterfall plot in Figure 3. The blue arrows in the waterfall indicate that the feature is pushing the prediction towards an output of 0 (that this is the wrong ordering), and the red arrows that the feature pushes towards an output of 1 (that this is the right ordering).
The binary classifier suggested the instance should be classified as Ordering 5 with a probability of 0.71. This was the highest probability among the six binary classifiers and so Ordering 5 was selected as the classification. Orderings 1–4 get a score very close to 0. Ordering 6 receives a score much higher than 0, but not as high as Ordering 5. This is not surprising: Orderings 5 and 6 both select the same variable to project first, and it has been observed previously that this choice has the greatest impact on the calculation time of the CAD.

Global SHAP Analysis
To get the global importance of the features for each model over the dataset, we aggregate the absolute values of the contributions of the features across all instances and binary classifiers. The five most important features found by SHAP for each model are given in Table 4.
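A minimal sketch of this aggregation (pure Python, with invented toy numbers; in practice the computation runs over the SHAP arrays returned for all six binary classifiers):

```python
def global_importance(shap_per_class):
    """Sum of absolute SHAP contributions per feature.

    shap_per_class[c][i][f] is the signed contribution of feature f to
    the binary classifier for class c on instance i.
    """
    n_features = len(shap_per_class[0][0])
    importance = [0.0] * n_features
    for class_vals in shap_per_class:
        for instance in class_vals:
            for f, contrib in enumerate(instance):
                importance[f] += abs(contrib)
    return importance

# Toy numbers: 2 binary classifiers, 2 instances, 2 features.
shap_vals = [
    [[0.5, -0.1], [-0.5, 0.2]],
    [[0.1, 0.4], [0.0, -0.3]],
]
importance = global_importance(shap_vals)  # approximately [1.1, 1.0]
```

Note that the signs cancel out a feature's direction of influence: a feature that pushes hard towards a class on some instances and away on others still registers as globally important.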
Plots of the relative importance of features to each model by class are given in Figure 4. These plots give good insight not only into each feature's impact but also into how a model trains and learns from the data. The KNN model, for example, creates little distinction between the impacts of the features. This is what we would expect from a KNN, because the model looks for the instances closest to a test point, and it is the aggregate of the instances close to that point that decides the output. The same applies to the SVM. In contrast, the MLP and DT models show more distinction in the importance of these features: decision trees in particular are known to place importance on a few features.
Upon close examination of Figure 4, some interesting phenomena may be observed. We see that the impact on an output class of the features of a certain variable is often largest when that variable comes last in the ordering of that class. For example, the feature with the greatest impact on the KNN model is an x_3 feature, and Orderings 1 and 3 have the highest proportional impact, which (referring back to Table 1) are the orderings with x_3 last. We see this for the x_2 features and the SVM model as well. But the MLP model shows different behaviour: here, only x_2 and x_3 features are seen, and Orderings 3 and 5 have the greatest impact.

Post SHAP analysis to identify the most relevant features
SHAP produced for us a global analysis explaining the features that are important to each of the four ML models we experiment with. We now work to create a unified ranking of the features across the four models for the purpose of creating new human-level heuristics in Section 6.

Merging features that generate the same heuristic
We aim for heuristics independent of the particular variable names which are present in the ML features. I.e. the prominence of sum(sum(v_1(S))) in the ML will lead us to the same heuristic as the prominence of sum(sum(v_3(S))). So we now aggregate the contribution of the features across the three variables, resulting in rankings like Table 5 for each model. This reduces the original 81 features to 27. Here v_i(S) is a short notation denoting the combination across the three variables: the score of the merged feature sum(sum(v_i(S))) is the sum over i = 1, 2, 3 of the scores of the individual features sum(sum(v_i(S))).
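This merging step can be sketched as follows; the feature names follow the notation above, but the scores are invented for illustration.

```python
def merge_across_variables(importance):
    """Aggregate the importance of name-specific features such as
    sum(sum(v_1(S))), sum(sum(v_2(S))) and sum(sum(v_3(S))) into one
    variable-agnostic entry written sum(sum(v_i(S)))."""
    merged = {}
    for name, score in importance.items():
        key = name.replace("v_1", "v_i").replace("v_2", "v_i").replace("v_3", "v_i")
        merged[key] = merged.get(key, 0.0) + score
    return merged

scores = {
    "sum(sum(v_1(S)))": 0.30,
    "sum(sum(v_2(S)))": 0.25,
    "sum(sum(v_3(S)))": 0.35,
    "max(max(v_1(S)))": 0.10,
}
merged = merge_across_variables(scores)
# merged["sum(sum(v_i(S)))"] is approximately 0.90
```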

Creating a unified ranking
We now wish to create a combined feature ranking across the four models. We view this as a voting problem: each of the models provides a different ranking of the features and we must produce a combined ranking. A common method to combine such rankings is the Borda count, where each ranking assigns a penalty of 1 to its first option, 2 to its second, and so on; at the end the options are ranked from lowest total penalty to highest. However, this method strongly penalises being ranked very badly by just one of the models: a feature that is the most relevant for three of the models while being the most irrelevant for one of them would not end up in a good position in the final ranking.
Thus we use a modification, the Dowdall system, which assigns the first feature a reward of 1, the second a reward of 1/2, the third a reward of 1/3, and so on; at the end the options are ranked from highest reward to lowest. See for example the work of Fraenkel and Grofman [24] on these voting systems. We chose the Dowdall system because it merges the preferences without a big penalisation for being a distant preference of one of the models, contrary to the Borda count. Table 7 shows the features in our final ranking together with their rewards from the Dowdall system. The top-ranked feature is the sum of the degrees of the polynomials in the set: this is precisely the feature used in gmods, which was found in [50] through the study of the complexity analysis of CAD. The third, tenth and eighth ranked features describe the quantities used by the Brown heuristic [6].
The second most voted feature, the average across the polynomials of the average degree of the variable across the monomials, has never been considered relevant in any prior work.
The top six features are chosen for further experimentation in the next section to find new human-level heuristics. We draw the line here as there is a large drop-off after the sixth score, below which the features score much more similarly to each other, i.e. the feature scores stabilize after the sixth. This stabilization of the scores is illustrated in Figure 5.

New Human-Level Heuristics Motivated by XAI
In this section, we will first introduce the methodology we use to evaluate a heuristic, and then experiment with new human-level heuristics motivated by Section 5. We will compare these against the current state-of-the-art heuristics for choosing the variable ordering for CAD.
To evaluate the performance of these heuristics two things are needed: a set of meaningful benchmark examples and metrics to quantify the performance of the heuristics on them. We will use the benchmarks and metrics described recently by del Río and England [50] for this purpose.

Benchmarks
We source benchmarks from the three-variable problems in the QF_NRA category of the SMT-LIB [1]. These examples are all satisfiability problems and so do not represent the full application range of CAD, which can also address quantifier elimination. However, they mostly stem from real applications, making performance upon them particularly meaningful. Common sources are problems arising from the theorem prover MetiTarski [46], attempts to prove termination of term-rewrite systems, verification conditions from KeYmaera [49], and parametrised generalisations of geometric problems, as well as problems from economics [45] and biology [4].
A sign-invariant CAD is built using Maple [11] for each example in each variable ordering: 5599 problems finished for at least one ordering before the time limit (60 seconds). Of these, the authors select 1019 "unique" problems where, as described in [50,Section 4.1], unique means the CAD tree structures built are not identical to another example for every ordering. This benchmark merging is needed since some application sources in the SMT-LIB produce examples that differ from each other only slightly, leading to identical CAD computations, and thus potentially unfair evaluation.
The benchmarks used here are identical to those in [50] and so readers may refer to the open source dataset released for that paper on Zenodo 7 .

Evaluation metrics
For evaluation metrics, we again follow del Río and England [50] and use the three selected there as follows.
• Accuracy: Percentage of benchmarks in which the optimal ordering (quickest in runtime) coincides with that chosen by the heuristic.
This is the most commonly used tool in ML classification and was used to train the ML models earlier 8 . However, it is not ideally suited to our problem: it considers as equally wrong a terrible ordering and an ordering that is slightly sub-optimal, even if the difference compared to the optimal ordering is so small that it could be due to computational noise.
• Total time: Amount of time needed to create a CAD for each benchmark when choosing the variable ordering given by the heuristic.
This metric corresponds most closely to the absolute performance achieved and would be most commonly used to rank a computer algebra implementation. However, it gives almost no weight to the orderings chosen for the smaller problems in our dataset, and so a small number of large problems could distort the findings. Hence del Río and England [50] proposed the next metric to address the shortcomings of both of the above: one that does not ignore the simple problems but at the same time distinguishes between different non-optimal orderings.
• Markup: The markup of the time needed by the heuristic choice with respect to using the best choice, defined as the average across all benchmarks of markup = (t_heuristic − t_optimal) / (t_optimal + 1).
The +1 is added in the denominator to reduce the impact that computational noise can have on problems that require only tenths of a second. When a CAD could not be created within the time limit using the ordering chosen by a heuristic, the time of that heuristic was set in the dataset to twice the time limit. These runtimes are also in the Zenodo data release for [50].
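The markup metric and the timeout convention above can be sketched as follows. The timing values in the example are invented for illustration; only the formula and the 2× timeout cap come from the text.

```python
def markup(t_heuristic, t_optimal):
    # markup = (t_heuristic - t_optimal) / (t_optimal + 1); the +1 damps
    # noise on problems that take only fractions of a second.
    return (t_heuristic - t_optimal) / (t_optimal + 1.0)

def evaluate(heuristic_times, optimal_times, timeout=60.0):
    """Return (total time, average markup) for one heuristic.

    A `None` entry means the heuristic's chosen ordering timed out,
    which is recorded as twice the time limit, as in the dataset.
    """
    capped = [2 * timeout if t is None else min(t, 2 * timeout)
              for t in heuristic_times]
    total = sum(capped)
    avg_markup = sum(markup(h, o)
                     for h, o in zip(capped, optimal_times)) / len(capped)
    return total, avg_markup

# Three illustrative benchmarks; the third timed out for this heuristic.
total, avg = evaluate([0.5, 10.0, None], [0.4, 8.0, 50.0])
print(total, avg)
```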

Existing heuristics
The prior state-of-the-art heuristic for choosing the CAD variable ordering is gmods, as presented by del Río and England [50]. It is based on a metric motivated by the complexity analysis of CAD: the analysis found that, given a set of polynomials S_n and the ordering x_{i_n} ≻ · · · ≻ x_{i_1}, the maximum number of cells that can be generated is proportional to a bound (1) computed from the degrees of the projection sets, where S_{j−1} is the projection of S_j with respect to x_{i_j}. The heuristic mods was first proposed in [50]: it computes the sets S_1, . . . , S_{n−1} for every variable ordering and then chooses the ordering that minimises this worst-case complexity, i.e. that minimises (1). This performed very well, in the sense that it chooses good orderings. However, it is an expensive heuristic, and the cost of computing the CAD projection phase for all the orderings was not compensated by the improved quality of the chosen ordering on the dataset in question. This is illustrated later by the experimental results in Table 8, where mods takes into account the additional cost of the heuristic while free-mods does not.
To solve this issue del Río and England [50] decided to avoid performing any projections, for the purpose of choosing the variable ordering, that would not be utilised by the CAD itself. Instead, they proposed to pick the first variable of the ordering, x_{i_n}, to be the one with minimal sum(max(v_i(S_n))), choosing the one with the lowest index in case of a tie. Only then do they take a CAD projection of S_n with respect to x_{i_n} to obtain S_{n−1}, which is then studied to pick the second variable x_{i_{n−1}} as the one with minimal sum(max(v_i(S_{n−1}))). This allows the heuristic to gather as much information as possible before choosing the next variable, continuing this process to determine the whole variable ordering.
This heuristic does not require any more CAD projections than those needed to run the CAD algorithm. In this paper we refer to heuristics that follow this structure as "greedy" (in the traditional algorithmic sense that they make a choice based on the local information). All the heuristics introduced below are greedy 9 .
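The greedy structure described above can be sketched as follows. The polynomial representation (a list of polynomials, each a list of monomials mapping variables to degrees) and the projection stand-in are simplifying assumptions for illustration; a real CAD projection operator in a CAS is far more involved.

```python
def sum_max_deg(polys, var):
    # sum(max(v_i(S))): sum over the polynomials of their degree in `var`.
    return sum(max((m.get(var, 0) for m in p), default=0) for p in polys)

def greedy_order(polys, variables, feature, project):
    """Choose a variable ordering one variable at a time.

    `feature(polys, var)` scores a candidate (lower is better) and
    `project(polys, var)` stands in for the CAD projection operator.
    """
    order, remaining = [], list(variables)
    while remaining:
        best = min(remaining, key=lambda v: feature(polys, v))
        order.append(best)
        remaining.remove(best)
        if remaining:
            # Only project with respect to the variable actually chosen,
            # so no projection is wasted on unused orderings.
            polys = project(polys, best)
    return order

# Toy "projection" that merely deletes the variable from each monomial.
def drop_var(polys, var):
    return [[{k: d for k, d in m.items() if k != var} for m in p]
            for p in polys]

polys = [[{"x": 2, "y": 1}], [{"y": 3}]]
print(greedy_order(polys, ["x", "y"], sum_max_deg, drop_var))
```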
Since the source benchmarks contain a bias towards certain orderings, the authors choose a variable randomly whenever there is a tie, instead of choosing the one with the lowest index or the first lexicographically, which could allow that bias to affect the results.
A well-known heuristic, widely used for many years, is that of Brown [6]. Denoted Brown, this uses three criteria in turn, breaking ties with the subsequent ones. It projects first the variable that minimizes:
• max(max(v_i(S))) (i.e. the variable with lowest overall degree); breaking ties with
• max(max(sv_i(S))) (i.e. the variable with the lowest total degree in the monomials containing it); breaking ties with
• sum(sum(v_i(S))) (i.e. the variable with the lowest sum of all its degrees).
It is not specified what to do if there is a third tie: our implementation simply picks between the tied variables randomly. Moreover, it is not specified whether the whole variable ordering is chosen at once or one variable at a time; in this paper we implement the greedy variant, i.e. we choose the first variable to be projected and then use the projected polynomials to choose the next variable in the ordering.
Table 8 evaluates these existing heuristics on our benchmark set. We also include the performance of a virtual best heuristic (the hypothetical heuristic which always picks the optimal ordering), to show the best possible figures for the dataset, and a heuristic that picks at random. As in [50] we find gmods to be the best on all metrics except accuracy, where it is beaten by mods, but at a cost not worth paying for this dataset.
Note that for some heuristics the number of problems completed is not a natural number. This is because these heuristics are not deterministic (due to the random choices made to break ties), so the experiments were repeated 5000 times and the average reported.

Creating new greedy heuristics from XAI identified features
We will now build new heuristics based solely on a single feature from the earlier ML models (a category into which gmods itself falls). We aim to see how powerful these features are when making decisions on their own.
The explainability analysis in Section 5 found sum(max(v_i(S))) to be the most impactful feature. We create a greedy heuristic SumMaxV based on this feature: at each projection step it selects the variable for which this feature is minimal (in the case of a tie, one of the tied variables is chosen randomly).
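The random tie-breaking can be sketched as below. Here we assume the sum(max(v_i(S))) values have already been computed for each variable; the variable names and scores are illustrative.

```python
import random

def summaxv_choice(feature_values, rng=random):
    """Pick the variable minimising its precomputed feature value.

    `feature_values` maps each variable to its sum(max(v_i(S))) score;
    computing these from actual polynomials is left to the CAS.
    Ties are broken by a uniform random choice, as described above.
    """
    best = min(feature_values.values())
    tied = [v for v, d in feature_values.items() if d == best]
    return rng.choice(tied)

print(summaxv_choice({"x": 5, "y": 3, "z": 3}))  # prints "y" or "z"
```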

Performance of new single feature heuristics
We now evaluate the new heuristics based upon the features that ranked in the top six in Table 7 (selected as discussed at the end of Section 5.2). AvgAvgSgV is based upon the feature avg(avg(sg(v_i(S)))), SumMaxSV upon sum(max(sv_i(S))), and so on. The results are shown in Table 9.

As can be observed in Table 9, two features stand out: SumMaxV and AvgAvgV. They achieve similar results, with the best dependent on the metric used and neither dominating the other.

New heuristics of triples of features identified by SHAP
We have now seen a number of single-feature heuristics doing well, without a clear winner. We will next combine these into triples, using subsequent ones as tie-breakers, similarly to the heuristic of Brown [6]. This should give a more informed choice than selecting a random variable and is still certainly at a human level in size and complexity.
We denote these multi-feature heuristics by concatenating the feature names with ">", where a feature to the right of a ">" is used as a tie-breaker for those to its left. Following this notation, the Brown heuristic would be denoted MaxMaxV>MaxMaxSV>SumSumSgV.
There are 120 possible ordered triples that can be created from the top six features of Table 7. For each we create a heuristic that chooses the variable that minimises the first feature, breaking ties using the second feature and the third if necessary, and finally breaking ties by choosing a random variable if there is a tie for all three features.
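The construction above can be sketched as follows. The feature functions are placeholders; tuple comparison in Python gives exactly the lexicographic tie-breaking described, and a full three-way tie falls back to a random choice.

```python
import random
from itertools import permutations

def triple_choice(polys, variables, features, rng=random):
    # Score each variable by the tuple of its three feature values;
    # tuples compare lexicographically, so later features only matter
    # when earlier ones tie. A tie on all three is broken randomly.
    scores = {v: tuple(f(polys, v) for f in features) for v in variables}
    best = min(scores.values())
    return rng.choice([v for v, s in scores.items() if s == best])

# The 120 ordered triples drawn from the top six features (names only).
top_six = ["SumMaxV", "AvgAvgV", "SumSumV", "AvgAvgSgV", "SumMaxSV", "MaxMaxV"]
triples = list(permutations(top_six, 3))
print(len(triples))  # 120
```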
Performance statistics for the two triples that are each best in one of the metrics are given in Table 10, along with Brown, which makes its choices with a similar quantity of information.

Table 10: Evaluation metrics for the different heuristics to choose the variable orderings for CAD. In bold, the best measure of the metric out of all the heuristics.
T1=SumMaxV>AvgAvgV>SumSumV and T2=SumMaxV>AvgAvgSgV>SumSumV. Table 10 shows that the heuristic T1=SumMaxV>AvgAvgV>SumSumV wins in two of the four metrics: total time and the number of problems completed within the time limit. Note that this heuristic is built from the top three XAI-suggested features, in the order they appear in the voting system.
• sum(max(v_i(S))) (i.e. the variable with the lowest degree in the product of the polynomials); breaking ties with
• avg(avg(v_i(S))) (i.e. the variable with the lowest average over the polynomials of its average degree); breaking ties with
• sum(sum(v_i(S))) (i.e. the variable with the lowest sum of all its degrees).
This triple also outperforms all the single-feature heuristics. In Figure 6 we give a survival plot comparing the triple T1 to the prior state of the art and the virtual best, for just the hardest problems in the dataset (those taking more than 30 seconds in all orderings). Observe the gap that exists between the prior state of the art and the best possible in theory, with the new XAI-informed triple bridging a good deal of it.

Conclusions
In this paper, we have devised a new state-of-the-art human-level heuristic for choosing a CAD variable ordering. Previous approaches to the problem were of two types: either an expert hand-picking features to create a human-designed heuristic, or an ML model computing with many features under the hood. We presented an intermediate approach: we start with an ML model but then use XAI to select important features for later use independent of any ML technology. We note that SHAP was able to find the feature recently identified by del Río and England [50] that was the prior state-of-the-art heuristic, and it also identified the tools of the heuristic of Brown [6]. But it additionally found an important feature not previously considered by experts.
There is of course scope for future work. Using multiple features clearly improved the results, and there are certainly other ways we could consider combining them; perhaps the combination itself could be ML-informed. It would also be interesting to explore ML models that are inherently explainable, rather than relying only on post-hoc explainability analysis.
The experiments presented in this paper are for selecting CAD variable ordering. However, the methodology could be applied almost directly to variable ordering choices for other algorithms, and could be adapted to many algorithm optimisation problems in symbolic computation and beyond. We have shown that ML has a role to play in symbolic computation, even for developers who prefer not to include a reliance upon it in their final code. It can be used as a tool to guide and develop algorithms at a human level. We thus hope and expect to see other applications of XAI to inform decisions in CASs and optimise symbolic computation algorithms in the coming years.