Trees and Forests in Nuclear Physics

We present a detailed introduction to the decision tree algorithm using some simple examples taken from the domain of nuclear physics. We show how to improve the accuracy of the classical liquid drop nuclear mass model by performing Feature Engineering while using a decision tree. Finally, we apply the method to the Duflo-Zucker mass model showing that, despite their simplicity, decision trees are capable of obtaining a level of accuracy comparable to more complex neural networks, but using way less adjustable parameters and obtaining easier to explain models.


I. INTRODUCTION
In recent years, we have assisted to the enormous growth and development of new statistical tools for data science [1]. Although these methods are extremely powerful to understand complex data and detect novel patterns, they are still rarely adopted by the nuclear physics community. Only few groups are currently pioneering the applications of these methods to the field. These topic have been recently discussed in a series of workshops on Information and Statistics in Nuclear Experiment and Theory (ISNET). Recent developments in this field are documented in the associated focus issue published in Journal of Physics G [2]. The aim of the current guide is to illustrate a widely used algorithm in data analysis. Similarly to our previous guide on bootstrap techniques [3], we present the decision tree starting from very basic models and finally applying it to more realistic problems like nuclear masses prediction.
Decision trees are already implemented within major experimental collaborations as MiniBooNE to improve the performances of particle detectors [4,5], but they are not yet widely used in low energy nuclear physics, where they could help to analyse both experimental data [6] and theoretical models.
The scope of the current guide is to provide a very detailed and pedagogical introduction to the topic of decision trees. To this purpose, all the numerical codes used here are provided as Supplementary Material.
Following the notation and terminology of Leo Breiman's paper Statistical Modelling: The Two Cultures [7], we want to investigate a process f that transforms some input X into an output Y . That is, f is a function: where the input X can be quite general, from images to a table of data, while the output Y can be a discrete or continuous set. In the first case, it is named a classification problem, in the latter a regression problem. Rather than focusing on investigating the fine details of the process f with many restrictive assumptions (an approach that is named data model culture in [7]), we consider f as a black box mapping X to Y , and we try to approximate it. The goal when using Machine Learning is to find a representation (or approximation)f of the function f .f is called a model and it is a function with the same domain and codomain of the process f : Since all models are wrong, but some are useful [8], it is necessary to introduce a definition of what a good model looks like in order to pick the best one out of a set of possible candidates. Or in other terms, we need to assess how faithfullyf represents the process f . Mathematically, the goal for training a modelf is to minimise a particular scoring function, sometimes improperly called "a metric". Without loss of generality, we are considering only the minimisation problem: changing the sign of a scoring function to be maximised reduce the problem to a minimisation task.
For example, a natural choice for the scoring function is the mean squared error (MSE) or variance, that is the difference between the predicted value (Ŷ ) and the observed, experimental data (Y ) [9]: It is worth noting, that this is not the only option and, within the literature, we encounter other scoring functions as the logarithmic mean squared error (logMSE): or the median absolute error [9]: Different scoring functions correspond to different modelling choices and the importance we assign to specific subsets of the database. The use of MedAE would be more appropriate to obtain a model that is robust to outliers: few poorly described experimental points will not alter significantly the performances.In the current work, we have chosen the mean standard error (or equivalently its square root, RMS) which is the default in most libraries. Given the high accuracy of measurements in nuclear physics, especially for masses as discussed here, we do not need to worry about possible outliers in our data sets and MSE represents as such a reasonable choice.
Another important aspect in building a model is the decision on the trade off required between model performances and explainability. That is, the choice between better performances with the chosen scoring and the easy interpretation of the results. Among the regressors that are usually considered to be interpretable, there are (generalised) linear regression and decision trees. However, some recent research allows to interpret even the results from traditionally difficult to interpret algorithms as neural networks or gradient boosting in terms of linear regressions [10] or simple decision trees [11].
In this guide, we chose to illustrate decision trees because they retain interpretability, they do not rely on linear assumption nor on linear independence of the features, and are not significantly affected by monotonous transformations (no input data scaling is required, nor monotonic transformations like taking the logarithm or the square of one variable). Also, decision trees are the key elements to build other important regressors like Random Forests [12] or Xgboost [13].
Last, but not less important aspect of modelling is the balance between the complexity of the chosen model and the generality of the results. As an analogy, consider the problem of approximating N experimental distinct points using a polynomial. A complex polynomial of degree N will be able to explain perfectly the data. Whenever some new data are added, the perfect description will (in general) no longer be true. A correct assessment of the performance of a regressor should be performed on unseen data, i.e. data that were not used during the train.
A common practice to estimate the performance on unseen data is the k-fold cross validation, with k ∈ N. In essence, the data are permuted and then separated in sets of size k with each subset (fold) roughly of the same cardinality. The model is then trained on k − 1 subsets and validated on the subset not used while training. As an extreme case, when k = N − 1, all data but one are used for training and the performances are assessed on only one datum. This schema is called "leave-one out validation" [14].
In the following of this guide, we will illustrate the behaviour of decision trees using some nuclear mass models. The article is organised as follows: in Sec.II, we provide an introduction to what is a decision tree providing very simple examples. In Sec.III, we introduce the nuclear models to which we will apply the decision trees. The results of our work are presented in Sec.V and we illustrate our conclusions in Sec.V.

II. DECISION TREE
With a decision tree, the function f : X → Y is approximated with a step function with n steps with Ω i ⊆ X, X ⊂ R d where d is the number of features and I(Ω i ) is the indicator function: Any measurable function can be approximated in terms of step functions [15], thus the approximation is justified as long as the function f is expected to be measurable. That is, using enough leaves we can approximate any measurable function. In order to determine the optimal values for the α i and Ω i of Eq.6, one should provide a splitting criterion; for example, being an extreme value (maximum or minimum) for a function L. Here, we decide to minimise the L 2 norm of the difference between f andf , that is: This function should be chosen to approximate the scoring function selected: then determining the extremes of L guarantee to have optimised the desired scoring function. Other possible choices are available in the literature to better take into account specific features of the data set. In the following we provide some examples of how the decision tree operates.

A. A single variable example
As an example training set, X tr , we consider the union of two sets X 1 and X 2 . X 1 contains random points uniformly distributed in [0, 0.5) and analogously X 2 with points in [0.5, 1]. Notice that X 1 ∩ X 2 = ∅. In this case, there is only one feature, so d = 1.
For the images through f of X 1 (X 2 ), here named Y 1 (Y 2 ), we use a Gaussian distribution with mean 0 (1) and a variance of 0.05(0.1). The training set is illustrated in Fig.1. All figures presented in this article have been realised with the Python [16] library matplotlib [17]. To train a decision tree means to determine the function given in Eq.6, in such a way that is minimal. The sets Ω 1 and Ω 2 are defined as Given the simplicity of the data, we limit the case to a single split. It is easy to proof that the constant better approximating in terms of L 2 norm a set of values is the average of the valuesx. The optimal value for x * (0.5) appears obvious knowing the process generating the data, but how to determine when f is not known? The answer is: through a brute force approach. All the possible values for X are sorted (say, in ascending order) and for all distinct values the function L is evaluated. The optimal value x * is then the one that optimise the splitting criterion, in this case the one that minimises L.
Since a sorting of the feature values and a single pass through all of them is required, the computational cost for estimating the optimal split for one variable is O(N log N ) with N being the number of observations.
For this particular case we obtain empirically x * = 0.493, α 0 = 0.003 and α 1 = 1.007. More details for reproducing the results are provided in the Supplementary Material. Thus the decision tree reads The decision tree is thus able to recognise the intrinsic structure of the data and divide them in two half-planes. This is represented by the solid line in Fig.1. We say that this tree has two leaves, since we have separated the data in two subgroups.

B. A two variables example
We now consider a slightly more complex problem with two variables x 1 , x 2 . The data points are generated according to: with response: Graphically, we represent this data set in Fig.2. We perform the split in two steps. Firstly, we separate the data along the x 1 direction. The algorithm provides x * = 0.482 as value where to divide the plane. By observing the data, we see that there is no gain by adding further splits in this direction. We can further refine the model by adding an additional separation along the x 2 direction. In this case we find that x * = 0.495. The result is reported in Fig.2.
The algorithm used to perform such a split can be represented as in Fig.3 using the scikit-learn package [9]. The standard output of such a package is a series of boxes counting basic information: the value of the variable at which the separation takes place, but also the mean value of the data, named value, and the variance (MSE). It also provides information concerning the amount of data grouped in each leaf. In this case the tree has a total of three leaves. As the MSE is zero (thus, minimal) on each leaf, adding extra splits to the model would not lead to any real gain in the description of the data, but it would only increase complexity.
An important quantity to analyse the model and to assess the importance of its input variables is to estimate the feature importance. This is particularly relevant for the ensemble methods that rely on decision trees as base learners. Considering Fig. 3, we want to assess the importance of the features to the building of the decision tree. Following [9], we calculate it in the following way. For each split s in the tree, we calculate the weighted impurity decrease as N is the total number of observations, N s is the number of observations at the current node, N s,L is the number of samples in the left leaf, and N s,R is the number of samples in the right leaf. I represent the impurity (in our case, MSE), with the subscripts having the same meaning as before. By inspecting Fig.3, we observe that for the first split there are in the current node (the root of the tree) as many observations as the total, that is N = N s = 100. The initial impurity (MSE) is 22.782, N s,R = N s,L = 50, right impurity is 0, left impurity 0. 25 For the second split, we get: Normalising the total weighted variation to 1, we obtain that the column x 1 has an importance equal to 99.5% for the model, while x 2 has an importance of 0.5%. If the variables entered in different splits, the relative importance would be summed. Estimating feature importance is fundamental for interpretability: by discarding irrelevant features, i.e. features that are not reducing the impurity in the tree, more parsimonious models can be trained. This is specially useful for models involving hundreds (or thousands) of features. For example, anticipating the results of the following section, we see that in Figure

C. The importance of Feature Engineering
In the previous examples, we have approximated the data using a simple step function; although this choice is mathematically justified, the problem is that the approximation may lead to a single observation per leaf, with the result that the generalisation on unseen data may be unsatisfactory. As stated previously, we aim at keeping a balance between explaining the training and predicting on unseen data.
To overcome the problem, and possibly to make the models easier to interpret, it is important to study the data and apply convenient transformations on the input variable for the model that may highlight some patterns. We consider as an example the case of a two variable data-set obtained as follow: where the response is chosen as In the left panel of Fig.4, we illustrate the data points using Cartesian coordinates. To approximate these data using a step function, we performed two cuts: one along the x 1 direction at x * = 0.742 and a second one along the x 2 direction at x * = 0.979. We have thus built a decision tree with three leaves. The result is reported in terms of solid and dashed lines on the left panel of Fig.4. By inspecting the data, we realise that by applying a unitary transformation from Cartesian to polar: we can highlight a very specific structure in the data. The result of such a transformation is illustrated in right panel of Fig.4. Using this new variables in the model obviously help the performances: using a single split,i.e. a tree with two leaves, we can achieve a better level of accuracy with respect to the previous tree, using less leaves and with features that are easier to understand. Notice that by construction, there is only radial dependence, and no dependency on the phase. Thus we obtain a more parsimonious model by applying Features Engineering.
After these examples that could be easily implemented with a few lines of code, for the next sections and for realistic problems in terms of number of features and number of possible leaves, we are going to rely on existing libraries. Among the various algorithms available in literature to train a decision tree, we focused on the CART algorithm as presented in [1,18].
The results in the next session were obtained using Python and scikit-learn, but analogous result could have been obtained using R [19] and the libraries rpart [20] for model training and rpart.plot [19] for visualisation.

D. Random forests
We conclude this section on decision trees by introducing the concept of Random Forest (RF). Following [1,12,21], we define a RF as an ensemble of decision trees. The first step for training a RF is to use bagging, also named bootstrap aggregating [22,23]. Given a training set X tr , a uniform sampling with replacement is performed, obtaining two data sets: one containing the sampled data (with repetitions), X 1 , and one containing the data that were never sampled, named out-of-the-box (OBB), X 2 . X 2 contains roughly one third of the initial observations.
In fact, given a number N of observation, assuming no repetitions in the data, the probability for a datum of not being extracted at the i-th draw is simply p i = 1 − 1 N . Thus the probability of not being extracted after N draws, having replacement and assuming all the draws to be independent, is: The second step for training a RF consists in selecting a random subset of the features of X 1 . The number of features used is an adjustable parameter of the algorithm. A decision tree is built on X 1 using only the selected features, and the performances are estimated on X 2 . Repeating the bagging and the tree training for a number T of trees (another adjustable parameter) and averaging the response Y over the ensemble of predictions, a Random Forest is obtained.
The bagging and the random selection of features are tools to inject noise in the training data. The noise can be reduced by averaging on the response of each tree, and this empirically improves the performance of the regressor [12,22]. For the theoretical reasons for which the performances are better after injecting some noise in the system, the reference is [12].
Intuitively, the trees in the forest should not all provide the same response, otherwise averaging on all the trees would be of no benefit. Consider for example the dataset X of Eq. 11: if all the trees are trained on the same data with the same input features, then they will all provide the same output, and the random forest will be equivalent to a single decision tree.
A reason that made Random Forests very popular is that they can be trained on data sets with more features than observations without prior feature selection, a characteristics that made then a relevant tool, for example, for gene expression problems in Bioinformatics. For more details, see [24]).

III. NUCLEAR MASS MODELS
Before applying the decision tree to a more realistic case, we now introduce the nuclear models we are going to study. According to the most recent nuclear mass table [25], more than 2000 nuclei have been observed experimentally. To interpret such an amount of data, several mass models have been developed over the years [26] with various levels of accuracy. For the purpose of the current guide, we selected two simple models: the Bethe-Weizsäcker mass formula [27] based on liquid drop (LD) approximation and the Duflo-Zucker [28]. The reason of this choice is twofold: the models contain quite different physical knowledge about the data, for example the lack of shell effects in LD formula, but they are relatively simple and non CPU intensive, thus giving us the opportunity to focus more on the statistical aspects of the current guide.

A. Liquid drop
Within the liquid drop model, the binding energy (B) of a nucleus is calculated as a sum of five terms as a function of the total number of neutrons (N ) and protons (Z) where A = N + Z. The set of optimal parameters have been tuned in Ref. [3]. These parameters are named volume (a v ), surface (a s ), Coulomb (a c ), asymmetry (a a ) and pairing term (δ) and they refer to specific physical properties of the underlying nuclear system [27].

B. Duflo-Zucker
The Duflo-Zucker [28] is a macroscopic mass model based on a generalised LD plus the shell-model monopole Hamiltonian and it is used to obtain the binding energies of nuclei along the whole nuclear chart with quite remarkable accuracy. The nuclear binding energy for a given nucleus is written as a sum of ten terms as We defined 2T = |N − Z| and ρ = A 1/3 1 − 1 . The ten different contributions can be grouped in two categories: in the first one we find terms similar to LD model as Coulomb (V C ), symmetry energy (V T , V T S ) and pairing V P . The other parameters originate from the averaging of shell-model Hamiltonian. See [29] for more details. The model described in Eq.20 is usually referred as DZ10 and its parameters have been recently tuned in [30]. Within the literature, it is also possible to find other versions with extra parameters [31], but we will not consider them here for sake of simplicity. In Fig.5, we illustrate the behaviour of the residuals E(N, Z) obtained with the two mass models. We assume that nuclear data [25] have negligible experimental error compared to the model, and we discard all data having an uncertainty larger than 100 keV. This is a reasonable assumption to be made, since the typical discrepancy between models and data is usually one or two order of magnitude larger than the experimental errors [26]. See discussion in [3] for more details.
In each panel of Fig.5, we also provide the root mean square (RMS) deviation σ. We thus see that DZ10 is roughly one order of magnitude more accurate in reproducing data than the simple LD. The detailed analysis of these residuals have been already performed in [3,30] showing that they do not have the form of a simple white noise, but they contain a degree of correlation.

IV. RESULTS
We apply the decision tree to the case of nuclear data. Using the same notation used in the previous example, the input X is now a matrix with 3 columns N, Z, A while the response Y is the residual, i. e. the difference between the empirical data and the models (LD, DZ10) predictions.
As specified before, the goal is to minimise the RMS on unseen data, or in other terms learning without overfitting. While it appears obvious that a tree with only one leaf, which means replacing all the values of Y with the average Y , or with as many leaves as there are observations are not very useful, determining the optimal value for the number of leaves is not straightforward. The approach is empirical: experimenting with a reasonable set of values for the number of leaves, and pick the best results according to the cross-validation. The adjustable parameters of the model are named in Machine Learning literature hyper-parameters. With only one adjustable hyper-parameter like the number of leaves, exploring the parameter space is straightforward: all the possible values are tested, with a cost of M cross-validated models, where M is the number of possible values for the number of leaves. In our example, exploring between trees with only two leaves and trees with 500, that is cross-validating for M = 499 models.
On the other hand, with regressors with many adjustable parameters, as for example Xgboost [13], exploring the hyperparameter space is more challenging. Imagine to have 10 hyperparameters, exploring M values for each of them means exploring a grid with 10 M points. In this case, it is better to use dedicated libraries [32].
As a first application, we train a simple decision tree over the residuals of the LD model as shown in the left panel of Fig.5. We find that the optimal number of leaves is four. The structure of the tree is reported in Fig.6. By inspecting the splits of the data, we notice that the main feature of the data is associated with the neutron number N . The tree splits the nuclei in super-heavy (A > 219) and non-super-heavy. Then it further splits in very neutron rich and not. Finally the tree separates out the remaining nuclei in two groups: light and heavy. Having the optimal tree, we now translate it into a simple code and calculate the nuclear binding energies as where B tree represents the binding energy calculated with the decision tree. By now comparing the predicted masses obtained with Eq.21 with the experimental ones, we obtain an RMS of σ = 2.467 MeV. This is what is called training error, that is the RMS between the response and the predictions of the model trained on all data. A more conservative estimation that should be preferred is the validation error on unseen data, is the RMS estimated on data that were not used during the training. In this case, σ v = 2.925 MeV. It is possible to further improve on this result, by using Feature Engineering as discussed previously. To this respect, we provide some additional information to the tree, i.e. N − Z, N/Z, Z/A and Z/N . These structures are already used to build the LD model, Eq.19, and as such we help the decision tree to identify new patterns in the data.
In Fig.7, we report the structure of this new tree. The optimal number of leaves is 9. By implementing this tree into a simple numerical code and applying it to the LD residuals we obtain a slight improvement. The total RMS over the entire nuclear chart now falls to σ = 2.069 MeV (on unseen data, σ v = 2.881 MeV).
Although the decision tree is not so performing in term of RMS as a more complex neural network [33], we can still use it to identify possible trends in the data set. In Fig.8, we illustrate the relative importance of the features of the LD model, calculated using Eq.13. We see that the proton fraction Z/A is more important than the individual number of neutrons and protons. It is interesting to note that the decision tree is not affected by even/odd nuclei. This may imply that either the simple pairing term in Eq.19 is enough to grasp the odd-even staggering, or the granularity required to the tree is too high, leading to a number of leaves comparable with the number of data point, or other features can surrogate the odd-even features. We did not investigated more in depth this aspect, we leave it for a future analysis.
Finally, in Fig.9, we present a graphical illustration of the energy corrections found by the decision tree for the different nuclei along the nuclear chart. This figure is just an alternative way to represent the various leaves respect of what has been shown in Fig.7. We observe that we have 6 major splits along the valley of stability where we have light, medium-heavy and heavy nuclei. The later are then still separated in 4 smaller groups. The other cuts occur along the region of proton rich and neutron rich, thus the edges of the chart. From this general overview, we may conclude that the tree finds that the residual of the LD model are quite homogeneous (only two separations) along the valley of stability up to medium-heavy nuclei. Outside this range the splits increase since the tree identifies a larger variation in the data. This may imply some missing physics in the model for these particular region of the chart.
Having seen how the decision tree works for a schematic model, we now apply it to the more sophisticated DZ10. We adopt the same features as shown in Fig.8 to obtain the best performances. For the DZ10 model, the optimal tree has now 11 leaves and it is illustrated in Fig.10. The optimal tree has 11 leaves. In Fig.11, we illustrate the importance of the features used to build such a tree. It is interesting to observe that the most important feature is the charge dependence and the isovector dependence of the model (N − Z). By comparing with Fig.8, we observe that the relative importance of the features strongly depends on the model. We could further simplify the tree by reducing the features used or exploring new ones. This investigation goes beyond the scope of the present guide, since we are only interested to illustrate how the algorithm works.
We implement such a tree within a numerical code and we calculate the new binding energies as  in [30] using a more complex neural network. It is worth noticing that the final model given in Eq.22 is fully specified by 43 parameters (the 10 original parameters from the DZ10 models, the 7 pairs describing the variable and the value to split on, the 9 values of the response on each leaves). This number is compatible with most of nuclear mass models [34][35][36][37] having similar performances to Eq.22.
As done previously for the LD model, we represent in Fig.12 the splits of the tree given in Fig.10. We see that the most important cuts take place along Z. This was also the most important feature of the model as shown in Fig.11. Interestingly, the decision tree has identified a large area in the residuals corresponding to the medium-heavy neutron rich nuclei for which the correction is very small. On the contrary the same mass range, but on proton rich side requires a much more important energy correction. This may be the symptom of a poor treatment of the isovector channel in the model.
In Fig.13, we represent the comparison between the original residuals obtained with DZ10 model and the improved one using the decision tree. The histogram have been normalised. We see that the new residuals are now more clustered around the mean value, although we see that there are still some heavy tails that we have not been able to eliminate. We have checked the normality of the residual using the standard Kolmogorov Smirnov test [38] and we can say that the residuals are not normally distributed with a 95% confidence, thus showing there is still some signal left in the data that we have not been able to grasp.
We conclude this section by summarising the impact of decision trees on the residuals of the various mass models and different features used in the calculations. The results are reported in Tab.I. We observe that using feature engineering, we have been able to reduce the RMS of the LD model of ≈30%. Adopting the same features for the DZ10 model, we have improved the global RMS of ≈ 18%.

V. CONCLUSION
In this guide, we have illustrated a well-known decision tree algorithm by providing very simple and intuitive examples. We have also shown the importance of analysing the data to improve the performances of the method.
We have then applied the decision tree to the case of two well known nuclear mass models: liquid drop and Duflo-Zucker. In both cases, using a small number of leaves (9 and 11 respectively), we have been able to improve the global RMS of the models.
In the case of DZ10 model, the decision tree has been able to improve the global description of nuclear masses at a value comparable with the one obtained with more complex models as neural networks [30,39,40], but at a lower computational cost and lower complexity. A very important advantage of decision tree compared to other algorithms is that the results can be easily implemented into numerical codes without relying on external libraries to estimate the residuals.