Interpretable semi-parametric regression models with defined error bounds☆,☆☆
Introduction
Consider the problem of learning a regression function from n training examples (x1, y1), …, (xn, yn). Deploying such a data-driven model in safety-related applications requires ensuring that the model is correct for all possible inputs.
In practice training data are almost always limited and may not represent all relevant operating conditions. Thus it is crucial to understand what the model has learned and in particular how it will extrapolate to unseen data. An example is shown on the left side of Fig. 1, where a multi-layer perceptron (MLP) has been trained on 100 training samples generated by the sum of two sine functions without noise. The model has perfectly learned to reproduce the true function on the training interval. Beyond that interval the desired model behavior is not specified, leading to an arbitrary extrapolation. Note that there is no over-fitting issue here, as the training data are noiseless and Bayesian regularization was used to avoid unnecessary model complexity. The key point is that the model is tested in a range without training data.
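The setup behind Fig. 1 can be sketched as follows; this is a minimal reconstruction, assuming the target sin(2x) + sin(3x) and a small network trained without regularization, so the exact function, architecture and training procedure of the paper may differ:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 4, 100)
y_train = np.sin(2 * x_train) + np.sin(3 * x_train)  # noiseless sum of two sines

mlp = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000, random_state=0)
mlp.fit(x_train[:, None], y_train)

x_in = np.linspace(0, 4, 50)    # inside the training interval: fit is good
x_out = np.linspace(6, 10, 50)  # no training data here: behavior is unspecified
y_in = mlp.predict(x_in[:, None])
y_out = mlp.predict(x_out[:, None])
```

Plotting y_out against the true function makes the arbitrary extrapolation visible; the predictions are well-defined numbers, just unrelated to the target.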
The problem of unspecified extrapolation is particularly difficult to detect in higher-dimensional input spaces. There are two approaches for solving it: estimating the uncertainty of the model output and making the model as interpretable as possible.
A common way of estimating the model uncertainty is to use Bayesian model averaging. Here the model consists of an ensemble of individual models whose outputs are averaged and the variance σ2 of the outputs is used to quantify the uncertainty, see [2, p. 288]. The right side of Fig. 1 shows the interval around the mean output of an ensemble of 20 MLPs. The high variance beyond the training interval clearly indicates the random extrapolation behavior of the individual models. The drawback of this approach is that the models may not be truly independent, leading to a biased and thus unreliable estimate of the uncertainty. For example the extrapolation behavior of all models may be biased in a similar way by some outliers or by a particular choice of network design. Drawing individual training sets by resampling and using different random initializations of MLP weights improves the independence of the models, but still there is no guarantee that the uncertainty estimate is correct.
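A minimal sketch of this ensemble idea, using bootstrap resampling together with different weight initializations (ensemble size and architecture are illustrative, smaller than the 20 MLPs of Fig. 1):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 100)
y = np.sin(2 * x) + np.sin(3 * x)

x_test = np.linspace(-2, 8, 100)
preds = []
for seed in range(5):                       # small ensemble for illustration
    idx = rng.integers(0, len(x), len(x))   # bootstrap resample of the training set
    m = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
    m.fit(x[idx][:, None], y[idx])
    preds.append(m.predict(x_test[:, None]))

preds = np.array(preds)
mean, std = preds.mean(axis=0), preds.std(axis=0)  # std serves as uncertainty proxy
```

The standard deviation is typically small inside the training interval and grows where the individual models extrapolate differently, but, as argued above, it remains an estimate without guarantees.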
A further rather pragmatic approach is to combine the regression model with a density estimator which decides to discard the model output if the density of training inputs around the test input is below some threshold. The decision is only based on the input density and does not consider the variation in the output space. The latter is achieved by Gaussian Processes.
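A sketch of such a density gate, here with a Gaussian kernel density estimate over the training inputs; the threshold value is an arbitrary assumption that would have to be calibrated for a real application:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 4, 200)
kde = gaussian_kde(x_train)        # density model of the training inputs only

x_test = np.array([2.0, 9.0])      # inside / far outside the training range
density = kde(x_test)
threshold = 0.01                   # assumed cutoff; application-dependent
accept = density >= threshold      # discard the regression output otherwise
```

Note that the gate looks only at the inputs; two regions with equal input density but very different output variability are treated identically, which is the limitation addressed by Gaussian Processes.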
A Gaussian Process (GP) describes a distribution over functions given the data. If we ask a GP to predict the output of a test input x*, it returns a normal distribution N(μ*, σ*2), where the variance σ*2 quantifies the uncertainty. The resulting distribution depends on hyper-parameters, which have to be optimized based on prior information and the data. An example is shown in Fig. 2, where 50 training samples were generated by the same function as in Fig. 1. If we expect a high noise level, we choose a higher initial value for the hyper-parameter σn defining the noise level. Then optimizing the hyper-parameters on the data and predicting the test outputs provides the left result. Here the predicted variance is high in general, possibly masking the extrapolation range. As the training data were generated without noise, the correct answer would be the result shown on the right. While both results are consistent with intuition, it can be difficult in practice to choose appropriate hyper-parameters.
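The following numpy sketch shows GP prediction with a squared exponential kernel and a fixed, assumed noise hyper-parameter σn (no hyper-parameter optimization; length-scale and signal variance are illustrative). Far from the training data the predictive variance rises to the prior variance, signaling extrapolation:

```python
import numpy as np

def se_kernel(a, b, ell=0.5, sf=1.0):
    """Squared exponential kernel between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 50)
y = np.sin(2 * x) + np.sin(3 * x)   # noise-free data, as in Fig. 1/2

sn = 1e-3                           # assumed (low) noise level sigma_n
K = se_kernel(x, x) + sn**2 * np.eye(len(x))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

xs = np.array([2.0, 10.0])          # inside / far outside the training range
Ks = se_kernel(x, xs)
mu = Ks.T @ alpha                                   # predictive mean
v = np.linalg.solve(L, Ks)
var = np.diag(se_kernel(xs, xs)) - np.sum(v**2, axis=0)  # predictive variance
```

Here var is near sn2 at x = 2.0 and near the prior variance sf2 = 1 at x = 10.0; with a large assumed sn, the variance would be high everywhere, which is exactly the masking effect discussed above.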
The bottom line is that the predicted variance provides valuable information, but – as it is just an estimate – it may be necessary to combine or replace it with more deterministic methods, depending on safety requirements of the application. A possible approach is described in Section 2.
Traditionally approaches like CART [3] or rule-based methods are considered as being interpretable. Yet this is only true as long as the number of nodes or rules is rather small, which means that the model is quite coarse. A possible compromise is to partition the input space similar to CART or rule learners but to use more complex submodels in the different partitions of the input space. MARS (multivariate adaptive regression splines) [4] and an approach called GUIDE (generalized unbiased interaction detection and estimation) [5] follow this strategy; both are briefly compared in an experiment in this paper.
In statistics additive models [2] are often applied if the model shall be interpretable. The idea is to learn a multivariate function having p inputs as a sum of p univariate functions, that is, y = α + f1(x1) + … + fp(xp) + ϵ, with constant α and noise random variable ϵ. The advantage is that the univariate functions may be easier to interpret. However interactions between variables cannot be modeled.
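A minimal backfitting sketch of such an additive model, using per-input cubic polynomial fits as the univariate smoothers (the target function and the choice of smoother are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] ** 3 + X[:, 1] ** 2        # truly additive target, no interactions

alpha = y.mean()                       # the constant term of the additive model
f = np.zeros_like(X)                   # fitted univariate contributions f_j
for _ in range(10):                    # backfitting iterations
    for j in range(2):
        r = y - alpha - f[:, 1 - j]            # partial residual w.r.t. input j
        c = np.polyfit(X[:, j], r, 3)          # cubic smoother for f_j
        f[:, j] = np.polyval(c, X[:, j])
        f[:, j] -= f[:, j].mean()              # keep components centered

pred = alpha + f.sum(axis=1)
```

Each f_j can be plotted against its input, which is what makes the model interpretable; an interaction term such as x1·x2 would, however, be invisible to this model class.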
In [6] an additive model of a Gaussian Process is introduced, which allows modeling interactions between variables up to a specified degree R. As in MARS, interactions are modeled by products of univariate functions. Hyper-parameters specify which degrees of interaction are important. Optimizing these hyper-parameters by gradient descent automatically reveals which additive structures exist in the data. So the resulting model has increased interpretability as long as the maximal degree of interactions R is kept small. The computational complexity of evaluating the Gram matrix of an additive kernel is R times higher than for a squared exponential kernel, which can be problematic for large R. Note that the additive kernels can describe complex, non-local structures, which allows extrapolation in distant parts of the input space. This is advantageous if such structures do exist in the data. Otherwise it will lead to an undesirable extrapolation like the one in Fig. 1.
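The additive kernels of [6] combine univariate base kernels through elementary symmetric polynomials, which can be computed with the Newton–Girard recursion. A sketch, weighting each interaction order equally (whereas [6] learns these weights as hyper-parameters):

```python
import numpy as np

def se1(a, b, ell=1.0):
    """Univariate squared exponential base kernel."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def additive_kernel(X, Z, R=2):
    """Sum of additive kernel components of orders 1..R (equal weights)."""
    p = X.shape[1]
    ks = [se1(X[:, j], Z[:, j]) for j in range(p)]   # univariate Gram matrices
    # Power sums p_k and elementary symmetric polynomials e_r, elementwise
    # over the Gram matrices, via the Newton-Girard recursion:
    #   e_r = (1/r) * sum_{k=1..r} (-1)^(k-1) * e_{r-k} * p_k
    P = [sum(k ** m for k in ks) for m in range(1, R + 1)]
    E = [np.ones_like(ks[0])]
    for r in range(1, R + 1):
        e = sum((-1) ** (k - 1) * E[r - k] * P[k - 1] for k in range(1, r + 1)) / r
        E.append(e)
    return sum(E[1:])

X = np.random.default_rng(2).normal(size=(5, 3))
K = additive_kernel(X, X, R=2)   # order-1 (univariate) + order-2 (pairwise) terms
```

Since sums and elementwise (Schur) products of valid kernels are valid kernels, each order e_r, and hence K, remains positive semi-definite.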
In this paper a different strategy is proposed, similar to an additive model in the sense that two functions are added in the final model. The first function is an analytical model learned by symbolic regression. Symbolic regression [7] searches for a symbolic representation (i.e. an equation) that best matches the given data. The learned analytical model is easily interpretable but has only moderate accuracy. Thus a GP is learned on the residuals of the analytical model in order to improve the accuracy. The GP is not interpretable. But by limiting its output to a defined range a worst-case guarantee can be given in the sense that the maximal deviation from the analytical model is always below a defined limit. Unlike the additive kernels in [6], the GP here uses a squared exponential kernel, describing a local similarity measure. Although a local kernel requires more samples to learn a highly varying function [8], it ensures that the GP only contributes where training data actually exist.
The idea of combining analytical models with data-driven models can be traced back to hybrid neural networks considered mainly in the 1990s, e.g. [9], [10], [11]. However there are two key differences. First, here the analytical model is not considered as being given but is learned from data by symbolic regression. Second, Gaussian Processes are used as residual models, facilitating the model selection as there is no need to empirically decide on appropriate neural network architectures because the model is mainly given by the data points themselves.
The combined model can be considered as a semi-parametric model and is similar to semi-parametric GPs discussed in [12, p. 28]. Unlike in semi-parametric GPs, here the parametric model is trained separately by symbolic regression, which allows a larger variety of parametric (symbolic) representations instead of using a predefined set of fixed basis functions. A further difference is that here the contribution of the residual model is limited for being able to provide deterministic error bounds.
The next section describes the approach. Section 3 applies it to the SARCOS benchmark and shows that it achieves better accuracy and interpretability than competing methods. Section 4 concludes.
Safe and interpretable regression
Consider the semi-parametric model f(x) = a(x) + ℓ(r(x)), where a is an analytical (that is, fully interpretable and verified) model, r is a residual model and ℓ limits the contribution of r. In this paper a is learned using the Eureqa tool for symbolic regression, where the training data are used to search for an explicit symbolic representation based on genetic programming [7]. Alternatively, any method providing a fully interpretable model may be used.
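A minimal numpy sketch of the combined model, with clipping as the limitation ℓ; the analytical model a, the bound b, and the kernel length-scale below are illustrative stand-ins, not the models learned in the paper:

```python
import numpy as np

def a(x):
    """Hypothetical analytical model from symbolic regression (assumed form)."""
    return np.sin(2 * x)

def se_kernel(p, q, ell=0.3):
    return np.exp(-0.5 * ((p[:, None] - q[None, :]) / ell) ** 2)

rng = np.random.default_rng(3)
x = rng.uniform(0, 4, 60)
y = np.sin(2 * x) + np.sin(3 * x)   # true function; a() captures only part of it
res = y - a(x)                      # residuals to be modeled by the GP

K = se_kernel(x, x) + 1e-6 * np.eye(len(x))
w = np.linalg.solve(K, res)         # GP posterior mean weights

b = 1.5                             # deterministic bound on the GP contribution
def f(xq):
    r = se_kernel(xq, x) @ w        # GP posterior mean on the residuals
    return a(xq) + np.clip(r, -b, b)   # |f - a| <= b holds by construction

xs = np.linspace(-2, 6, 200)        # includes regions without training data
```

Because the squared exponential kernel is local, r decays to zero away from the training inputs, so f falls back to the verified analytical model a there; the clip additionally guarantees |f(x) − a(x)| ≤ b everywhere, regardless of what the GP has learned.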
Experiment: SARCOS benchmark
The objective of the experiment is to compare different regression approaches in terms of their accuracy and interpretability. We consider learning the inverse dynamics of a seven degrees-of-freedom SARCOS robot arm. The task is to map from a 21-dimensional input space to a joint torque, that is, we learn a function f: R21 → R. There are 44,484 training examples and 4449 test examples. All 21 inputs xi have been
Conclusions
The introductory example (Fig. 1) demonstrated the problem of unexpected extrapolation behavior even in rather close vicinity to the training set. Non-local models like MLP are particularly affected. Local models can mitigate the problem but may not always solve it.
Clemens Otte holds a Diploma degree and a Ph.D. in Computer Science from the Technical University of Braunschweig and the University of Oldenburg, Germany. He is working as a Senior Key Expert Engineer for Siemens Corporate Technology. His research interests include machine learning with focus on interpretable and verifiable models, advanced data analysis, pattern recognition and probabilistic methods.
References (14)
- C. Otte, Learning regression models with guaranteed error bounds, in: ESANN 2013, 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013.
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.
- L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Statistics/Probability Series, 1984.
- J.H. Friedman, Multivariate adaptive regression splines, Ann. Stat., 1991.
- W.-Y. Loh, Regression by parts: fitting visually interpretable models with GUIDE.
- D. Duvenaud, H. Nickisch, C.E. Rasmussen, Additive Gaussian processes, in: Advances in Neural Information Processing Systems, 2011.
- M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science, 2009.
- ☆
Work partially funded by the German Federal Ministry of Education and Research (BMBF), grant ALICE 01, IB10003 A-C. This paper is an extended version of [1].
- ☆☆
The above-referenced article is part of the special issue “ESANN-2013” published in issue 141. The publisher regrets that this article was accidentally omitted from the special issue published earlier. The publisher sincerely apologizes to the authors, guest editors and readers for any inconvenience this error may have caused.