Interpretable semi-parametric regression models with defined error bounds☆,☆☆
Introduction
Consider the problem of learning a regression function from n training examples (x1, y1), …, (xn, yn). Deploying such a data-driven model in safety-related applications requires ensuring that the model is correct for all possible inputs.
In practice training data are almost always limited and may not represent all relevant operating conditions. Thus it is crucial to understand what the model has learned and in particular how it will extrapolate to unseen data. An example is shown on the left side of Fig. 1, where a multi-layer perceptron (MLP) has been trained on 100 training samples generated by the sum of two sine functions without noise. The model has perfectly learned to reproduce the true function on the training interval. Beyond that interval the desired model behavior is not specified, leading to an arbitrary extrapolation. Note that there is no over-fitting issue here, as the training data are noiseless and Bayesian regularization was used to avoid unnecessary model complexity. The key point is that the model is tested in a range without training data.
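The setup behind Fig. 1 can be sketched as follows; this is a minimal reconstruction, assuming the target sin(2x) + sin(3x) and a small network trained without regularization, so the exact function, architecture and training procedure of the paper may differ:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 4, 100)
y_train = np.sin(2 * x_train) + np.sin(3 * x_train)  # noiseless sum of two sines

mlp = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000, random_state=0)
mlp.fit(x_train[:, None], y_train)

x_in = np.linspace(0, 4, 50)    # inside the training interval: fit is good
x_out = np.linspace(6, 10, 50)  # no training data here: behavior is unspecified
y_in = mlp.predict(x_in[:, None])
y_out = mlp.predict(x_out[:, None])
```

Plotting y_out against the true function makes the arbitrary extrapolation visible; the predictions are well-defined numbers, just unrelated to the target.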
The problem of unspecified extrapolation is particularly difficult to detect in higher-dimensional input spaces. There are two approaches for solving it: estimating the uncertainty of the model output and making the model as interpretable as possible.
A common way of estimating the model uncertainty is to use Bayesian model averaging. Here the model consists of an ensemble of individual models whose outputs are averaged and the variance σ2 of the outputs is used to quantify the uncertainty, see [2, p. 288]. The right side of Fig. 1 shows the interval around the mean output of an ensemble of 20 MLPs. The high variance beyond the training interval clearly indicates the random extrapolation behavior of the individual models. The drawback of this approach is that the models may not be truly independent, leading to a biased and thus unreliable estimate of the uncertainty. For example the extrapolation behavior of all models may be biased in a similar way by some outliers or by a particular choice of network design. Drawing individual training sets by resampling and using different random initializations of MLP weights improves the independence of the models, but still there is no guarantee that the uncertainty estimate is correct.
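A minimal sketch of this ensemble idea, using bootstrap resampling together with different weight initializations (ensemble size and architecture are illustrative, smaller than the 20 MLPs of Fig. 1):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 100)
y = np.sin(2 * x) + np.sin(3 * x)

x_test = np.linspace(-2, 8, 100)
preds = []
for seed in range(5):                       # small ensemble for illustration
    idx = rng.integers(0, len(x), len(x))   # bootstrap resample of the training set
    m = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
    m.fit(x[idx][:, None], y[idx])
    preds.append(m.predict(x_test[:, None]))

preds = np.array(preds)
mean, std = preds.mean(axis=0), preds.std(axis=0)  # std serves as uncertainty proxy
```

The standard deviation is typically small inside the training interval and grows where the individual models extrapolate differently, but, as argued above, it remains an estimate without guarantees.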
A further rather pragmatic approach is to combine the regression model with a density estimator which decides to discard the model output if the density of training inputs around the test input is below some threshold. The decision is only based on the input density and does not consider the variation in the output space. The latter is achieved by Gaussian Processes.
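A sketch of such a density gate, here with a Gaussian kernel density estimate over the training inputs; the threshold value is an arbitrary assumption that would have to be calibrated for a real application:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 4, 200)
kde = gaussian_kde(x_train)        # density model of the training inputs only

x_test = np.array([2.0, 9.0])      # inside / far outside the training range
density = kde(x_test)
threshold = 0.01                   # assumed cutoff; application-dependent
accept = density >= threshold      # discard the regression output otherwise
```

Note that the gate looks only at the inputs; two regions with equal input density but very different output variability are treated identically, which is the limitation addressed by Gaussian Processes.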
A Gaussian Process (GP) describes a distribution over functions given the data. If we ask a GP to predict the output of a test input x*, it returns a normal distribution N(μ*, σ*2), where the variance σ*2 quantifies the uncertainty. The resulting distribution depends on hyper-parameters, which have to be optimized based on prior information and the data. An example is shown in Fig. 2, where 50 training samples were generated by the same function as in Fig. 1. If we expect a high noise level, we choose a higher initial value for the hyper-parameter σn defining the noise level. Then optimizing the hyper-parameters on the data and predicting the test outputs provides the left result. Here the predicted variance is high in general, possibly masking the extrapolation range. As the training data were generated without noise, the correct answer would be the result shown on the right. While both results are consistent with intuition, it can be difficult in practice to choose appropriate hyper-parameters.
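The following numpy sketch shows GP prediction with a squared exponential kernel and a fixed, assumed noise hyper-parameter σn (no hyper-parameter optimization; length-scale and signal variance are illustrative). Far from the training data the predictive variance rises to the prior variance, signaling extrapolation:

```python
import numpy as np

def se_kernel(a, b, ell=0.5, sf=1.0):
    """Squared exponential kernel between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 50)
y = np.sin(2 * x) + np.sin(3 * x)   # noise-free data, as in Fig. 1/2

sn = 1e-3                           # assumed (low) noise level sigma_n
K = se_kernel(x, x) + sn**2 * np.eye(len(x))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

xs = np.array([2.0, 10.0])          # inside / far outside the training range
Ks = se_kernel(x, xs)
mu = Ks.T @ alpha                                   # predictive mean
v = np.linalg.solve(L, Ks)
var = np.diag(se_kernel(xs, xs)) - np.sum(v**2, axis=0)  # predictive variance
```

Here var is near sn2 at x = 2.0 and near the prior variance sf2 = 1 at x = 10.0; with a large assumed sn, the variance would be high everywhere, which is exactly the masking effect discussed above.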
The bottom line is that the predicted variance provides valuable information, but – as it is just an estimate – it may be necessary to combine or replace it with more deterministic methods, depending on safety requirements of the application. A possible approach is described in Section 2.
Traditionally approaches like CART [3] or rule-based methods are considered as being interpretable. Yet this is only true as long as the number of nodes or rules is rather small, which means that the model is quite coarse. A possible compromise is to partition the input space similar to CART or rule learners but to use more complex submodels in the different partitions of the input space. MARS (multivariate adaptive regression splines) [4] and an approach called GUIDE (generalized unbiased interaction detection and estimation) [5] follow this strategy; both are briefly compared in an experiment in this paper.
In statistics additive models [2] are often applied if the model shall be interpretable. The idea is to learn a multivariate function having p inputs as a sum of p univariate functions, that is, y = α + f1(x1) + … + fp(xp) + ϵ, with constant α and noise random variable ϵ. The advantage is that the univariate functions may be easier to interpret. However interactions between variables cannot be modeled.
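A minimal backfitting sketch of such an additive model, using per-input cubic polynomial fits as the univariate smoothers (the target function and the choice of smoother are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] ** 3 + X[:, 1] ** 2        # truly additive target, no interactions

alpha = y.mean()                       # the constant term of the additive model
f = np.zeros_like(X)                   # fitted univariate contributions f_j
for _ in range(10):                    # backfitting iterations
    for j in range(2):
        r = y - alpha - f[:, 1 - j]            # partial residual w.r.t. input j
        c = np.polyfit(X[:, j], r, 3)          # cubic smoother for f_j
        f[:, j] = np.polyval(c, X[:, j])
        f[:, j] -= f[:, j].mean()              # keep components centered

pred = alpha + f.sum(axis=1)
```

Each f_j can be plotted against its input, which is what makes the model interpretable; an interaction term such as x1·x2 would, however, be invisible to this model class.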
In [6] an additive model of a Gaussian Process is introduced, which allows modeling interactions between variables up to a specified degree R. As in MARS, interactions are modeled by products of univariate functions. Hyper-parameters specify which degrees of interaction are important. Optimizing these hyper-parameters by gradient descent automatically reveals which additive structures exist in the data. So the resulting model has increased interpretability as long as the maximal degree of interactions R is kept small. The computational complexity of evaluating the Gram matrix of an additive kernel is R times higher than for a squared exponential kernel, which can be problematic for large R. Note that the additive kernels can describe complex, non-local structures, which allows extrapolation in distant parts of the input space. This is advantageous if such structures do exist in the data. Otherwise it will lead to an undesirable extrapolation like the one in Fig. 1.
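The additive kernels of [6] combine univariate base kernels through elementary symmetric polynomials, which can be computed with the Newton–Girard recursion. A sketch, weighting each interaction order equally (whereas [6] learns these weights as hyper-parameters):

```python
import numpy as np

def se1(a, b, ell=1.0):
    """Univariate squared exponential base kernel."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def additive_kernel(X, Z, R=2):
    """Sum of additive kernel components of orders 1..R (equal weights)."""
    p = X.shape[1]
    ks = [se1(X[:, j], Z[:, j]) for j in range(p)]   # univariate Gram matrices
    # Power sums p_k and elementary symmetric polynomials e_r, elementwise
    # over the Gram matrices, via the Newton-Girard recursion:
    #   e_r = (1/r) * sum_{k=1..r} (-1)^(k-1) * e_{r-k} * p_k
    P = [sum(k ** m for k in ks) for m in range(1, R + 1)]
    E = [np.ones_like(ks[0])]
    for r in range(1, R + 1):
        e = sum((-1) ** (k - 1) * E[r - k] * P[k - 1] for k in range(1, r + 1)) / r
        E.append(e)
    return sum(E[1:])

X = np.random.default_rng(2).normal(size=(5, 3))
K = additive_kernel(X, X, R=2)   # order-1 (univariate) + order-2 (pairwise) terms
```

Since sums and elementwise (Schur) products of valid kernels are valid kernels, each order e_r, and hence K, remains positive semi-definite.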
In this paper a different strategy is proposed, similar to an additive model in the sense that two functions are added in the final model. The first function is an analytical model learned by symbolic regression. Symbolic regression [7] searches for a symbolic representation (i.e. an equation) that best matches the given data. The learned analytical model is easily interpretable but has only moderate accuracy. Thus a GP is learned on the residuals of the analytical model in order to improve the accuracy. The GP is not interpretable. But by limiting its output to a defined range a worst-case guarantee can be given in the sense that the maximal deviation from the analytical model is always below a defined limit. Unlike the additive kernels in [6], the GP here uses a squared exponential kernel, describing a local similarity measure. Although a local kernel requires more samples to learn a highly varying function [8], it ensures that the GP only contributes where training data actually exist.
The idea of combining analytical models with data-driven models can be traced back to hybrid neural networks considered mainly in the 1990s, e.g. [9], [10], [11]. However there are two key differences. First, here the analytical model is not considered as being given but is learned from data by symbolic regression. Second, Gaussian Processes are used as residual models, facilitating the model selection as there is no need to empirically decide on appropriate neural network architectures because the model is mainly given by the data points themselves.
The combined model can be considered as a semi-parametric model and is similar to semi-parametric GPs discussed in [12, p. 28]. Unlike in semi-parametric GPs, here the parametric model is trained separately by symbolic regression, which allows a larger variety of parametric (symbolic) representations instead of using a predefined set of fixed basis functions. A further difference is that here the contribution of the residual model is limited for being able to provide deterministic error bounds.
The next section describes the approach. Section 3 applies it to the SARCOS benchmark and shows that it achieves better accuracy and interpretability than competing methods. Section 4 concludes.
Safe and interpretable regression
Consider the semi-parametric model f(x) = a(x) + ℓ(r(x)), where a is an analytical (that is, fully interpretable and verified) model, r is a residual model and ℓ limits the contribution of r. In this paper a is learned using the Eureqa tool for symbolic regression, where the training data are used to search for an explicit symbolic representation based on genetic programming [7]. Alternatively, any method providing a fully interpretable model may be used.
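A minimal numpy sketch of the combined model, with clipping as the limitation ℓ; the analytical model a, the bound b, and the kernel length-scale below are illustrative stand-ins, not the models learned in the paper:

```python
import numpy as np

def a(x):
    """Hypothetical analytical model from symbolic regression (assumed form)."""
    return np.sin(2 * x)

def se_kernel(p, q, ell=0.3):
    return np.exp(-0.5 * ((p[:, None] - q[None, :]) / ell) ** 2)

rng = np.random.default_rng(3)
x = rng.uniform(0, 4, 60)
y = np.sin(2 * x) + np.sin(3 * x)   # true function; a() captures only part of it
res = y - a(x)                      # residuals to be modeled by the GP

K = se_kernel(x, x) + 1e-6 * np.eye(len(x))
w = np.linalg.solve(K, res)         # GP posterior mean weights

b = 1.5                             # deterministic bound on the GP contribution
def f(xq):
    r = se_kernel(xq, x) @ w        # GP posterior mean on the residuals
    return a(xq) + np.clip(r, -b, b)   # |f - a| <= b holds by construction

xs = np.linspace(-2, 6, 200)        # includes regions without training data
```

Because the squared exponential kernel is local, r decays to zero away from the training inputs, so f falls back to the verified analytical model a there; the clip additionally guarantees |f(x) − a(x)| ≤ b everywhere, regardless of what the GP has learned.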
Experiment: SARCOS benchmark
The objective of the experiment is to compare different regression approaches in terms of their accuracy and interpretability. We consider learning the inverse dynamics of a seven degrees-of-freedom SARCOS robot arm. The task is to map from a 21-dimensional input space to a joint torque, that is, we learn a function f: R21 → R. There are 44,484 training examples and 4449 test examples. All 21 inputs xi have been
Conclusions
The introductory example (Fig. 1) demonstrated the problem of unexpected extrapolation behavior even in rather close vicinity to the training set. Non-local models like MLP are particularly affected. Local models can mitigate the problem but may not always solve it.
Clemens Otte holds a Diploma degree and a Ph.D. in Computer Science from the Technical University of Braunschweig and the University of Oldenburg, Germany. He is working as a Senior Key Expert Engineer for Siemens Corporate Technology. His research interests include machine learning with focus on interpretable and verifiable models, advanced data analysis, pattern recognition and probabilistic methods.
References (14)
- C. Otte, Learning regression models with guaranteed error bounds, in: ESANN 2013, 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013.
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.
- L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Statistics/Probability Series, 1984.
- J.H. Friedman, Multivariate adaptive regression splines, Ann. Stat., 1991.
- W.-Y. Loh, Regression by parts: fitting visually interpretable models with GUIDE.
- D. Duvenaud, H. Nickisch, C.E. Rasmussen, Additive Gaussian processes, in: Advances in Neural Information Processing Systems, 2011.
- M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science, 2009.
- ☆
Work partially funded by the German Federal Ministry of Education and Research (BMBF), grant ALICE 01, IB10003 A-C. This paper is an extended version of [1].
- ☆☆
The above-referenced article is part of the special issue “ESANN-2013” published in issue 141. The publisher regrets that this article was accidentally omitted from the special issue published earlier. The publisher sincerely apologizes to the authors, guest editors and readers for any inconvenience this error may have caused.