Functional form estimation using oblique projection matrices for LS-SVM regression models

Kernel regression models have been used as non-parametric methods for fitting experimental data. However, due to their non-parametric nature, they belong to the so-called “black box” models, indicating that the relation between the input variables and the output, depending on the kernel selection, is unknown. In this paper we propose a new methodology to retrieve the relation between each input regressor variable and the output in a least squares support vector machine (LS-SVM) regression model. The method is based on oblique subspace projectors (ObSP), which allows to decouple the influence of input regressors on the output by including the undesired variables in the null space of the projection matrix. Such functional relations are represented by the nonlinear transformation of the input regressors, and their subspaces are estimated using appropriate kernel evaluations. We exploit the properties of ObSP in order to decompose the output of the obtained regression model as a sum of the partial nonlinear contributions and interaction effects of the input variables, we called this methodology Nonlinear ObSP (NObSP). We compare the performance of the proposed algorithm with the component selection and smooth operator (COSSO) for smoothing spline ANOVA models. We use as benchmark 2 toy examples and a real life regression model using the concrete strength dataset from the UCI machine learning repository. We showed that NObSP is able to outperform COSSO, producing stable estimations of the functional relations between the input regressors and the output, without the use of prior-knowledge. This methodology can be used in order to understand the functional relations between the inputs and the output in a regression model, retrieving the physical interpretation of the regression models.


Introduction
Non-parametric regression is an important field of data analysis. These non-parametric models use some observations of the input data and the desired target to estimate a function and PLOS ONE | https://doi.org/10.1371/journal.pone.0217967 June 7, 2019 1 / 21 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 make predictions [1,2]. However, generally, these models focus on the prediction of the target variable of interest and not on the model interpretability. In this manuscript, we will refer to interpretability as the property of a model to express the output in additive terms of the partial nonlinear contributions of the input variables and their interaction effects. In several applications interpretability plays an important role in the construction of prediction models. In such cases, the main goal is not the prediction of the response of a system but to determine the underlying mechanisms and the relationship between the inputs and the output. For instance, in medical applications, this information can be used in order to identify treatment targets, support diagnosis, and facilitate the introduction of these models in clinical practice. As an example, Van Belle et al. proposed the use of a color code to enhance the interpretability of classification models for the clinicians [3]. Interpretability of black box models has already been addressed for classifiers and regression models using different strategies. On the one hand, Van Belle et al. proposed a clinical interpretable classification model based on a least squares support vector machine (LS-SVM) using the radial basis function (RBF) kernel [4]. In that work, the authors retrieved the interpretability of the classifiers by using a truncated multinomial expansion of the RBF kernel. On the other hand, for regression models, interpretability has been tackled by using sparse additive models [5], and functional ANOVA models [6,7]. In these models the target observations are modeled as a sum of a constant term, main effects, and interaction effects. In this context, the main effects refer to the direct relation between each input variable and the output, considering both the linear and nonlinear contributions of an input regressor on the output, while the interaction effect refers to the combined effect of 2 or more input variables on the output [6]. These models retrieve interpretability similarly to the case of generalized linear models [8], where a direct link between the contribution of each input in the output is estimated explicitly. However, these models require the use of prior knowledge in terms of which variables should be included in the regression models, as well as which interaction effects are of importance. This can only be achieved if the designer has a profound knowledge of the process and mechanisms underlying the changes in the target variable. To address this problem, some methodologies have been developed in order to select components that are relevant for the regression model in an automatic way, [9][10][11]. In particular, in the work from Lin et al., a new method for model selection and fitting in non-parametric regression models using smoothing spline ANOVA is proposed [11]. This method is referred as the Component Selection and Smoothing Operator (COSSO). This methodology is able to produce a model that is accurate in the prediction of the target variable, and it retrieves some of the interpretability by accurately identifying the components and functional forms of the relation between the input variables and the output. Nevertheless, it requires to specify a-priori if the user is interested in finding only main effects or interaction effects as well. More importantly, the solution of both problems does not converge, since the common terms in both cases are not the same. This is due to the fact that both problems, computing the main effect or including the interaction effects, lead to different cost functions with the same objective, fit the observed output. Additionally, this method has been developed for smoothing spline ANOVA models (SS-ANOVA) and, to the best of our knowledge, it has not been extended to other kernel based regression models.
From a geometry point of view, interpretability of non-parametric regression models can be addressed as the decomposition of a target observation vector into additive components [12]. Each of these components should lie in the subspaces that are spanned by the respective input regressors. If a basis for the subspaces of each individual input regressor and their interaction effects can be found, then appropriate projection matrices can be constructed in order to retrieve the interpretability of the models. This idea has been previously exploited in the case of linear models, where we have proposed to use a Hankel expansion of the input variables to construct a basis for their subspaces, and construct oblique subspace projectors in order to effectively decouple the dynamics of undesired input variables, representing the output as a sum of the partial linear contributions of each input [13]. Oblique projectors are particularly powerful when the variables that want to be decoupled are not orthogonal [14]. They have been applied as a preprocessing step in different hyper-spectral imaging [15], decoding [16], and biomedical applications [13].
Considering non-linear regression models using LS-SVM, due to the geometry of the RBF kernel, the subspaces spanned by the nonlinear transformation of the input regressors are not likely to be orthogonal [17]. Therefore, in order to decompose the output into interpretable additive components, oblique projectors should be used. In this paper we address this issue, by formulating a nonlinear extension to oblique subspace projections. We show how to create a basis for the subspaces spanned by the nonlinear transformation of the input regressors, main effects as well as interaction effects, using kernel evaluations. In addition, we suggest how to construct appropriate oblique projectors in order to effectively decompose the output of a kernel based regression model into interpretable additive components.
The manuscript is structured as follows, in section 1 we briefly describe an LS-SVM regression model. In section 2 we describe how to construct oblique subspace projectors and we present the proposed algorithm. In section 3 we evaluate the performance of the proposed algorithm using 2 toy examples and an application using the concrete strength dataset from the UCI machine learning repository. In section 4 we discuss the results and the potential use of the proposed algorithm, and we finalize with some concluding remarks in section 5.
Throughout this manuscript we will use the following notation: we will refer to scalars as italic letters (x), vectors will be represented as lowercase bold variables (x), matrices will refer to capital bold variables (Ω), subspaces will be represented by calligraphic capital letters (V), and spaces will be represented by blackboard bold capital letters (R). With some abuse of notation, we will refer to the projector matrix onto the subspace of the l th input regressors, along the subspace spanned by the other regressors as P l/(l) , where the subindex l represents the subspace of the regressor where the output will be projected, the subindex (l) represents all the input regressors excluding the l th , and the symbol / represents the oblique projection.

LS-SVM for nonlinear regression
LS-SVM is a kernel based methodology that can be used to solve nonlinear classification and regression problems [18]. Due to its flexibility to manage different kind of problems and produce an adequate mathematical model, LS-SVM has been used successfully in different application fields such as: the prediction of electricity energy consumption [19], estimation of water pollution [20], forecasting of carbon price [21], and the prediction of meteorological time series [22], among others. However, its applications have been hampered due to its blackbox model nature.
Let's consider the following nonlinear regression model: whereŷ 2 R represents the output of the model, x 2 R d , and x = [x (1) ; . . .; x (d) ], represents the input vector, x (l) is the entry for the l th input regressor, φð�Þ : R d ! R p represents the nonlinear mapping of x into a possibly infinite-dimensional feature space, w are the weigths, and b is the bias term. Given the following training data fx i ; y i g can be formulated as follows: where γ represents the regularization constant, e = [e 1 ; . . .; e N ] is the error vector with e i the error related to the i th observation. By taking the Lagrangian and solving for the Karush-Kuhn-Tucker conditions for optimality, the solution is given by: where y = [y 1 ; . . .; is the ij-th entry of the kernel matrix Ω, K(�, �) is the kernel function, and α = [α 1 ; . . .; α N ] are the Lagrange multipliers, and b is the bias term. The matrix form of the solution, for the training points, is then given by: Since the algorithm that is proposed in this paper makes use of projection matrices, it is important that the data that is projected is centered. This means, that the bias term in the regression model should be eliminated, as well as the mean value of the nonlinear transformation of the input regressors should be removed. These modifications lead to the following solution in matrix form [23]:ŷ where Ω C = MΩM, is the centered kernel matrix, and M ¼ I À 1 N 1 T N =N is a centering matrix. The ij-th entry of the centered kernel matrix is given by

Nonlinear regression decomposition using ObSP
In this section we present the propossed decomposition algorithm, which is a nonlinear extention to oblique subspace projections (NObSP). NObSP is not a regression method but an algorithm that allows to decompose the output of a regression model, using LS-SVM regression, into additive components that represent the partial nonlinear contributions of the input regressors on the output, and their interaction effects. NObSP uses oblique subspace projections to extract the partial contribution of an input regressor, while nullifying the contributions from other regressors. We will first introduce the concept of oblique subspace projections, then we will proposed their extention for nonlinear regression models.

Oblique subspace projections
Oblique subspace projection (ObSP) is a generalization of orthogonal projectors, where a given vector is projected onto a target subspace following the direction of a reference subspace [24]. Oblique subspace projectors can be used in order to decompose a signal into the partial contributions of some regressors, even when the signal subspaces of the different input regressors are not orthogonal [13,16]. An algorithm for the use of ObSP in signal decomposition for linear regression models has been proposed in [13]. Briefly, the algorithm is summarized as follows. Let's define V � R N as the subspace spanned by a matrix A = [A l A (l) ], with A 2 R N�p , A l 2 R N�q the partition of A that spans the subspace V l � V, and A ðlÞ 2 R N�ðpÀ qÞ the partition of A that spans the subspace V ðlÞ � V, V ðlÞ � SpanðA ðlÞ Þ, such that V ¼ V l � V ðlÞ . Now, let's consider V ¼ V 1 � V 2 � ::: � V d , with � being the direct sum operator, and d represents the number of signal subspaces embedded in A satisfying d � p; then the oblique projector onto V l along V ðlÞ ¼ V 1 � ::: � V lÀ 1 � V lþ1 ::: � V d , denoted by P l/(l) , is given by: where † represents the generalized inverse, and Q (l) is the orthogonal projector onto Null ðA T ðlÞ Þ � V ? ðlÞ , which is computed as: where P ðlÞ ¼ A ðlÞ ðA T ðlÞ A ðlÞ Þ y A T ðlÞ is the orthogonal projector onto V ðlÞ [24].

Nonlinear oblique subspace projection (NObSP)
Let's consider the following nonlinear regression problem: where e is the error term, f l (x (l) ) represents the partial contribution, linear and non-linear, of the l th input variable x (l) on the output, f lh (x (l) , x (h) ) represents the partial contribution of the interaction between the variables x (l) and x (h) on the output, the term G(x) represents the partial contribution of all the other higher order interactions between the input variables, with x = [x (1) ; . . .; x (d) ], and d the number of input variables. From Eq (8), it can be seen that the partial contributions of each input variable, their second order interactions, and the higher order interactions can be found if an appropriate projection matrix can be created. Such projection matrix will span the subspace of the nonlinear transformation of the (input signal)/(interaction effects) of interest, whilst its null subspace contains the nonlinear transformation of the other remaining input variables and their interactions. More especifically, if we define P i/(i) as the oblique projection matrix onto the subspace spanned by the nonlinear transformation of the i th input variable, along the direction defined by the other variables, then by multiplying Eq (8) by this projection matrix, we obtain: Since the oblique projection matrix nullifies all the nonlinear contributions of the input variables, except the i th input variable, then Eq (9) reduces to: where f i (x (i) ) is the partial contribution of the i th input variable on the output, and e (i) is an error term obtained by multiplying the original error by the projection matrix.
In the case of functional ANOVA models, it is assumed that these subspaces are orthogonal [6,11]. This leads to orthogonal projection matrices. However, there is no evidence to support this claim, especially when using kernel regression models. In such cases, an oblique projector will be more appropriate. The problem now lies in finding a basis for the signals subspaces that are required to construct the oblique projector operator.
In a kernel regression framework, the output of the regression model can be expressed as a linear combination of kernel evaluations, i.e.ŷðxÞ ¼ P N i¼1 a i Kðx; x i Þ þ b. Therefore, the kernel matrix spans the column space of the nonlinear transformation of the input variables. In order to see this, lets consider the nonlinear regression problem for the centered data in matrix form, , to solve that problem using least squares, the solution leads to the hat, or projection, matrix The projection matrix can also be written using the centered kernel matrix Ω C , in this case the projection matrix has the form P To prove this we can replace in the previous equation Taking this into account, if we consider the matrix Φ Cl representing the centered nonlinear transformation of the l th input regressor and defined as . . . ; 0Þ T À m φ �, then, the projection matrix onto its subspace is given by since the nonlinear transformation is not known, we cannot directly compute P Cl but we propose to use kernel evaluations as follows: is a matrix containing the centered nonlinear transformation of the regressor variables Proof.

Using LS-SVM the solution to this problem is given byŷ
T the projection matrix onto the subspace defined by the centered nonlinear transformation of the l th regressor variable. Defining Ω C l ¼ Φ Cl Φ C T , and using Using the SVD decomposition of the matrix Φ C , Φ C = UΛV T , and replacing in the definition of P Cl we In the same way we can use (1) ; . . .; x (l−1) ; 0; x (l+1) ; . . .; x (d) ] to form the kernel matrix Ω C(l) which can be used as a basis for the subspace spanned by the centered nonlinear transformation of all the other input regressor variables, excluding the l-th input regressor, as well as their interaction effects. It is important to notice that the subspace spanned by the nonlinear transformation of more than one input regressor will contain their interaction effects as well as their individual contribution, or main effect. Hence, if the interest is to find solely the interaction effect between 2 variables on the output, their individual contributions should be found first and subtracted. For instance, if we define P ij/(ij) as the oblique projection matrix onto the subspace spanned by the nonlinear transformations of the i th and j th input variables, along the direction defined by the other variables, then by multiplying Eq (8) by this projection matrix, we obtain: which reduces to: where f i (x (i) ) and f j (x (j) ) are the partial contribution of the i th and j th input variables on the output, f ij (x (ij) ) is the nonlinear contribution of the interacion effect of the i th and j th input variable, and e (ij) is an error term obtained by multiplying the original error by the projection matrix. Therefore, in order to obtain only the interaction effect, the individual partial contributions should be subtracted. Once the basis for the main contributions, and interaction effects, of given input regressors have been found, the output of an LS-SVM regression model can be decomposed into additive components using Algorithm I.

Out-of-sample extension
In Algorithm I we presented a way to derive the main and interaction effect contributions using training data. This idea can be extended to data that has not been seen during training, however some considerations need to be taken into account. Since the algorithm that is presented in this paper is based on projection matrices, it is important to notice that a proper set of basis vectors are needed in order to construct reliable projection matrices. For instance, if while training the algorithm, it is found that the dimensions of the subspace where the data lies is N d , then at least N d basis vectors are needed to construct a proper projection matrix. In addition, the number of data points used to evaluate the model will define the size of the kernel matrix, which defines the maximum number of basis vectors that can be obtained from it. Therefore, in order to construct projection matrices for data-points outside the training set, and considering N d as the dimension of the subspace of interest, then at least N d new data points are needed in order to properly decompose the output into the nonlinear partial contributions. In practice, while decomposing the training data, the dimension of each one of the subspaces that represent the nonlinear transformation of the input variables, as well as the interaction effects, should be computed. The maximum dimension of such subspaces can be taken as the minimum number of evaluation points that are needed in order to produce a proper decomposition. The steps needed to evaluate NObSP using new data points are summarized in the Algorithm II.

Algorithm I. Nonlinear oblique subspace projections (NObSP)
Input: regressor matrix X 2 R N�d , output vector y 2 R N , estimated output vectorŷ 2 R N . Output:ŷ ðlÞ 2 R N , and y ðlÞ 2 R N main effect and interaction effect contributions from given input regressors using the estimated output from the model and the real measured output, respectively. 1. Normalize the input regressors.  (6) using the centered Ω l and Ω (l) , . 6. Compute the corresponding partial contribution to the output aŝ y ðlÞ ¼ P l=ðlÞŷ , or y (l) = P l/(l) y with l = 1, . . ., d.
The output from both algorithms can be used in order to determine which input regressor (s) and/or interaction effects are of importance in the output of the model, by computing the percentage of power that each one of these contributions provide to the output.

Applications
In this section we present results from the proposed algorithm using 2 simulation examples, as well as a real life example using data from the Concrete Compressive Strength Data Set in the UCI machine learning repository [25]. In the first toy example we compare NObSP with the results given by COSSO, using the same example proposed in [11], where only main effects are included. Additionally, we evaluate the performance of NObSP to select relevant components, and we test the effect of the kernel selection in the decomposition. In the second example, we create an artificial dataset that includes interaction effects, we test for the robustness of the projections and the regression model by means of bootstrapping, we present the results in terms of their mean solution and 95% confidence intervals, in this example we also evaluated the performance of the model to unseen test data. In the third example we demonstrate the potential use of NObSP in a real life example.

Simulation study: Toy example I
In order to compare the performance of NObSP with COSSO, we use the first example presented in [11], where an additive model in R 10 is considered. The regression function is defined by f(x) = 5g 1 (x (1) ) + 3g 2 (x (2) ) + 4g 3 (x (3) ) + 6g 4 (x (4) ), where: g 1 (x) = x, g 2 (x) = (2 x − 1) 2 ,  (10) . We generated the input data, x, as uniformly distributed random numbers in [0, 1]. Additionally, we added noise such that the signal-to-noise ratio is 3: 1 as in [11]. In contrast with the original example, we also imposed a correlation between the input variables of 0.8 in order to increase the complexity of the problem. To impose this correlation we used another uniformly distributed variable u, and the following formula x ðiÞ and ρ is the desired correlation value that will be impossed on the variables.
Since we are using uniformly distributed data, due to the central limite theorem, the resulting input variables will be closer to a normal distribution and will not be uniformly distributed. We use COSSO in 2 ways, first to find only the main effects in the regression, i.e. decompose the output only in additive terms of the partial nonlinear contributions of the input regressors, we will refer to this as COSSO-I. Second, we also computed the output using COSSO for the second order interactions, retrieving not only the main effects but also the interaction effect, to compare if both approaches for COSSO converge to the same result. We refer to this as COSSO-II.
The results for the decomposition are shown in Fig 1. The results from NObSP are presented in a solid gray line, COSSO-I in a black dashed line and COSSO II in a black dotted line. The true output is shown as a black solid line. As can be seen in the Fig, the proposed algorithm is able to approximately retrieve the functional form for the contribution of each input regressor variable, with a performance similar to the one provided by COSSO. It is also possible to observe that the output provided by COSSO-I and COSSO-II are not the same. In addition, in contrast with COSSO, NObSP is not able to retrieve a response equal to zero for the contribution of the input variables that are not included in the final model. This is Nonlinear oblique subspace projections (NObSP) expected, since the proposed algorithm does not include any sparsity or feature selection criteria, instead it is solely based on the output of an existing regression model. However, the magnitude of the contribution of those regressors to the output is smaller than the contribution of the variables that are effectively included in the original model.
In Fig 2, the strength of the nonlinear contributions of the input variables and the interaction effects are shown. To compute this strength, first we computed the total power of the decomposed signal as the sum of the root mean squared values from all the components obtained in the decomposition model. Once this is obtained, the strength of each component is computed as the ratio between its root mean squared value and the total power of the decomposed signal. To visualize the magnitude of the main nonlinear contributions and the interaction effects, we present these values in a matrix form. In this matrix the diagonal represents the strength of the contributions for each input variable, and the upper triangular elements represent the strength of the contribution for second order interactions for the corresponding input regressors, which are indicated in the rows and columns of the matrix. The lower triangular elements are not taken into account since they represent redundant information. Finally, the fields in the matrix that belong to the contributions with a larger influence on the output are colored in black.
From Fig 2 we can observed that COSSO-I only the diagonal produces elements different from 0, since it only retrieves main effects, while COSSO-II and NObSP produce a total of 55 components. It can be seen that the results from NObSP indicate that the first 4 components have a higher contribution to the output than the other components, with a contribution between [4.13%-18.20%] of the total power, compared to [12.48%-43.01%] produced by COSSO-II. COSSO-I also produces components with a larger magnitude in the first 4 components with a root mean square value of [12.51%-44.47%] and 0 in the other components. In contrast with NObSP, COSSO-I and COSSO-II produce components with a contribution equal to zero, due to its sparsity properties. Since NObSP does not include sparsity, most of its components contribute to the output. This results in lower strength values for the components with a higher influence in the output for NObSP. However, by selecting an appropriate threshold, these components can be selected. For this example the threshold was set to 4% by visual inspection of the components.
In Fig 3 the results from the decomposition using NObSP with different kernels are shown. It can be seen that the linear kernel only produces the linear approximation of the nonlinear contributions of the input regressors on the output. Additionally, the contributions of the interaction effects, in the linear model, are equal to zero, which is expected since the interaction effects in linear models are given by a sum of the contribution of each input variable on Nonlinear oblique subspace projections (NObSP) the output. Therefore, by identifying such contributions and subtracting from the interaction effect the resulting component should be equal to zero. The performance of the polynomial kernel and the RBF kernel is quite similar, in both cases the decompositions were able to approximate the desired functional forms.
In Table 1 the root mean square errors for the regression, and the estimation of the functional forms using COSSO-I, COSSO-II and NObSP are shown. The simulations were performed 100 times and the values are presented as median [min-max]. It can be seen that COSSO-I and COSSO-II in general produced a lower error in the estimation of the overall function, with COSSO-I producing the lowest error. However, when estimating the functional forms of the main effects, NObSP, using the RBF kernel, outperforms the results from COSSO.

Simulation study: Toy example II
For the second toy example, we use the following additive model in R 3 : . with, x 1 , x 2 , x 3 2 [0, 1], and η is a Gaussian noise with a variance such that the signal-to-noise ratio on the output variable is equal to 4db. No contribution from x 3 was included. In addition, a correlation of 0.8 was imposed in the input signals.
We computed 400 realizations of the model. From this data, we extracted 1000 different training datasets using random sampling with replacement. For each generated dataset a model using LS-SVM was computed and tuned using 10-fold cross-validation. The partial contributions of the input variables and their second order interactions were calculated using NObSP and COSSO-II [11]. From the output of the 1000 randomizations, the mean and the 95% confidence intervals were computed. Additionally, we computed the output of the model for unseen data using Algorithm II. For this, we first found the rank for the different kernels obtained using Algorithm I, the maximum rank obtained was 45. Based on that rank we run simulations using a test set size from 1 data point up to 45 × 5 data points. For each test set size we generate 100 random sets. In order to identify the convergence of the error in the decomposition with respect to the test set size, we plotted the errors in terms of their mean and 95% confidence interval.
In  values plotted in the Figs do not represent the output of one decomposition of the regression models, but just the mean value of the output for that specific input value. This gives the impression that the models produce noisy estimates, but that is not the case. In the Fig it can be seen that NObSP is able to approximately retrieve the functional form of the simulated model. In addition, the contribution from the input x 3 is small in amplitude compared to the other components, indicating that this variable does not contribute largely on the output. The second-order interactions appear to produce larger confidence intervals, indicated by the gray area. However, NObSP was still able to retrieve the approximate functional form of the second-order interaction between the variables x 1 and x 2 , while producing a small output for the components that were not present in the original model. Fig 5 shows the results from the smoothing spline ANOVA models using COSSO-II. It can be seen that COSSO-II, in this case example, is not able to retrieve the desired functional forms. But, when looking at the error in the complete regression model the LS-SVM regression produce a root mean square error of 0.09 [0.02-0.11], while COSSO-II produce an error of 0.04 [0.03-0.05], indicating that COSSO II was able to fit better the model than NObSP. Here it is important to notice that NObSP is not a regression method, but just an algorithm that allows to decomposse the results from a regression model, using LS-SVM, into the partial nonlinear contributions of the input variables on the output, as well as their interaction effects. In contrast with NObSP, COSSO is embedded within the cost function of funtional ANOVA models. Therefore, COSSO migth lead to a better fit than the LS-SVM regression. However, this example shows that, enventhough the overal fit is worse, NObSP is able to retreive the funtional form of the nonlinear partial contributions of the input variables on the output, while COSSO fails.
In order to evaluate whether the problem with the estimation was caused by the correlation between the input variables, we repeated the simulation using an input with uncorrelated inputs. The results are shown in Fig 6. It can be seen that COSSO-II now is able to retrieve, approximately, the functional forms for the variables x 1 and x 2 , it also identifies that x 3 , as well as some interaction effects, do not have a contribution in the model. But, it still fails to identify the second-order interaction that was present in the original model. The output for the LS-SVM model using NObSP for this case is shown in Fig 7. The Fig shows  that in this case NObSP can still retrieve the approximate functional forms, as shown previously.
Additionally, the output for NObSP using the unseen test set can be seen in Fig 8 using 250 data points. The Fig shows that NObSP is able to retrieve the functional forms of the relationship between the input and the output of the model using new data samples. Furthermore, in Fig 9 we can see that the errors for the estimation of the functional components decrease as the size of the test set increases. This error converges around 45 samples, which is the maximum rank that was needed in order to construct appropriate projection matrices to decompose the output of the model.

Simulation study II: Concrete compressive strength dataset
In this section we used data from the concrete compressive strength database from the UCI machine learning repository. We tested the performance of NObSP and compared with COS-SO-II. In this dataset there are 8 different input variables that are used in order to predict the strength of the concrete samples. The input variables are: the amount of cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate, and the age of the cement in days. In total, the dataset contains 1030 samples. More information about the characteristics of the dataset can be found in [25]. For the simulations, we computed 100 realizations of the regression models, using random sampling with replacement. We express the output in terms of the mean and 95% confidence intervals. Fig 10 shows the results from NObSP. The Fig displays not only the decomposition but also the general performance for the regression model. Based on the magnitude of the responses, the elements that contribute the most to the strength of the concrete samples are: amount of cement, furnace slag, ash, and water. Other elements, such as superplasticizer, course and fine aggregate, do not contribute strongly to the final strength of the concrete sample.

Discussion
We have proposed an algorithm to decompose the output of an LS-SVM regression model into the sum of the partial nonlinear contributions, as well as interaction effects, of the input  regressors. We have demonstrated that the functional form of the relation between the inputs and the output can be retrieved using oblique subspace projections. We have also shown that through appropriate kernel evaluations it is possible to find a proper basis for the subspace representing the nonlinear transformation of the input regression variables.
In contrast with other methodologies, the proposed algorithm does not require to define apriori the model contributions for the decomposition. NObSP starts with a normal LS-SVM regression model, which solution, in terms of the kernel parameters, is used in order to implement the decomposition strategy. Other methodologies, such as COSSO, require to define apriori whether the user is interested to decompose the output in terms of the main effects, or also to include second-order interactions. This might lead to different functional forms since there is no guarantee that both approaches will converge to the same solution. This was demonstrated in section 3.1 where COSSO-I and COSSO-II produced different outputs. In NObSP this is not a problem since the decomposition scheme does not require to solve an additional optimization problem.
The basis for the subspace spanned by the nonlinear transformation of the data, which is computed using kernel evaluations, can be seen as an N-dimensional approximation of the true subspace, which might be embedded in a much larger space. When using a linear kernel, it can be seen that the subspaces of the nonlinear transformations, as expected, cannot be properly constructed, resulting just in linear approximations of the true nonlinear model, as shown in Fig 3. When using a polynomial kernel, the decomposition scheme performs better than the linear case, being able to produce a better estimation of the nonlinear influences of the input regressors. However, as shown in section 3.1, the best results were obtained using the RBF kernel. This might be explained by the fact that the RBF kernel can be considered as an universal kernel, which is able to approximate the responses produced by any other kernel by changing its bandwidth.
In the second toy example, it was shown that NObSP was able to approximately identify the functional forms for the main effects, and second-order interaction effects on the output, in both cases when the input variables were correlated and uncorrelated. COSSO-II was only able to retrieve the main effects when the variables were not correlated. However, COSSO-II was able to identify more accurately the components that are not part of the original model.
When considering the concrete strength example, it can be seen that the decomposition scheme using NObSP produces results with small confidence intervals in most of the components. More importantly, the results provided seem to agree with the literature, where the strength of the concrete sample is expected to increase with increasing amounts of cement and decreasing amount of water, and it increases during the first days [25,26]. Conversely, NObSP indicates that by increasing the amount of furnace slag, the strength of the cement increases, while Yeh et al., have reported the opposite [26]. However, the datasets used in both analyses are different, which might lead to different decompositions. In addition, it is important to notice that their analysis is done considering that several variables are kept constant, while in our case this restriction was not applied.
One of the main drawbacks of NObSP, in contrast with COSSO, is that it does not provide a closed form for the estimation of the nonlinear contributions and the interaction effects. However we have shown that it is possible to evaluate the decomposition scheme when a sufficient number of test samples are available. This is due to the fact that the decomposition is based on the projection of the observed data onto the respective subspace. Therefore, it is important that an adequate number of basis vectors is used in order to construct proper projection matrices. We have shown that the number of data points that are required in order to construct this projector is equal to the maximum rank of the kernel matrices that are created during training. In case that a closed form is required, one solution can be to introduce a multiple kernel learning problem similarly to the one proposed in [27]. This is out of the scope of this paper and will be addressed in future studies.
Additionally, there are some numerical issues that should be considered. Since kernel evaluations are needed to create a basis for the subspace spanned by the nonlinear transformation of a given variable, when using too many input regressors, it is possible that the computation of the Euclidean distances produce values close to machine precision. This is caused by the fact that in such cases several variables are set to 0, leaving only the variable of interest intact. In such cases, the results provided by NObSP are not reliable. Another problem that may arise is when the partial contribution of one variable on the output is much samaller than the contributions from other input variables or interaction effects. This will cause that this nonlinear contribution will not be found during the decomposition, since it will be absorved by the other components. In order to mitigate the impact of these issues we recommend to perform an initial input selection. In this way NObSP will be applied in a reduced set of input variables, this will not only select variables that might have a larger impact on the output, but will also reduce the risk of having numerical issues. Additionally, we also recommnd to normalize the input regressors to guarantee that the estimation of the distances is not biased by the magnitude of only one input variable.
We also performed simulations including a third order interaction component. The results from these simulation show that eventhoug NObSP was able to still retrieve the main effects and second order interaction effects, it failed to retrieve the third order interaction component. It produced results that were too noisy. This might be caused by the accumulation of errors when decomposing the target signal. It was not possible to compare the results from this simulation with other methods, since COSSO can only retrieve up to second oder interaction effects.
In summary, the advantages and disadvantages of NObSP and COSSO can be summarized as follows: • NObSP does not require the define a priori the kind of relations of interest between the input regressors and the output. COSSO does require to define if the user is interested in the main effects or also in the interaction effects.
• NObSP converge to the solution whether the user is interested in finding the main and interaction effects. COSSO produces different results.
• NObSP is able to retrieve the functional forms even in the presence of correlated input regressors.
• COSSO produce models that are sparse. NObSP does not include sparsity in its definition.
• COSSO is able to retrieve close forms for the evaluation of new samples. NObSP, requires to construct projection matrices which require the evaluation of a minimum amount of points to produce a proper output.
• In contrast with COSSO, NObSP suffers of numerical issues due to the evaluation of kernel functions with vectors that contain a considerable amount of zeros.
NObSP can also be extended to the analysis of dynamical systems. For instance, NObSP can be introduced naturally for the nonlinear identification of dynamical systems using subspace identification techniques such as the one proposed in [28]. In the actual formulation, the relation between the input variables and the output in the regression model is considered static. However, by using as input variables a Hankel expansion, or any other expansion, of the input regressors, it is possible to consider a dynamic nonlinear model, introducing this framework in the field of system identification.
Finally, it is interesting to notice that NObSP relies only on the tuning of a kernel in order to decompose the output in the partial nonlinear contributions of the input and the interaction effects. This provides insights about the geometry of the regression models using kernels, by linking the space where the original data is embedded and the subspaces generated by the nonlinear transformation of the data. This also indicates that this methodology is not limited to the use in LS-SVM model, but it can be adapted to any kind of kernel-based regression models, where the output can be represented as a linear combination of kernel evaluations, like in equation Eq (4). Such models include kernel principal component regression [29,30], kernel partial least squares [31], and kernel ridge regression [32].

Conclusions
In this manuscript we proposed a decomposition scheme using oblique subspace projections, called NObSP, which is able to retrieve relevant information about the input variables and their relationship with the output in an LS-SVM regression model, facilitating its interpretation. The performance of the proposed model was demonstrated using 2 different toy examples as well as a practical example from a public database. This methodology has a huge potential in many fields, including biomedical applications. For instance, it can be used for the study of the interactions between different physiological signals, providing extra information for the understanding of some underlying regulatory mechanisms, and supporting clinical diagnosis and treatment.