Rectangularization of Gaussian process regression for optimization of hyperparameters

Gaussian process regression (GPR) is a powerful machine learning method which has recently enjoyed wider use, in particular in the physical sciences. In its original formulation, GPR uses a square matrix of covariances among the training data and can be viewed as a linear regression problem with equal numbers of training data and basis functions. When data are sparse, avoiding overfitting and optimizing the hyperparameters of GPR are difficult, in particular in high-dimensional spaces where the data sparsity issue cannot practically be resolved by adding more data. The optimal choice of hyperparameters, however, determines the success or failure of an application of the GPR method. We show that hyperparameter optimization is facilitated by rectangularization of the defining equation of GPR. On the example of a 15-dimensional molecular potential energy surface, we demonstrate that this approach allows effective hyperparameter tuning even with very sparse data.


Introduction
Gaussian process regression (GPR) (Bishop, 2006; Rasmussen and Williams, 2006) is a powerful machine learning tool. GPR is easy to use, in particular in high-dimensional spaces: being a non-parametric method (only a small number of hyperparameters need to be selected), an increase in space dimensionality does not lead to a drastic increase in the number of (non-linear) parameters. This is in contrast to, for example, neural networks (NN) (Emmert-Streib et al., 2020; Montavon et al., 2012), where the number of nonlinear parameters rapidly proliferates as the numbers of neurons, layers, and space dimensionality increase; this proliferation increases the danger of overfitting as well as the computational cost. It has been argued that GPR's expressive power is equivalent to that of an infinite-width neural network (Neal, 1995; Williams, 1996). The recent appearance of comparisons between NN and GPR on the same data, including applications requiring very high accuracy such as the fitting of spectroscopically accurate potential energy surfaces (PES) (Kulik et al., n.d.; Majumder et al., 2015; Manzhos et al., 2015b, 2006; Manzhos and Carrington, 2021) and kinetic energy functionals (Manzhos and Golub, 2020), has highlighted GPR's advantages, in particular in obtaining highly accurate approximations of multidimensional functions from few data (Kamath et al., 2018).
One problem area, as will also be highlighted below, is overfitting and the optimal choice of GPR hyperparameters (of the GPR kernel) that allows one to avoid it, which is especially difficult when data are scarce. While GPR is especially attractive in high-dimensional spaces, as growth of the space dimensionality need not lead to increased computational cost of the key matrix equations 1 and 2 below, high-dimensional applications need to deal with the issue of hyperparameter optimization under low density of training data. We note that data are always sparse in sufficiently high-dimensional spaces (Donoho, 2000). The problem of GPR hyperparameter optimization has been addressed with various methods; examples are random-search-related methods (Bergstra and Bengio, 2012), various versions of Bayesian inference (Brochu et al., 2010; Snoek et al., 2012) (with the commonly used maximum likelihood estimator, also used in this work, belonging to this type of method (Myung, 2003)), simulated annealing (Fischetti and Stringher, 2019), genetic algorithms (Alibrahim and Ludwig, 2021), so-called bandit-based methods (Li et al., 2018), and combinations thereof (Falkner et al., 2018).
However, the scarcity (low density) of available training (and test) data may not permit optimizing the kernel parameters for the best global (i.e. in all relevant parts of descriptor space) quality of the function representation, i.e. for the avoidance of overfitting. The global quality of the representation can be considered from the point of view of the completeness of the basis set when one views GPR as linear regression with a basis formed by the kernel functions (see below).
Overfitting is, however, a problem with GPR as well. An appropriate choice of hyperparameters is necessary to avoid it. We have recently shown (Manzhos and Ihara, 2022) that the commonly used maximum likelihood estimation (MLE) (Myung, 2003), which maximizes the log likelihood function, can fail to find appropriate hyperparameters when data are sparse. While in principle it is tempting to think that this issue could be resolved by adding more data, in practice this is a dead-end proposition: in high-dimensional spaces, not only does adding more data not significantly change the data density (in terms of the number of data points per dimension of a corresponding direct product grid), but the cost of wielding Eqs. 1-2 grows rapidly with M. We have shown in (Manzhos and Ihara, 2022) that even when using an additive model of $f(\mathbf{x})$, $f(\mathbf{x}) \approx \sum_{i=1}^{D} f_i(x_i)$ (Duvenaud et al., 2011; Manzhos et al., 2022), where the component functions $f_i(x_i)$ are well-determined (no overfitting) with few data (in the example of (Manzhos and Ihara, 2022), the $f_i(x_i)$ were well-determined with only 500 data in a 15-dimensional space, precisely because they are low-dimensional as opposed to the 15-dimensional $f(\mathbf{x})$), it took about 10,000 training data for the MLE method to be able to find good hyperparameters under fixed $\sigma$. Even with 10,000 training data, MLE did not find acceptable hyperparameters for a general non-additive GPR model or for an additive model with simultaneous optimization of $l$ and $\sigma$. This is in spite of the fact that hyperparameters do exist (and can be found manually) that result in an accurate GPR model without noticeable overfitting from 5,000 data (fewer than 1.8 data per dimension) (Manzhos and Ihara, 2022).
In this article, we explore an alternative way to determine optimal hyperparameters, based on the view of GPR as a regularized linear model with a basis set made of the covariance functions. We view the problem of hyperparameter optimization as a basis completeness problem. In (Manzhos and Ihara, 2022), we proposed using RS-HDMR (random sampling high dimensional model representation) (Li et al., 2006; Rabitz and Aliş, 1999) to build, without the danger of overfitting, reference functions computable everywhere in space, which could be used to tune the hyperparameters (of the basis) by computing more data. That method has the disadvantages that the basis completeness is optimized for a different target function than the function of interest (albeit one close to it) and that more data need to be generated, increasing the cost of wielding the matrix equations of GPR. Here, we explore an alternative route of optimizing the basis completeness by modifying the governing equation of GPR, namely, by rectangularizing it.

Methods
We begin by recognizing the equivalence between GPR and regularized linear regression (Bishop, 2006). Eq. 1 has the form of a basis expansion,

$$f(\mathbf{x}) = \sum_{m=1}^{M} c_m k(\mathbf{x}, \mathbf{x}^{(m)}) \qquad (8)$$

with basis functions $b_m(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}^{(m)})$ and linear coefficients $c_m$. The quality of this approximation is fully determined by the quality of the basis set $\{k(\mathbf{x}, \mathbf{x}^{(m)} | l)\}$, i.e. the extent of its completeness.
Ultimately it is the quality of the basis (in relevant parts of space) that hyperparameter optimization should improve. In the case of the squared exponential kernel used here, it is the quality of a Gaussian-type basis set with basis functions located at each of the M training points.
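For concreteness, we assume throughout the standard isotropic squared exponential form with a single length parameter $l$,

$$k(\mathbf{x}, \mathbf{x}' | l) = \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2l^2}\right),$$

so that each training point $\mathbf{x}^{(m)}$ contributes one Gaussian basis function centered on it, with a common width set by $l$.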
Eq. 8 can be written in matrix form,

$$\mathbf{f}' = \mathbf{B}\mathbf{c} \qquad (9)$$

where the length of the vector $\mathbf{f}'$ corresponds to the number N of considered points $\mathbf{x}'$. These could be all or a subset of the points where knowledge of $f(\mathbf{x})$ is required in an application (e.g. in the case of learning a molecular potential energy surface, which is the example of GPR application considered in this work, these could be points on a quadrature grid in a quantum dynamics calculation (Beck et al., 2000; Bowman et al., 2008)), or they could be a set of test points. B is then a rectangular matrix of size $N \times M$ with elements $B_{nm} = k(\mathbf{x}'^{(n)}, \mathbf{x}^{(m)})$. If $\{\mathbf{x}'\} = \{\mathbf{x}\}$ (the same M known points), then $\mathbf{B} = \mathbf{K}$ is a square matrix, $\mathbf{f} = \mathbf{f}'$, and c is obtained as $\mathbf{c} = \mathbf{K}^{-1}\mathbf{f}$, i.e. one obtains standard GPR (Bishop, 2006).
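A minimal sketch of the construction of B, assuming the squared exponential kernel above (the function and variable names here are illustrative, not taken from the original implementation):

```python
import numpy as np

def sqexp_kernel(X1, X2, l):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    # Pairwise squared Euclidean distances between the two point sets.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * l ** 2))

# X_basis: the M known points serving as basis centers, shape (M, D).
# X_eval:  the N points where f is known or required, shape (N, D).
# B = sqexp_kernel(X_eval, X_basis, l)   # shape (N, M)
# If X_eval coincides with X_basis, B reduces to the square K of standard GPR.
```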
The specific nature of GPR compared to a generic linear regression of the type of Eq. 8 lies in the use of a covariance function for the basis, which gives Eqs. 1-2 the meaning of estimates of, respectively, the mean and variance of a Gaussian distribution of values of $f(\mathbf{x})$.
The defining equation of GPR, Eq. 1, is thus based on a square linear problem with as many training points as basis functions. With N = M, Eq. 9 is solvable exactly barring numeric instability of the inversion, and in this case there is no information inherent in Eq. 9 with which to optimize the basis. One then has to resort to heuristics such as MLE for hyperparameter optimization. When N > M, however, i.e. with a rectangular version of Eq. 9, the equation in general cannot be solved exactly, and the residual of the least-squares solution can be used to guide hyperparameter optimization to improve basis completeness (Chan et al., 2012; Manzhos et al., 2011a, 2011b), where $\mathbf{c} = \mathbf{B}^{+}\mathbf{f}'$ and $\mathbf{B}^{+}$ is the Moore-Penrose pseudoinverse (Penrose, 1955) of B. In this case one selects (randomly in our case) M of the available N data points as basis centers. If one wishes to use $\sigma$ in this scheme, it can be introduced into the diagonal of the matrix $\boldsymbol{\Sigma}$ of the singular value decomposition of B, $\mathbf{B} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, whereby $\mathbf{B}^{+} = \mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^{T}$, where $\boldsymbol{\Sigma}^{+}$ is formed from $\boldsymbol{\Sigma}$ by taking the reciprocal of all the non-zero elements and then transposing (Golub and Van Loan, 1996). However, as we are solving a rectangular problem in the least-squares sense, the parameter $\sigma$ is not needed. We found no advantage to using the parameter $\sigma$ with a rectangular Eq. 9; the rectangular approach achieves as good a global error without $\sigma$ as the traditional GPR with optimal $\sigma$ in (Manzhos and Ihara, 2022). As only M points serve as basis centers, we can still use Eq. 2 to approximate the variance of the predicted values, with the $\mathbf{K}^{*}$ and $\mathbf{K}$ of Eq. 2 computed based on those M points. If $\sigma$ is not used, one can use $\mathbf{K}^{+}$ instead of $\mathbf{K}^{-1}$ in Eq. 2.
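A minimal sketch of the resulting scheme, under the assumptions above: M of the N available points serve as basis centers, c is obtained by least squares, and l is chosen by minimizing the residual rmse_res (illustrative code, not the authors' implementation):

```python
import numpy as np

# sqexp_kernel is the kernel helper from the previous sketch.

def fit_rectangular(X_train, f_train, l, basis_idx):
    """Least-squares solution of the rectangular Eq. 9 for a given l."""
    X_basis = X_train[basis_idx]                      # M basis centers
    B = sqexp_kernel(X_train, X_basis, l)             # N x M, with N > M
    # lstsq returns the pseudoinverse solution c = B+ f' (computed via SVD).
    c, *_ = np.linalg.lstsq(B, f_train, rcond=None)
    rmse_res = np.sqrt(np.mean((B @ c - f_train) ** 2))
    return c, rmse_res

# Fix one random choice of M ~ N/2 centers, then scan l and keep the value
# minimizing the residual of the N x M problem:
# rng = np.random.default_rng(0)
# basis_idx = rng.choice(N, size=N // 2, replace=False)
# best_l = min(l_grid, key=lambda l: fit_rectangular(X, f, l, basis_idx)[1])
```

Keeping the basis centers fixed across the scan ensures that the residuals for different l values are directly comparable.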
In (Manzhos and Ihara, 2022), we studied the optimization of hyperparameters of GPR from the perspective of basis optimization. In that work, to optimize the parameters of the basis in Eq. 8, we generated with the help of RS-HDMR a reference function $f_{ref}(\mathbf{x})$ which was sufficiently close to the original function $f(\mathbf{x})$ and such that (i) it avoided overfitting (was well defined with few data) and (ii) a basis which was optimal for $f_{ref}(\mathbf{x})$ was also "good enough" for $f(\mathbf{x})$. The role of $f_{ref}(\mathbf{x})$, which was computable everywhere in space, was to enable the calculation of a large test set of points which was then used to optimize the basis hyperparameters. In (Manzhos and Ihara, 2022), that approach was also compared to MLE. The results of (Manzhos and Ihara, 2022) are therefore a natural comparison point for the present approach of rectangularization of GPR. We use the same dataset of ab initio samples of the 15-dimensional potential energy surface of the UF6 molecule for $f(\mathbf{x})$; the data set was made publicly available in the Supporting Information of (Boussaidi et al., 2020). We obtain a similar global error to that obtained with square GPR in (Manzhos and Ihara, 2022), where the hyperparameters were tuned to a large (much larger than N) test set. Based only on the knowledge of the training point set of 10,000 points, and with the help of RS-HDMR, somewhat suboptimal basis hyperparameters could be derived and a global error of 35.1 cm⁻¹ could be obtained with square GPR (Manzhos and Ihara, 2022). With rectangular GPR, the hyperparameters optimal for the global error were obtained based on information contained in the N points only.
Table 1. Root mean square errors obtained with the rectangular Eq. 9 (rmse_res) and on the test set of points, in cm⁻¹. N is the number of rows in matrix B and M is the number of columns (the number of basis functions). Results with the optimal length parameter based on rmse_res for each N are highlighted in bold.

Important for this approach is the observation that choosing l by minimizing the residual of the rectangular matrix equation also minimizes the global error, proxied here by the error on a large test set (much larger than the training set). The ratio of N to M that provides a good predictive power of the optimal l is roughly 2. When M is more than about half of N, there is a danger of overfitting, and the trend in rmse_res as a function of l may no longer reflect the trend in the test rmse. To put this in perspective, a ratio of about 2-3 of the number of points to the number of basis functions was found to be optimal for the rectangular collocation approach (to build a basis representation of vibrational wavefunctions) with Gaussian and other bell-like basis functions (Kamath and Manzhos, 2018; Manzhos et al., 2018). One can therefore use many fewer basis functions than known points, which decreases the cost of fitting and of calling the model. Note that the optimal l values with the "square" GPR (which were about 3.0-3.5) were larger than those we obtain with the rectangular GPR, and the optimal $\sigma$ was on the order of $1 \times 10^{-4}$ to $1 \times 10^{-5}$. These values of $\sigma$ were close to the threshold of numeric stability (calculations with $\sigma$ values below $1 \times 10^{-6}$ were unstable) (Manzhos and Ihara, 2022). With rectangular GPR, we dispense with $\sigma$ altogether, which simplifies hyperparameter optimization (our tests with $\sigma$ added to the diagonal of $\boldsymbol{\Sigma}$ showed an effect on rmse_res but no additional improvement in the test set rmse).
In Figure 2, we show the three-sigma confidence intervals for selected cases (M, N, l combinations).
To make the figure readable, the plots are done for random subsets of 1,000 test points, of which 500 show confidence intervals, so that the relation of the computed confidence intervals to the actual spread of the data can be appreciated visually. We observe that the confidence intervals depend strongly on l and may or may not correspond to the actual error of prediction. This behavior is the same as in traditional square GPR (Boussaidi et al., 2020). We caution that confidence intervals computed with Eq. 2 should not automatically be used as error bars; see notes to this effect in (Boussaidi et al., 2020; Ren et al., 2021).
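A minimal sketch of the confidence-interval estimate, assuming Eq. 2 has the standard GPR form $\sigma^2(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}^{*T}\mathbf{K}^{-1}\mathbf{k}^*$ restricted to the M basis centers, with $\mathbf{K}^{+}$ replacing $\mathbf{K}^{-1}$ as described above (illustrative code only):

```python
import numpy as np

# sqexp_kernel as in the earlier sketches; X_basis holds the M basis centers.

def predictive_variance(X_star, X_basis, l):
    """Variance estimate of Eq. 2 over the M basis centers, K+ replacing K^-1."""
    K = sqexp_kernel(X_basis, X_basis, l)        # M x M
    K_star = sqexp_kernel(X_star, X_basis, l)    # N* x M
    K_pinv = np.linalg.pinv(K)                   # Moore-Penrose pseudoinverse
    # k(x*, x*) = 1 for the squared exponential kernel.
    var = 1.0 - np.einsum('ij,jk,ik->i', K_star, K_pinv, K_star)
    return np.clip(var, 0.0, None)               # guard against small negatives

# Three-sigma confidence intervals around the predictions:
# ci = 3.0 * np.sqrt(predictive_variance(X_test, X_basis, l))
```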
Figure 2. Difference between the GPR-computed and exact values of the potential energy, in cm⁻¹, for selected M, N, l combinations. For readability, a random subset of 1,000 test points is shown, of which 500 show three-sigma confidence intervals, so that the relation of the confidence intervals computed with Eq. 2 to the actual spread of the data can be appreciated visually.

Conclusions
We explored rectangularization of the defining equation of Gaussian process regression to address the important problem of hyperparameter optimization, which determines the success or failure of an application of the GPR method. Such optimization is difficult when data are scarce (data density is low), with failures of the MLE method likely and documented. Recognizing that GPR is equivalent to a linear regression with a basis formed by covariance functions, we proposed a hyperparameter optimization approach based on the quality of the least-squares solution of a rectangular matrix equation of size $N \times M$, where M is the basis size and N > M is the number of points used to tune the hyperparameters. We demonstrated the advantages of the approach on the example of fitting a 15-dimensional potential energy surface of the UF6 molecule and showed that it allows effective hyperparameter tuning even with very scarce data. With fewer than 10,000 data points the method was able to determine hyperparameters which were optimal for the global quality of the function (proxied by a large test set). Rectangular GPR makes unnecessary the noise parameter typically added to the diagonal of the covariance matrix to improve stability and generalization, which simplifies hyperparameter optimization. One can use many fewer basis functions than known points, which decreases the cost of fitting and of calling the model. Using about half of the known points as basis centers can be recommended, and good results are obtainable when only a third of the data points are used for that purpose.

Figure 1. Example correlation plots between the exact and GPR-predicted values of the potential energy, in cm⁻¹, for different M, N, l combinations. Blue points are for the training set and red points for the test set. The values of the Pearson correlation coefficient for the N points used in the training and for the $N_{all} - N$ test points are shown on the plots, as well as the corresponding rmse values.