Evolving Gaussian Process kernels from elementary mathematical expressions

Choosing the most adequate kernel is crucial in many Machine Learning applications. Gaussian Process is a state-of-the-art technique for regression and classification that heavily relies on a kernel function. However, in the Gaussian Process literature, kernels have usually been either ad hoc designed, selected from a predefined set, or searched for in a space of compositions of kernels which have been defined a priori. In this paper, we propose a Genetic-Programming algorithm that represents a kernel function as a tree of elementary mathematical expressions. By means of this representation, a wider set of kernels can be modeled, where potentially better solutions can be found, although new challenges also arise. The proposed algorithm is able to overcome these difficulties and find kernels that accurately model the characteristics of the data. This method has been tested in several real-world time-series extrapolation problems, improving the state-of-the-art results while reducing the complexity of the kernels.


Introduction
Gaussian Processes (GPs) [1] are one of the most used techniques in Machine Learning for regression and classification tasks. Furthermore, they have also been applied to optimization tasks under the umbrella of Bayesian optimization [2]. A GP is a collection of random variables, any finite set of which has a joint Gaussian distribution. It is completely defined by a mean function and a covariance function described in terms of a Positive Semi-Definite (PSD) kernel.
The assumption in a GP is that, as the similarity between two solutions increases, so does the similarity of the output function at these solutions. The kernel function encodes the particular manner in which the similarity between any two solutions is defined, which makes it a key element in any application of GPs.
While there is a repertoire of kernel functions available in the literature [1,3,4], the choice of the most appropriate kernel for a given problem is not straightforward. Moreover, kernels usually have some parameters that need to be arXiv:1910.05173v2 [cs.LG] 14 Oct 2019 adjusted, which hardens the kernel selection problem. These parameters, often called hyperparameters, are usually tuned by maximizing a given metric (e.g., the marginal likelihood) [5].
In early applications of GPs, the kernel function used to be designed by an expert [1], or selected from a predefined set [6]. However, some recent works tackle the question of automating the choice of the kernel [7,8,9]. Compositional kernel search is one of the most used techniques when automating the kernel choice. In this technique, the kernel is always the combination of a limited number of a priori defined kernels, and the kernel selection is reframed as a search in the space of possible kernel compositions. The compositional kernel search approaches take advantage of some operands (e.g., sum, product, ...) that guarantee the composed kernel is PSD as long as its components are also PSD. For example, in [7] and [9], the kernel search is carried out by means of a greedy search procedure. A similar approach is presented in [8], although in this work the search is guided by Genetic-Programming (GenProg 1 ) [10]. The solutions proposed by these methods have shown their ability to capture function properties such as smoothness, trends and periodicity [7,9]. In addition, as the behavior of the base kernels and the operands is well-known, the behavior of their composition may be guessed by an expert [7]. On the contrary, these kernel search methods usually end up with very complicated kernels, including many hyperparameters, which make them expensive to optimize. Moreover, it must be noted that these approaches rely on kernels that have already been proposed in the literature. There is no reason to believe that kernels obtained by composing a limited set of human-designed kernels are optimal for arbitrary problems. Furthermore, using previously designed kernels as building blocks could bias the search and prevent the exploration of more promising candidates.
In this paper, instead of considering a reduced set of base kernels as building blocks, we propose using a set of elementary mathematical expressions (e.g., product, sum, exponent, etc.) to serve as components of a wider set of kernels. Our hypothesis is that, by enlarging the space of possible solutions, the representation capability of GPs is expanded, allowing more accurate kernels with a lower number of hyperparameters to be found.
Searching for the most appropriate solution in the space of mathematical expressions is very challenging due to the vast number of kernels that can be generated and the lack of guarantee that these kernels satisfy the PSD property. Note that, before evaluating a kernel, its hyperparameters should be optimized, which limits the number of kernels that can be explored due to the computational effort required to find these hyperparameters. We propose a novel GenProg method, EvoCov, which is able to overcome these challenges and learn adequate kernel functions for each problem. This method does not rely on previously proposed kernels, and thus, new kernels may naturally arise.
Although we focus on regression problems in this work, our contribution can be easily extended to other GP applications, such as classification. Moreover, some of the components designed in EvoCov could be extended to other GenProg applications.
The remainder of the paper is structured as follows: In the next section, a background on GP regression is provided, including the presentation of the best known GP kernel functions. In Section 3, we present our kernel representation approach, based on elementary mathematical expressions. In Section 4, a novel GenProg method, EvoCov, is proposed to search for GP kernel functions based on such grammar. In Section 5, a review on related work is provided. Next, in Section 6, we present an empirical validation of the algorithm and comparisons with other methods. Finally, in Section 7, the conclusions and the future work are presented.
The predictive Gaussian distribution can be found by obtaining the conditional distribution given the training data and the test inputs: As in many previous works [11,6,12], we consider an a priori equal-to-zero mean function (m(x) = 0) to focus on the kernel search problem.

Covariance function
GP models use a PSD kernel to define the covariance between any two function values [3]: A PSD kernel is a symmetric function k : for any n ∈ N, x 1 , ..., x n ∈ R d and a 1 , ..., a n ∈ R.
If a kernel function holds Equation (5), then the covariance matrix C, where c ij = k(x i , x j ), n ∈ N and x 1 , ..., x n ∈ R d is: • symmetric, i.e., C = C T .
• a PSD matrix. A matrix is PSD if and only if v T Cv ≥ 0 for all real vectors v ∈ R n . It is equivalent to say that all its eigenvalues are non-negative.

Well-known kernel functions
In this section, we introduce some of the best-known kernel functions. These kernels can be divided into two main families: stationary and non-stationary kernels [4].
A stationary kernel is translation invariant. Among the stationary kernels, we focus on isotropic kernels, as they are the most used kernel functions in the literature. In such kernels, the covariance function depends on the norm: where θ l is the lengthscale hyperparameter andk a function that guarantees that the kernel is PSD. The lengthscale hyperparameter can be also a vector that expresses the relevance of each dimension d, as suggested in Automatic Relevance Determination (ARD) approaches [13,14].
On the other hand, non-stationary kernels are the ones that may vary with translation. Within this family, the most common kernels are those that depend on the dot product of the input vectors. These kernels are usually referred to as dot-product kernels: where θ l is again the lengthscale hyperparameter, θ s is the shift hyperparameter and 1 is a vector of ones. Table 1 shows eleven well-known kernels generally used in GP applications [1,3]. The Squared Exponential (SE) kernel, also known as radial basis function (RBF), is one of the most popular choices, and it is described as k SE in the table. This kernel is known to capture the smoothness property of the objective functions.
Kernel function expressions Constant Lineark LIN (s) = s Table 1: Well-known kernel functions. θ 0 and θ p are the kernel hyperparameters called amplitude and period respectively. δ is the Kronecker delta.

Model selection
The choice of the kernel function and its hyperparameters has a critical influence on the behavior of the model, and it is crucial to achieve good results in any application of GPs. This selection has been usually made by choosing one kernel a priori, and then adjusting the hyperparameters of the kernel function so to optimize a given metric for the data.
Although a variety of methods have also been proposed to optimize the hyperparameters [15,16,17,18], the most common approach is to find the hyperparameter set that maximizes the log marginal likelihood (LML): where θ is the set of hyperparameters of the kernel and n is the length of X.
Alternatively, a leave-one-out cross validation (LOOCV) metric was proposed [1], where the likelihood of the posterior distributions are added: where µ i and σ i are the posterior mean and variance for x i given X −i and f −i 2 .
The selection of the right set of hyperparameters is known to be a hard problem, particularly when few observations are available [12,19,20]. Although in most cases the gradient of the LML and the LOOCV has a closed-form expression, depending on the problem, these functions can be multi-modal and a greedy search procedure may lead to suboptimal results.

Kernel composition
When creating new kernels, it is usually difficult to prove whether they are PSD or not. However, there are certain operations which guarantee that if the source kernels are PSD, the result is also a PSD kernel [3,21]. Below we list some of the operations that are guaranteed to keep the positive semi-definiteness of its components: where p is a polynomial function with non-negative coefficients.
• Composition with a function:

Gaussian Process kernel representation as elementary mathematical expression trees
While previous approaches have proposed the composition of kernel functions, in this work we break down the well-known kernels of Table 1 into basic mathematical expressions, in order to use them as the building blocks for new kernels. Thus, these well-known kernel functions and their sum/product compositions are a subset of our search space.
The kernel functions are described as an expression tree, composed by the set of the operators and terminals shown in Table 2. The expression tree is strongly-typed [22], as the output of each node matches the input type of its ancestor.
Our grammar considers a pair of input vectors (their symbol is denoted as x) and a vector of hyperparameters (denoted as hp1, hp2, ...) as arguments of the kernel function. As in [23], the grammar also contains the spectral transformation, in order to allow periodic kernels. Furthermore, the square distance and dot product are included as described in Section 2.1. Note that the expressions denoted by the symbols euc and hp do not modify the input. These expressions are included to constrain the random kernel generation as explained in detail in Section 4.1.1. In addition, our grammar includes the +, ×, andˆarithmetic operators, having their usual meanings (addition, product and power, respectively). Note that we only allow hyperparameters as the exponent in the power operator. The same interpretation is given to the unary operators, such as the square root and the natural exponent. The power to the minus one is also added as an unary operator (denoted as div). Finally, our grammar considers some constants that are commonly found in kernel functions as terminals. Although the White Noise kernel could be included as a terminal in the grammar, it is added to all the kernels generated during the search in order to model the noise (See Section 6.2).

Evolving kernel functions based on the new grammar
Once the grammar has been introduced, we present our GenProg approach for GP kernel search, EvoCov. This algorithm takes into account two challenges related to this problem: The cost of evaluating the fitness function (mainly due to hyperparameter optimization) and the fact that many of the kernels generated during the search are not PSD.
Our GenProg approach is shown in Algorithm 1. First, an initial population of N kernels is generated. In order to do so, each individual is created at random, limited by a maximum (d max ) and a minimum (d min ) depth. Each generation, the whole population is evaluated. Next, the relative improvement is measured. If the relative improvement in the current population is greater than a threshold β, a new population is generated through selection and variation. After selecting the µ best individuals, the algorithm randomly chooses the variation method between a mutation or a crossover operator (with probability p m and p cx respectively, where p cx = 1 − p m ) to generate an offspring population of N − µ new individuals. The next population is made up of the selected individuals and the offspring population 3 . When the relative improvement is lower or equal to the threshold, the best individual is saved and the current population is replaced by a randomly generated one. This procedure is repeated for G generations, until the last population is evaluated and the best individual found during the whole process is returned.
In this section we describe each of the methods used by Algorithm 1. First, we address the issue of randomly generating new kernels for the initial population. The distinguished characteristic of our proposal for generating the random kernels is that it does not take into account any kernel proposed in the literature, while guaranteeing a minimum depth and a maximum depth. Then, we provide the variation operators conceived to generate kernels that are likely to inherit useful properties from the selected ones. A method to control the depth of the trees created by the variation operators is also Symbol Input Expression Output Args.
x Table 2: Set of expressions used by the grammar. Column "Symbol" refers to the symbol given to each expression in the grammar. In column "Expression" the assignments (=) of the mathematical expressions to the outputs are shown. Finally, Columns "Input" and "Output" indicate the input and output types of each expression respectively. In these columns, (x, x) represents the input vector type, while (x,x ) shows the transformed input vector type. The hyperparameter type is denoted as θ and the z, v, w variables belong to the cov type.
introduced. Next, we explain how the GenProg kernels are evaluated. Finally, for comparison proposes, we include two simpler methods: Random Search and Go With The First.

Initial population
We generate kernel functions at random and discard the non-PSD ones until the desired population size is reached.

Random generation
We propose a strongly-typed grow method to randomly generate kernel expression trees, based on the work done in [10]. This is achieved by a recursive process where, at each step, a random terminal or operator is added.
When randomly generating solutions, some of the solutions may be small and trivial, and some other may be too complex. Thus, we propose a method to control the depth of the generated trees by setting a minimum (d min ) and a maximum depth (d max ). As can be seen in Table 2, some of the expressions have at least one input that matches the output type. These expressions guarantee that, once selected, the iterative procedure can continue growing this branch, i.e., these expressions are nestable.
As can be seen in Algorithm 2, during the recursive process, we select a random expression depending on the current type. If the minimum depth has not been reached, only the operators that can be nested are used. Then, until the maximum depth is reached, any operator can be selected. Finally, when the maximum depth is reached, only the terminals, arguments and the operators that can not be nested are used, limiting the depth of the tree.

Checking positive semi-definiteness
Non-PSD kernels should be excluded from our search space as they are not valid for GP. Although it is computationally unfeasible to guarantee that every single kernel is PSD, we follow an efficient method to identify most of the non-PSD kernels, similar to the work carried out in the SVM literature [24,25,26].
As mentioned in Section 2.1, any covariance matrix generated by a PSD kernel has to be symmetric and also PSD. To identify non-PSD kernels, we generate w random uniformly distributed data sets X = ( .., n} and n ∈ N) and check the covariance matrix C produced by the kernel for each data set. If any covariance matrix matches the following cases, the kernel is rapidly discarded: • C = C T : As previously mentioned, the covariance matrix given by a PSD kernel should be symmetric.
• Any c ii is negative: It has been proved [27] that, if any of the elements in the main diagonal are negative, the covariance matrix is not PSD.
• Any of the eigenvalues of C is negative: Similarly, all the eigenvalues of the covariance matrix should be non-negative.
Not matching these cases is necessary but not sufficient for a kernel to be PSD. If after meeting this condition, during the evaluation step, we find out that the kernel is not PSD, the fitness value of the kernel is penalized. However, this validity check is severe enough to avoid most of the false positives. Among the kernels that were generated and validated during the experiments conducted in this paper, only 0.67% were not evaluable.

Variation operators for kernel generation
Our kernel search method is based on perturbation or variation methods that modify previous solutions to obtain new ones. We use two variation operators, which are randomly selected every VARIATE function call in Algorithm 1: A crossover operator, which combines two kernel functions to generate a new one that keeps some of the features of its parents, and a mutation operator, which introduces slight modifications to the original kernel to obtain a new individual.
We also explain how we control the depth of the trees generated by these variation methods.

Crossover
A purely random crossover operator hardly ever produces PSD kernels. Since kernel function evaluation is a computationally costly process, we would like to avoid non-PSD kernels. As explained in Section 2.4, the product or the sum of two PSD kernels is also PSD. Hence, a crossover method could just combine two PSD kernels with any of these operators to generate a new PSD kernel. However, this procedure rapidly increases the depth of the expression trees.
Therefore, we propose a crossover operator that randomly selects a subtree from each kernel and combines them with the sum or the product operator. As this method does not guarantee that the resulting kernel is PSD, this operation must be repeated if a non-PSD kernel is found. Nevertheless, the method increases the chance of obtaining a PSD kernel, since if both of the subtrees are PSD, the result is guaranteed to be also PSD.

Mutation
The mutation operator works by randomly selecting one of the following methods in a type-safe manner: Insert: Inserts an elementary mathematical expression (see Table 2) at a random position in the tree, as long as their output types agree. The subtree at the chosen position is used as the input of the created expression. If more inputs are required by the new primitive, new terminals are chosen at random.
Shrink: This operator shrinks the expression tree by randomly choosing a branch and replacing it with one of the branch inputs (also randomly chosen) of the same type.
Uniform: Randomly selects a point in the expression tree and replaces a subtree at that point by the subtree generated using our random generation method (Described in detail in Section 4.1.1). Note that the output type of the random generated subtree must match the output type of the replaced one.
Node Replacement: Replaces a randomly chosen operator from the kernel expression by an operator with the same number of inputs and types, also randomly chosen.
As these methods do not guarantee that the generated kernels are PSD, mutations are repeated if a non-PSD kernel is detected (see Section 4.1.2).

Bloat control
None of the variation methods limit the depth of the kernel expression. Depending on the operators, the depth of the expression trees may increase without any limit during the search, making the resulting kernel functions complex and useless for practical applications. This is a well known problem in GenProg literature, known as bloating [10]. In our work, when the depth of a kernel expression becomes larger than o max , we discard the expression. In this case, the mutation or the crossover method is repeated until a kernel with the desired depth is obtained, or a limit of ρ max trials is reached. If this number of trials is exceeded, one of the parent kernels is returned unchanged.

Evaluation
In our approach, in contrast to other GenProg applications, the solutions do not encode all the necessary information to be evaluated. In order to evaluate each kernel, we need to set the value of the hyperparameters. Thus, the fitness of the solutions depends on the results of the hyperparameter optimization. Both search procedures, the selection of the best hyperparameters for each kernel and the selection of the best kernel given these hyperparameters, are illustrated in Figure 1. As in [9], we use the Bayesian Information Criterion (BIC) [28] as a quality metric for each kernel. BIC is a metric for model selection which adds a regularization term to the LML to penalize the complexity of the kernels. This metric serves as the fitness function of our GenProg algorithm and it can be expressed as follows: where q is the number of hyperparameters of the kernel and n is the number of data points in X. θ i,best is the best hyperparameter set for the kernel k i according to a given metric.
Before computing the BIC associated to a given kernel, the hyperparameters have to be optimized. As we have seen in Section 2.3, several metrics (LML, LOOCV, ...) can be used to measure the quality of each hyperparameter set. Thus, we find the best hyperparameter set for kernel i as follows:

Hyperparameter Optimization Algorithm
The hyperparameters are optimized by means of Powell's local search algorithm [29]. As this algorithm is not bounded, the search space has to be constrained by penalizing non-feasible hyperparameter sets. On the other hand, as the function to optimize might be multi-modal, a multi-start approach was used, performing a restart every time the stopping criteria of the Powell's algorithm are met, and getting the best overall result. Note that, as a result of the inclusion of the randomized restarts, the hyperparameters found for a certain kernel in two independent evaluations may not be the same. In fact, this implies that the fitness function optimized by the GenProg algorithm, i.e., BIC, is stochastic.

Random Restarts and Hyperparameter inheritance
The initial solutions for the restarts of the hyperparameter optimization algorithm are sampled from two different distributions depending on the origin of the kernel. In the randomly generated kernels, the initial hyperparameters for these restarts are sampled from a uniform distribution within the search bounds. On the other hand, if a kernel is generated through any of the variation methods, we take advantage of the information gathered in previous hyperparameter optimization procedures by adapting the inheritance technique described in [9] and [7] to the particularities of GenProg. Instead of restarting the multi-start optimization from a uniform distribution, each restart is sampled from a Gaussian distribution centered on the hyperparameter values of the parent individuals and with a pre-defined variance (σ θ ). This inheritance method is particularly useful when the variation performs few changes to the expression tree.
Note that, in Algorithm 1, the selected individuals are kept for the next population and the whole population is evaluated each generation. Thus, some individuals may be evaluated several times during the search. This procedure, along with the hyperparameter inheritance, allows the selected individuals to keep optimizing their hyperparameters across generations, and compete fairly with the individuals in the offspring population, which inherit the hyperparameters.

Selection
We perform a search in the kernel function space to find the kernel that maximizes the BIC. Thus, the selection operator shown in Algorithm 1 selects the µ best kernels according to the BIC metric by applying truncation selection.

Alternative search methods
In order to verify that every component of the proposed GenProg algorithm is providing a benefit to the kernel search, we introduce two algorithms to be used as a baseline in the experiments. First, we describe a Random Search algorithm to test the contribution of the components of EvoCov with the exception of the random generation method. Then, we propose a Go With The First algorithm, which does not depend on the crossover operator, in order to measure the gain produced by this operator.

Random Search
This Random Search method generates a random population by iteratively following the random generation method described in Section 4.1.1 until the desired population size is achieved (N ). Next, it chooses the best solution according to the selection criterion described in Section 4.4.

Go With The First
This algorithm generates an initial population of size G and, for each individual, a hill-climbing procedure is applied for N evaluations. This procedure is carried out by generating a random mutation, and keeping the best solution between the original and the mutated one. Once all the individuals in the population have been optimized following this procedure, the worst one according to the selection criteria is discarded. This whole process is repeated until only one kernel is left.

Related work
In this section, we review the work carried out in the literature related to this paper. First, we discuss works that design ad hoc kernels based on expert knowledge. Subsequently, we review the works which propose an automatic design of kernels.
In ad hoc kernel approaches [1,30], the authors assume that the choice of the kernel function is clear from a priori knowledge about the problem. Then, the hyperparameters are optimized to adjust each kernel. In [1], an ad hoc kernel is introduced to fit the Mauna Loa Atmospheric CO 2 time-series, which is a well-known problem in the GP literature due to its several periodic patterns. On the other hand, in [31], the authors propose a product of a Squared Exponential and a periodic kernel to construct a control signal. Similarly, the authors of [30] designed an ad hoc kernel to predict the number of occurrences of certain hashtags in Twitter, given the past records. Finally, the authors of [32] took advantage of the Bochner Theorem [33] to design kernels that were able to model the periodical patterns of time-series. Regarding the hyperparameter optimization, Deep Learning methods have also been applied to pre-train and fine-tune the hyperparameters of the covariance functions [34].
Regarding the automatic design of kernels, many works have followed the kernel composition approach by using the properties shown in Section 2.4 to generate new kernels [8,9]. Authors of [8] propose a GenProg method for compositional kernel search, using the well-known kernels shown in Table 1 as building blocks. They also consider the sum, product and scale as primitives, along with a dimension mask. The hyperparameters where not included in the grammar, as only the hyperparameters present in the well-known kernels are considered. The experimentation of this work was limited to the Mauna Loa Atmospheric CO 2 time-series and some synthetic 2-dimensional data sets. In addition to GenProg, other search methods have been proposed to search for kernel composition structures in GPs, such as the greedy search procedure proposed in [9]. In that work, the best kernel function in terms of BIC is searched in the space of possible compositions (sums and products) of simpler kernels. In [7], the authors improve the previous approach by adding change-point and change-window kernels. Finally, in [35], Bayesian Optimization was used to search in the model space.
The idea of using elementary mathematical expressions as building blocks of kernel functions has been applied to other fields, such as Support Vector Machines (SVMs) [25] and Relevance Vector Machines (RVMs) [36]. Some of these approaches [36,37,38] do not guarantee that kernels are PSD. In SVM and RVM, although kernels theoretically have to be PSD in order to do the kernel-trick, in practice, many kernels can be used even if they are not of this kind. Some other approaches, such as [39] and [40], guarantee that the kernels are PSD by means of the kernel composition properties shown in Section 2.4, similar to the compositional kernel search methods in the GP literature. Finally, in [24], [25] and [26] the non-PSD kernels are penalized or discarded as in our approach. The authors of [25] and [26] propose a method to penalize (giving the worst possible fitness) the non-PSD kernels in evaluation time. On the other hand, in [24], if during the random generation a non-PSD kernel is found, it is discarded and the creation is retried. However, as stated by the authors, their approach was not able to improve the results of the standard kernels in SVM. The above mentioned approaches deal with hyperparameters by means of optimizing small grids or adding random constants to the GenProg grammar. To the best of our knowledge, more complex techniques such as the hyperparameter inheritance have not been applied in this context.

Experiments
In this section, we describe the experiments we carried out to analyze the performance of our proposal. We solve extrapolation problems from real-world time-series and compare our proposal to the main methods discussed in Section 5 (compositional kernel methods and ad hoc kernel approaches) in such tasks. The goal of our experiments is three-fold: • To compare EvoCov to state-of-the-art methods that rely on kernel composition.
• To compare our proposal to the ad hoc kernels proposed in the literature.
• To study the influence of the metric used to optimize the hyperparameters in time-series extrapolation problems.
First, an introduction to the time-series extrapolation problem is given, before describing the experimental setup. Then, three experiments are shown, one for each objective of the experimentation.

Time-series extrapolation problems
The objective in time-series extrapolation is to predict future time-stamp values given some previous data. While properties like the smoothness of the data have been extensively studied in GP literature for interpolation problems, other properties required in extrapolation have not been studied to the same extent, such as periodicities and trends.
Real-world time-series problems have been considered for the evaluation of our methods, as they are more realistic than the synthetic environments. The selected problems are characterized by a limited amount of usually noisy data with strong variations between training and test sets.
In Table 3, the real-world time-series used in the first two experiments are described 4 . Following the work done in [7], we have trained all the algorithms on the first 90% of the data, predicted the remaining 10%, and then computed the root mean squared error (RMSE) for that 10%.

Experimental Setup
Our algorithms were coded in Python, based on the EA software DEAP 5 [41]. For the GP regression, a noisy approach was used, adding a white noise kernel to all the generated kernels, including a noise hyperparameter. For the random generation method, d min = 5 and d max = 15 were used to limit the size of each expression tree. In order to discard

Metric comparison for hyperparameter optimization
In GP literature, hyperparameter optimization is considered a crucial task. Most of the works carried out in this field rely on the LML for hyperparameter optimization [8,9]. However, it has been reported that the LML may lead to suboptimal results under certain conditions, where LOOCV could be more robust [42]. Regarding the kernel optimization, having a consistent method to optimize the hyperparameters helps to obtain more reliable evaluations of the individuals. Hence, we decided to perform a preliminary experiment to test if other alternative metrics to LML and LOOCV can improve the results in time-series extrapolation problems.
Apart from the well-known LML and LOOCV, we tested other metrics specifically designed to optimize the hyperparameters in extrapolation problems. While the LML measures the probability of the training data given the prior GP, the goal in extrapolation is to increase the probability of the test data given the posterior GP. Thus, along with the prior LML, we also measure the posterior LML. By splitting the train set into some train-train and train-test sets, we get the posterior GP given the train-train set, and measure the probability of the train-test set given this model, i.e., the posterior LML. On the other hand, the LOOCV measures the ability of the GP to interpolate. To create a version of the LOOCV metric for extrapolation, we compute the sum of the posterior probabilities of each of the data points in the train-test set, and call it Sum of Posterior Likelihoods (SoPL). Finally, having the train-train and train-test sets, the RMSE in the train-test set is also used as a measure given the train-train set.
To compare the performance of these metrics, we considered the best kernels found in [9] for all the time-series described in Section 6.1. For each metric and time-series, we carried out an optimization process with the Powell's algorithm, in order to find which one leads to the best results in terms of the RMSE in the test set. Each optimization process starts from a random hyperparameter set and stops when 5000 samples have been taken. Due to the randomness of the process, 10 trials were carried out.
In Table 4, the average results are shown for the thirteen time-series. The SoPL outperforms the rest of the metrics in five of the problems, while LML gets the best overall results in four time-series and RMSE is the best choice in three. Those three metrics are superior to posterior LML and LOOCV, as the former is only able to obtain the best results in the Daily Minimum Temperatures in Melbourne time-series, and the latter is not able to beat other metrics in any of the problems. As expected, in these extrapolation problems LOOCV is the metric with the worst performance.
-  Statistical tests were used to determine if there is a metric that is more robust than the others in time-series extrapolation problems 6 . First, we averaged all the RMSE results for each metric and time-series, and applied the Friedman's test [44]. We found significant differences between all the metrics (p-value = 8.454e−5). Then, we applied a post-hoc test based on Friedman's test, and adjusted its results with the Shaffer's correction [45]. In Figure 2, the critical differences between the metrics are shown. As can be seen, there are no significant differences between SoPL, LML, RMSE and posterior LML. Similarly, the results between LOOCV and posterior LML do not differ significantly. Overall, it can be seen that there is no metric best suited for all the time-series, and the choice of the best metric depends on the problem.
As the differences between LML and SoPL are not significant, we used two variants of EvoCov for the following experiments, one using LML to optimize the hyperparameters (EvoCov LML) and the other using SoPL (EvoCov SoPL).

Comparing our proposal to compositional kernel search approaches
In kernel learning tasks, the compositional kernel search approaches hold the state-of-the-art results in the GP literature.
In this section we describe the exploratory experiment we carried out initially, and the benchmark where we compare our algorithms to the compositional kernel search approaches.

Initial Experiment
To begin with, we performed an initial experimentation to test whether better kernels can be found with our grammar, compared to the kernel composition approaches. Particularly, we would like to know: 1. Whether it is possible to improve the composed kernels by means of manipulating elementary mathematical expressions, as we propose in this paper. 2. Whether the tree-based representation of kernels, along with the variation operators, allow such an improvement.
Given the best kernels found in [9] for all the time-series, we generated 200 random mutations according to the method described in Section 4.2.2. Next, we performed a hyperparameter optimization process using the LML metric for each mutation, departing from the best hyperparameter values found in the original work. Finally, we measured the RMSE in the test set for all these mutations against the RMSE provided in [9].
As can be seen in Figure 3, there are many mutated kernels that obtain a better RMSE than the original one for the Mauna Loa Atmospheric CO 2 time-series. Moreover, some of those kernels have fewer hyperparameters than the best kernel achieved in [9] (9 instead of 10). In Table 5, similar results to the Mauna Loa Atmospheric CO 2 can be found for the rest of the time-series. Among the 200 randomly generated kernels in each of the time-series, there are some kernels that are better fitted according to the RMSE. Furthermore, in 12 out of 13 time-series there are mutated kernels that have better results in RMSE, with fewer hyperparameters.
As we have shown in this exploratory experiment, we conclude that it is possible to improve the results obtained by compositional kernel search approaches by means of manipulating elementary mathematical expressions. We can also confirm that the mutation operator presented in this work allows such improvements.

Benchmark
In view of results obtained in the previous section, we tested EvoCov, along with the proposed alternative search methods, in the benchmark presented in [7]. To the best of our knowledge, this work provides the most extensive comparison in the literature in time-series extrapolation with GPs. In this benchmark the following algorithms can be found:  Table 5: Results of the initial experiment. In the first column, the ratio of mutated kernels that are better fitted, in terms of RMSE, than the best kernel achieved in [9] is shown. The ratio of kernels that are simpler, according to the number of hyperparameters, is illustrated in the second column. In the last column, the ratio of kernels that are both better fitted and simpler can be found.
Eureqa: A Symbolic Regression engine that uses genetic algorithms to search in the space of the possible equations [46]. Although this approach may seem similar to our work, Eureqa learns the predictive function itself, while our approach provides a probabilistic prediction by means of a GP. Linear Regression (LIN): The basic linear regression is approximated by a GP with a Linear kernel. The hyperparameters are learned by the LML optimization. Squared Exponential (SE): A GP with the Squared Exponential kernel shown in Table 1 is used. The hyperparameters are also learned by optimizing the LML. Bayesian variant of multiple kernel learning (MKL): A weighted sum of base kernels is used to construct more complex ones [47]. Change Point (CP) Modeling: A GP based approach allowing changepoints in kernels, that is, a combination of kernels where the weight of each of the components depends on the inputs [48,49,50]. Spectral Mixture Kernels (SK): These kernels model the spectral density with a Gaussian mixture [32]. Trend-Cyclical-Irregular (TCI) Models: The statistical model described in [51] is approximated by means of a GP and combining the periodic kernels with linear ones as covariance function. GPSS: The greedy GP kernel search method described in [9] is used, as discussed in Section 5. ABCD accuracy: An improvement of GPSS, introduced in [7], which includes the ChangeWindow and ChangePoint kernels. ABCD interpretability: A modification of the previous approach that focuses on interpretability. This approach favors additive components as they are more interpretable by the practitioners. Similarly, the authors decided to remove the Rational Quadratic kernel as it is more difficult to describe automatically [7].
All the compositions of kernels that are included in GPSS can be represented in our grammar. Thus, the search space of EvoCov is a superset of the search space of GPSS. On the other hand, ABCD approaches include ChangeWindow and ChangePoint kernels that cannot be modeled with the current grammar of EvoCov. Hence, different kernels can be found by these approaches. Table 6 shows the numerical results of the experimentation for each time-series, while in Figure 4, the overall results are shown. EvoCov is presented in two variants, one using LML to optimize the hyperparameters, and the other variant using the SoPL metric. Note that RMSEs are standardized by dividing by the smallest RMSE achieved in the experiments for each data set, so that the best performance on each data set has a value of 1. Also, it is worth mentioning that, in the experiments conducted in [7], only one trial for each time-series and algorithm was carried out 7 , and, for our algorithms, the mean and the best of ten trials is shown.
As we can see in Table 6, EvoCov LML achieves the best average result and beats the rest of the algorithms in the Mauna Loa Atmospheric CO 2 time-series. Moreover, its median result is the second best, very close to the best one.   Table 7: Number of hyperparameters for each extrapolation problem and algorithm. The noise hyperparameter is also considered. In our approaches, the average number of hyperparameters is shown. In ABCD acc. some data is missing as it could not be found in the supplementary material of [7]. The best trials of this algorithm outperform the rest of the algorithms in three other problems. Our Go With The First algorithm is the best choice in Monthly critical radio frequencies time-series, and its best trials were better than the other algorithms in 6 problems. Although its average performance is not as good as EvoCov LML, it was able to achieve the third best median RMSE. EvoCov SoPL shows the best behavior in Solar time-series. On the contrary, this approach obtains poor results in Monthly average daily calls, compromising its average result. The Random Search was able to get the best result for the Solar time-series in its best trial. However, this algorithm has a worse performance than those already mentioned, confirming the contribution of the mutation and crossover operators. Regarding the compositional kernel search approaches, GPSS approach gets the second best mean result, and is able to beat the rest of the algorithms in Daily minimum temperatures in Melbourne time-series. ABCD accuracy obtains the third best mean result. This algorithm gets the best median result and holds the best results in three of the time-series. ABCD interpretability, with worse results than ABCD accuracy according to the mean, is the best approach in Monthly average daily calls and Number of daily births in Quebec problems. On the other hand, there are some other time-series, such as Beveridge Wheat Price Index, Monthly U.S. male unemployment and Airline, where the algorithms based on the CP, Linear and Spectral kernels, get better results than compositional kernel search approaches. Finally, out of the GP approaches, Eureqa outperforms the rest of the approaches in Real daily wages in England time-series. However, this symbolic regression engine is worse than EvoCov LML in the rest of the time-series. Table 7 shows the number of hyperparameters of the best kernel found for each algorithm in each problem. It can be seen that the compositional kernel approaches (ABCD interpretability, ABCD accuracy and GPSS) have always more hyperparameters than any of our approaches on average. The only exception can be found in Number of daily births in Quebec time-series, where ABCD interpretability uses 6 hyperparameters, and EvoCov LML and EvoCov SoPL use 6.8 and 7.0 hyperparameters respectively.
Overall, EvoCov LML is a competitive approach compared to the current state-of-the-art compositional kernel search approaches. In spite of not including the ChangePoint and ChangeWindow kernels, this approach is able to obtain comparable results to ABCD accuracy, with kernels that have fewer hyperparameters, which makes them easier to optimize.
6.5 Comparing our proposal to ad hoc kernel approaches As we have mentioned in Section 5, many works in GPs propose a human-designed specific kernel for each particular problem. One of the most recent works can be found in [30], where the number of tweets in the Twitter timeline that contain a given hashtag was predicted by means of a GP. The authors show that this information can be useful to predict the hashtags that a tweet has, given its content. They propose an ad hoc kernel, Periodic Spikes (PS), that captures the periodicities of these hashtag time-series. For example, the #goodmorning hashtag shows a clear periodic pattern, as it is more frequently tweeted in the mornings. On the other hand, there are some hashtags, such as #np (now playing), that do not follow the periodic pattern mentioned above, and according to the authors, there are kernels better suited than PS for these problems. Our proposal should be able to identify these situations, and offer the best possible kernel without human intervention.
Hence, we carried out the same experiment as in [30], where #goodmorning, #breakfast, #confessionhour, #fail, #fyi and #raw hashtags are predicted 8 . For each hashtag, the number of tweets per hour was collected, using one month for training and the other to test, except for the #goodmorning hashtag, as in the original paper, where 3 weeks were gathered, having 2 weeks to train and the last one to test.
In Figure 5, an example of the hashtag prediction is shown. The number of occurrences of the #goodmorning hashtag is shown per hour, along with the best model given by our approach in a single run. In this problem, a periodic trend can be appreciated, which is successfully captured by our model.  Table 8 shows a comparison between EvoCov LML, EvoCov SoPL and the PS kernel. The experiments with the PS kernel were carried out using our software, and ref _f un_call = 5000 samples were allowed to find the best hyperparameters for this kernel. As can be seen, EvoCov SoPL is the best approach on average, and it obtains the best results in the #confessionhour problem. EvoCov LML is able to get the best score in the #fail and #fyi hashtags, finding a complex periodic pattern. It is also worth mentioning that the best trials of this algorithm obtain the best results in four out of five problems. However, in simpler periodic time-series, such as #goodmorning, #breakfast and #raw, the PS kernel is the best choice, getting the best median result.  Standardized RMSE for each extrapolation problem and algorithm is shown. In our approaches, the mean and the best results are illustrated. The best results on average are shown in bold, while the best results overall are highlighted by an asterisk. Table 9 shows the number of hyperparameters used by different approaches. As expected, our approaches use more hyperparameters than the PS kernel, as this kernel is specifically designed for these problems.  Table 9: Number of hyperparameters for each extrapolation problem and algorithm. The noise hyperparameter is also considered. In our approaches, the average number of hyperparameters is shown.

PS
The PS kernel is still able to hold the best results in three out of six problems. On the other hand, the EvoCov approaches, without any expertise about the problem, are able to obtain similar predictions to PS, even improving the results of the PS kernel in the rest of the problems. Also note that the EvoCov approaches are able to get better average results than the PS kernel, showing a more adaptable behavior.

Conclusions
Kernel functions are widely used in several Machine Learning methods. GPs are one of these techniques, where a PSD kernel is used as a covariance function. This kernel function has to be carefully selected to achieve good results in any GP application. Although initial approaches used to rely on predefined kernels or ad hoc solutions for specific problems, there is an increasing interest in automatically learning these kernels. In this work, we have presented an evolutionary approach to learn kernel functions for GPs. While other approaches are based on kernel composition, in our approach, kernels are modeled by means of basic mathematical expressions.
This work has made the following contributions: • Basic mathematical expressions as building blocks for GP kernels: We propose to bring the progress made in other Machine Learning areas to the GPs by considering its covariance function as a program that can be learned.
• Fast PSD check for GP kernels: Although some of the kernels generated by this new random method are not PSD, we have defined a kernel validation procedure that rapidly discards most non-PSD expression trees based on the properties of the covariance matrix. • Hyperparameter inheritance: We have incorporated hyperparameter inheritance within GenProg, improving the efficiency of the algorithm. • Metric comparison for hyperparameter optimization: We provide valuable insights about the suitability and performance of several metrics for hyperparameter optimization in extrapolation problems. • Extensive benchmark in realistic problems: We have evaluated our proposal in an extensive benchmark of realistic problems, showing that our agnostic algorithm is competitive to a wide range of methods.
Altogether, these contributions enabled the design of a GenProg variant which is able to improve the state-of-the-art results in the application of GP to time-series extrapolation. We can conclude that there is no need to rely on a priori defined kernels for GP time-series extrapolation problems, as it is possible to learn simpler and better kernels by evolving mathematical expression trees that satisfy the PSD restrictions.
Further research in the grammar is suggested, extending it to ChangePoint and ChangeWindow kernels. On the other hand, we propose continuing the work carried out to measure the performance of the hyperparameter optimization metrics for GP extrapolation problems.