GP-BART: a novel Bayesian additive regression trees approach using Gaussian processes

The Bayesian additive regression trees (BART) model is an ensemble method extensively and successfully used in regression tasks due to its consistently strong predictive performance and its ability to quantify uncertainty. BART combines"weak"tree models through a set of shrinkage priors, whereby each tree explains a small portion of the variability in the data. However, the lack of smoothness and the absence of an explicit covariance structure over the observations in standard BART can yield poor performance in cases where such assumptions would be necessary. The Gaussian processes Bayesian additive regression trees (GP-BART) model is an extension of BART which addresses this limitation by assuming Gaussian process (GP) priors for the predictions of each terminal node among all trees. The model's effectiveness is demonstrated through applications to simulated and real-world data, surpassing the performance of traditional modeling approaches in various scenarios.


Introduction
Bayesian additive regression trees [BART; 7] is a probabilistic machine learning model that has proved successful in both regression and classification settings [39,38,22]. BART uses a non-parametric approach which learns through sums of trees [6] where each terminal node contribution is constrained by a regularising prior distribution. Given a vector of predictors x i = (x i1 , . . . , x ip ), the target function f (x i ) is obtained by aggregating the small contributions of each tree, which is similar in flavour to the small step updates of gradient boosting algorithms [11].
Considering a univariate response and training observations denoted as {x i , y i } n i=1 , the standard BART model is given by where the function h assigns a sampled value µ t to x i within terminal node of the tree T t across all T trees and the vector L t = (µ t1 , . . . , µ tbt ) collects the sampled mean parameters from the b t terminal nodes in tree T t . Here, N(·) denotes the normal distribution and τ is a residual precision term. In standard BART, terminal node parameters µ t are assigned a N(µ µ , τ −1 µ ) prior, where the hyperparameters are selected to shrink the influence of each tree.
Our novel GP-BART method modifies the standard BART by using the function g (replacing h) which assigns sampled values ψ t to the n t observations in node of tree T t , rather than the single value µ t used by BART. This is achieved by assuming a Gaussian process (GP) prior over each terminal node with constant mean µ t and a covariance function whose parameters are defined at the tree level.
In recent years, several extensions and modifications to the original BART model have been proposed to cover different types of data and assumptions [21]. To deal with the lack of smoothness, Linero and Yang presented a soft version of the BART model by advocating probabilistic split rules at the tree-building stage [27]. Starling et al. presented a BART extension, also incorporating GPs, that guarantees smoothness over a single target covariate by applying Gaussian process priors for each terminal node over the targeted variable [33]. Prado et al. proposed model trees BART that considers piecewise sums of linear functions at the terminal node level instead of piecewise sums of constants, adding flexibility [30]. Our GP-BART considers GP models at the terminal node level, and can be seen as a piecewise sum of GPs which are inherently smooth.
Notably, our GP-BART approach is coherent with previous works which identified that the BART model is itself a GP, asymptotically. Linero showed that as T → ∞, BART becomes a GP with a non-parametric covariance structure, whereby each element of the covariance matrix is described by the proportion of times the two corresponding design points end up within the same terminal nodes across all trees [26]. Therefore, it is natural to assume GP priors over the terminal nodes directly to circumvent the need for large T . More specifically, Linero also show that the implied kernel under this relation between BART and GPs is a function of the L 1 distances between design points (similar results were also found in [2]). Following this, it is natural to allow kernels of other types, especially ones defined by different distance metrics. Here, we employ node-specific anisotropic exponentiatedquadratic kernels relying on squared Euclidean distances. Though these are parameterised, this enables covariance structures, more flexible than the one implied by the standard BART, to be learned non-parametrically when T > 1, which would be too difficult to pre-specify under a single GP, or even a sum of GPs without tree splits.
The treed Gaussian process [tGP; 16] is another treed approach to GPs which defines all hyperparameters of a single GP at the terminal node level, thereby making it possible to incorporate non-stationarity into the model by varying the residual precision parameter across terminal nodes. However, to deal with the changing dimensions of the parameter space associated with growing and pruning a tree, this model requires the use of a reversible jump [18] algorithm, which comes with increased computational costs. Our GP-BART can also be seen as an additive ensemble of these treed GPs; though we define our priors and associated hyperparameters differently, the additive nature of the sum of GPs is shown here to yield superior performance. Finally, another example of previous work combining BART and GPs is provided by Wang et al., who use node-level GPs differently, as an extrapolation strategy for improving BART's predictions for exterior points outside the range of the training data [34]. The authors describe their approach as a 'GPed tree', in contrast to the 'treed GP' of Gramacy and Lee, and by extension GP-BART's ensemble of treed GPs.
We envisage our novel GP-BART framework being particularly appropriate for spatial data where smoothness in space is expected for certain covariate combinations, and thus useful in situations where Gaussian processes are commonly used [e.g., 3,13,1,37]. For situations where the trees contain split rules over spatial variables, we introduce a further novelty to allow for rotated splits. Traditional tree-based models can be interpreted as hyper-rectangles since each one of the nodes is given in parallel-axis directions. This behavior leads to a staircase decision boundary which can inhibit the model's ability to approximate true boundaries. García-Pedrajas et al. [12] propose non-linear projections of the tree models used in ensemble approaches to overcome this limitation, while Menze et al. describe an oblique forest model which selects optimal oblique directions using linear discriminant analysis [28]. More recently, Blaser and Fryzlewicz proposed random rotation ensembles where the direction of rotation is selected randomly, yielding a more general decision boundary. In our GP-BART approach, the randomness from the rotated splitting rules permits the tree search to explore a wider sample space of the tree distribution and avoid poor mixing [36]. The rotation moves can also be interpreted as another way to represent and model complex interactions among pairs of continuous variables and should not be seen as strictly restricted to spatial variables. Figure 1 summarises the main idea of our proposed statistical model, highlighting the modified terminal node priors and the rotated splitting rules. Figure 1: Graphical representation of four example trees from a GP-BART model. The splitting rules in each tree can take the form of a univariate cut-off for continuous covariates, a subset of factor levels for categorical covariates, or rotated split rules obtained by random projections of a pair of continuous covariates. Gaussian process priors are assumed for the predicted values for each terminal node in each tree, such that ψ t ∼ GP t a priori.
The remainder of this paper is structured as follows. Section 2 describes the GP-BART model, with mathematical formulations and key specifications. Section 3 contains the sampling algorithm and describes prediction settings and uncertainty estimation. Sections 4 and 5 provide comparisons between GP-BART and other methods in simulated and real-data benchmarking scenarios, respectively. Finally, Section 6 presents conclusions regarding the proposed algorithm, some limitations, and potential future work.
Since the tree follows a binary structure, each new node will be determined by splitting rules of the form x (j) ≤ c x (j) vs. x (j) > c x (j) for continuous predictors (where c x (j) is a scalar uniformly randomly selected from the observed values of x (j) , a specific covariate in X, the matrix of predictors in the training set) and x (j) ∈ d x (j) vs. x (j) / ∈ d x (j) for categorical ones (where d x (j) represents one of the possible outcome levels for that variable). For a single tree T 1 with b 1 terminal nodes, the model is written as y i | x i ∼ N(g(x i ; T 1 , G 1 ), τ −1 ), where the function g assigns the predicted values ψ 1 from GP 1 to the observations belonging to terminal node . Expanding such a model into a sum-of-trees structure is achieved via where the parameters G t = ({µ t1 , φ t , ν}, . . . , {µ tbt , φ t , ν}) now characterise the terminal node GPs of each tree T t , now denoted by GP t (µ t , Ω t (φ t , ν)), ∀ = 1, . . . , b t , where µ t = (µ t , . . . , µ t ) is again a constant vector, φ t = {φ t1 , . . . , φ tp } ∈ R p is now specific to each tree, in addition to each variable, and g now assigns the predicted values ψ t from GP t . The GP-BART model can be interpreted as a piecewise sum of non-linear GPs whereby each of the T trees will make a small contribution to the overall E[y i | x i ], whereas BART can be interpreted as a less flexible piecewise sum of constants. Consequently, GP-BART typically requires fewer trees than the standard BART model.

Prior Model
As in standard BART, we require prior distributions for the tree structure and terminal node parameters; i.e., (T 1 , G 1 ), . . . , (T t , G t ). We assume ν is fixed and select the following shrinkage priors assuming independence between trees and terminal nodes: We follow [7] in our selection of priors for T t and τ and adopt data-driven priors for the node-level µ t in such a way that considerable probability is assigned around the range of the observed y given the induced prior from the sum of GPs. Associated hyperparameters are omitted from Equations (1) and (2), for brevity, but we now fully define each prior in turn.

The tree structure
The prior π (T t ) is specified using the standard setting given by [6]. Thus, the tree prior distribution is implicitly defined by a generating stochastic process. The tree generation is initialised with a root node and a splitting variable is uniformly randomly selected among the possible p covariates in X. In the standard BART algorithm, the structure is learned via grow, prune, change, and swap moves. Thus, new trees are proposed by growing a new terminal node, removing a pair of terminal nodes, changing the split rule for an internal node, and swapping the split rules for a pair of internal nodes, where the type of move is chosen at random. Each proposed tree is then accepted or rejected via Metropolis-Hastings (MH); see [6] for further details. Notably, the swap move is not incorporated by GP-BART, as per the bartMachine implementation of BART by Kapelner and Bleich [23], due to computational complexity and the tendency of GP-BART to yield shallower trees, for which proposing such swap moves would not be feasible.
We introduce two additional moves which we name 'grow-project' and 'change-project'. Notably, these moves can also be incorporated to improve prediction in the standard BART. For a given pair of randomly selected continuous covariates, they provide random projections of the feature space when growing or changing nodes. Such rules in the projection space yield rotated rules in the original space. It is possible to add the rotated components of the coordinate system using the rotation matrix where the angle θ is uniformly randomly selected among a grid of values within the interval [0, π/4] at each split. A uniform prior over the range of the given covariate is defined and the splitting rule for that node is sampled uniformly. With the addition of these two new moves, the proposal distribution of a new tree is described by an uniform sample of the possible grow, grow-project, change, change-project, and prune moves: with respective probabilities of 0.15 , 0.15, 0.2, 0.2, and 0.3. Here, the probabilities associated with grow and grow-project moves are chosen to sum to the usual probability of a grow move in bartMachine [23]. The same rationale applies to the probabilities associated with the change and change-project moves. This makes apparent the sense in which the rotated versions of the grow and change moves are distinct proposals in their own right. In cases where axis-aligned splits are sufficient, proposed rotated rules would simply be rejected. For the choice of the split rule for the grow and change moves, we select it by sampling the cutpoint c are respectively, the minimum and maximum values of the selected split variable x (j) in the branch. We alter this approach slightly for the growproject and change-project moves. For a given randomly selected pair of continuous predictors x (j) and x (h) , rotated with respect to the angle θ, we obtain a transformed feature space of the form {x are, respectively, the minimum and maximum values of the selected variable x (h) in the branch in the transformed feature space. This ultimately leads to rules of the form x (j) ≤ θx (h) in the original feature space.
Finally, the probability of an individual node at depth d = 0, 1, 2, . . . being non-terminal is controlled by the hyperparameters α and β through resulting in a tree prior π (T t ) which is a product of the probabilities of each node, since Equation (4) assumes independence between nodes. We fix the default values α = 0.95 and β = 2 in order to prioritise shallower trees [7].

The prior on the Gaussian processes
The main contribution of the GP-BART model is to define as a GP prior over the set of n t observations belonging to terminal node of tree T t , where 1 n t is a vector of ones of length n t , such that the mean vector is constant. Here, Ω t ∈ R n t × R n t is specified as a node-specific, stationary, anisotropic matrix of exponentiated-quadratic covariance terms, with its (i, k)-th element given by We normalise all predictors to improve the numerical stability of the kernel. Notably, the trees themselves are unaffected by this choice, as the splitting rules governing their structure are invariant to monotone transformations.
to exploit conjugacy and enable all µ t parameters to be marginalised out. This enables Equation (5) to be redefined as in order to encourage better mixing. We adopt this likelihood formulation throughout and provide further details in Appendix A. Chipman et al. showed that the induced prior distribution on E[y i | x i ] over all T trees in a BART model allows for some expert knowledge to be incorporated about the contribution of each tree which can help to guide the choices of hyperparameter values. However, the presence of the GP priors on ψ t in GP-BART yields a different induced prior which we write as Following the Chipman et al. approach, the key idea is to select the hyperparameters such that E[y i |x i ] is between y min and y max with high probability.
for a chosen k. We adopt k = 2, which represents an approximate 95% confidence interval. Following Chipman et al., we re-scale y such that y min = −0.5 and y max = 0.5, set µ µ = 0, and hence set the precision parameters to Though ν and τ µ are both referred to as precision parameters, their roles and interpretations differ, with ν and τ µ -both of which are fixed rather than estimated -being the parameters that control the precision of the GPs and the µ t parameters, respectively. As we increase the number of trees T , the scale ν −1 of each GP decreases, regularising the model by setting each tree contribution to be small, thereby reducing the chance of only one single GP dominating the model. Likewise, the precision of the µ t parameter is proportional to the number of trees, shrinking the mean of each terminal node as more tree components are added into the model.

The prior on the length parameter
The prior for the length parameter φ tj for a given tree t and covariate j is described by the mixture of gamma distributions shown in Equation (8). This prior enables the use of Automatic Relevance Determination (ARD) over the variables used in the GP [29]. As shown in Equation (6), φ t controls the rate of decay with respect to the L 2 distances between pairs of design points, such that larger values of φ tj will quickly decrease the contribution of variables that are uncorrelated to the true generation function f (x i ).
where Ga(a, d) denotes a gamma distribution with expectation a/d and κ is a mixture weight, which we set throughout to 0.3 to reflect the fact that, a priori, we expect fewer variables to contribute meaningfully to the GPs. The φ tj sampling processes is done using a Metropolis-Hasting algorithm, with a proposal distribution for new parameter given by φ tj ∼ Unif (0.1, 0.5, 1, 1.5, 2, 3, 4, . . . , 10, 50) , to reflect the fact that the precise magnitude of φ tj is only important for small values. The aforementioned normalisation of each predictor in X also aids the elicitation of this prior, by minimising the range of φ tj and ensuring all covariates are on the same scale.

The prior on the residual precision
A conjugate gamma distribution τ ∼ Ga(a τ , d τ ) is assumed for the residual precision parameter. To select the hyperparameters, we follow Chipman et al. in setting the shape a τ and rate d τ such that Pr (τ ≥τ OLS ) = η τ , where η τ is a high-probability value (we typically use η τ = 0.9) andτ OLS is the precision calculated from an ordinary linear regression of y against the same set of predictors X. The intuition behind this estimation strategy comes from the idea that, given the non-linearity of the GP and the piecewise additive component from BART, we can be optimistic that the precision of the model is greater than that of a linear model.

Computational algorithms for inference and prediction
Given the observed y, the posterior distribution for the trees and their parameters is given by We define the notation of a generic set M −t as the the set of all M 1 , . . . , M T elements except M t , such that T −t corresponds to the set of T −1 trees except T t with respective terminal node parameters G −t . The key feature necessary to sample from Equation (9) is the "Bayesian Backfitting" algorithm [20], which enables iterative sampling of the t-th tree and its parameters. Hastie and Tibshirani showed that the distribution π (T t , G t | T −t , G −t , τ, y)) can be rewritten in terms of the partial residuals The general structure of the sampler is thus given by: 1 : The algorithm is initialized with T stumps (i.e., trees with a single root node), with all mean parameters µ 1t = 0, all length parameters φ tj = 1, and residual precision parameter τ = 1. Initially, only grow and grow-project moves are proposed. Thereafter, once trees have reached sufficient depth d = 1, new trees T t are sequentially proposed by randomly selecting one of the five available moves: grow, grow-project, change, change-project, and prune, and then accepted or rejected according to a MH step.
The changes in the number of parameters in G t as these moves modify the tree depth do not affect the sampling of T t since ν is fixed, φ t is specified at the tree level, and all mean parameters µ t are marginalised out, yielding a tractable tree posterior proportional to π(T t )π(φ t )π(τ )π(R t | T t , φ t , ν, τ µ , τ ), which does not depend on any varying-dimensional parameters at the terminal node level. The terminal node parameters G t are updated by a Gibbs sampling scheme, with the associated full conditional distributions where with Λ t = τ −1 µ 1 n t 1 n t + Ω t and I n t being an identity matrix of the indicated dimension.
Lastly, we sample the length parameters φ tj ∀ (j = 1, . . . , p, t = 1, . . . , T ) using MH steps with a uniform proposal distribution as per Equation (8). Once all T trees are updated, the precision parameter is sampled using a Gibbs step, with the full conditional given by whereŷ ≡ T t=1 g(X; T t , G t ) represents the sum of the predictions ψ t across all terminal nodes from all sampled trees.

Algorithm specifications and initialisation
We set the number of trees T to have a default value of 10, since we require fewer trees than BART due to the inherent non-linearity of the GPs and achieved reasonable predictive performance in various scenarios demonstrated in Sections 4 and 5 using this value. Alternatively, this quantity could be selected via cross-validation, though the computational cost of doing so may be prohibitive. The inclusion, or not, of the rotated splitting rules is also a setting of the model that can be calibrated by the user, as well as which variables are eligible to be chosen to define rotated split rules, or included in the GPs themselves. By default, GP-BART includes the rotated splits over all continuous variables, since it improves the model's prediction in general, especially for spatial data. By default, if there is no strict prior knowledge about the covariates, all are used in the GPs. Though a more parsimonious model could be achieved if the variables used in the GPs are merely a subset of those used to construct the trees, we do not consider this further here.
We present the full structure of the GP-BART sampler in Algorithm 1, where the matrix of covariates X and response vector y from the training set enter as inputs. Trees, partial residuals, and hyperparameters are then initialised. Sequentially, for each MCMC sample, a proposed tree T t is accepted, if it is valid and contains no empty terminal nodes, with probability γ (T t , T t ). Lastly, the remaining parameters are sampled using Equations (11)- (12). A standard number of iterations N MCMC = 2000, of which the first N burnin = 500 are discarded, were found to yield a sufficient number of samples to reliably characterise the posterior in all applications herein. This was verified through examination of the convergence of the posterior samples of τ . Though the algorithm is computationally onerous given the matrix inversions associated with the use of GPs, we stress that such operations are of the order O(n 3 t ) within a given terminal node, rather than O(n 3 ) as they would be under a single GP.

Algorithm 1: GP-BART sampling algorithm
Input: X, y, T , N MCMC , and all hyperparameters of the priors. Initialise: T tree stumps, τ = 1, and φ tj = 1 ∀ (t, j). for iterations m from 1 to N MCMC do for trees t from 1 to T do Calculate the partial residuals R t via Equation (10); Propose a new tree T t by a grow, grow-project, change, change-project, or prune move; Accept and update T t = T t with probability for terminal nodes from 1 to b t do Update ψ t via Equation (11). end Update φ t using MH. end Update τ via Equation (12). end Output: Approx. samples from π ((T 1 , G 1 ), . . . , (T T , G T ), τ | y).

Prediction in GP-BART
The trees in GP-BART models can provide out-of-sample predictions for a set of n new observations X . For a given terminal node in tree T t for a particular MCMC sample, the joint posterior distribution of the training observations (expressed as partial residuals) and the predictions is given by with Λ t ∈ R n t × R n t and Λ t ∈ R n t × R n t . Here, n t and n t denote the number of observations assigned to terminal node of tree T t for the training samples and new data, respectively. This posterior predictive distribution can be conditioned with respect to the observed ψ t and be given in terms of where Ultimately, the function g assigns the vector µ GP t = E(ψ t | . . .) to the associated new observations X (t ) on a per-iteration basis, such that the estimates from GP-BART are given bŷ where m indexes the draws from the posterior distribution after the burnin iterations. The overall predictionȳ i for a new observation x i is then given by the average of the estimatesŷ Posterior samples from Equation (13) can also be used to quantify the uncertainty in the predictions. For instance, With some large number Q of draws per posterior sample, the endpoints of a (1 − α)% prediction interval for a predicted valueȳ i can be obtained from the upper and lower α/2 quantiles of (ŷ

Simulation studies
In this Section, we present simulation studies to evaluate the performance of GP-BART from two different perspectives. In Section 4.1 we primarily aim to assess the efficacy of incorporating the rotated splitting rules and the GPs themselves for data with explicit spatial components, whereas in Section 4.2 we primarily aim to assess the ARD associated with the mixture of gamma prior on the tree-varying, variable-specific length parameters φ tj .

Benchmarking experiments
In these experiments, the simulated data are composed by a summation of trees with two terminal nodes, built using the variables X = (x (1) , x (2) ). These covariates are simulated such that each is drawn from a Unif(−1, 1) distribution. The values associated with each terminal node follow a multivariate normal distribution with specific mean and covariance parameters. We generate the response variable via with number of trees T = 3, each with two terminal nodes. Here, the nodespecific mean parameters µ t are all constant vectors of the form (µ t , . . . , µ t ), with respective values given by µ 11 = −10, µ 21 = 0, µ 31 = 10, µ 12 = 5, µ 22 = 20, and µ 32 = −15. Multivariate normal spatial terms s t ∼ MVN(0 n t , Ω t (φ t = 3 n t , ν = 0.1)) are added within each terminal node. Noise terms ε ∼ MVN(0 n t τ −1 t I n t ) are also added, with τ = 10. Figure 2 shows the simulated data surfaces for data sets of size n = {100, 500, 1000}, respectively, highlighting the different partitioning behaviour and smoothness within each data set. We compare the performance of our GP-BART model to other tree-based methods, namely BART [7], SoftBART [27], and tGP [16], as well as the universal kriging model [8] and latent Gaussian models using integrated nested Laplace approximations [INLA ; 25]. We evaluate the results using 5 repetitions of 5-fold cross-validation; each fold is treated as a test set and prediction accuracy and uncertainty calibration is quantified using the root-mean-square error (RMSE) and the continuous ranked score probability (CRPS) [15], respectively, over all folds within a given repetition.
The models are fitted using the R packages BART [32], tgp [17], fields [9], and INLA [25], with their default settings, along with an implementation of the SoftBART model provided on the GitHub repository of Linero and Yang. All hyperparameters for the GP-BART model were specified using their default values previously described in Sections 2 and 3. To qualitatively compare the methods, we analyse the prediction surface generated by each algorithm for the data sets of size n = {100, 500, 1000}, shown in Figure  2, using predictions over the test sets in the repeated 5-fold setting. The corresponding plots are provided in Figures 3, 4, and 5, respectively. In each case, results from one randomly chosen repetition of the repeated 5-fold cross-validation are used to construct the plots.   Though the provided plots indicate clear differences between each model type, each model's behaviour is similar across the sample sizes. GP-BART's prediction surfaces appear most similar to the original data shown in Figure  2 in each case. Indeed, GP-BART successfully identifies diagonal partitions due to its rotated splits, while BART, SoftBART, and tGP only produce splits parallel to the axes. Though BART and SoftBART uncover differences among the terminal node regions nonetheless, the assumption of spatial independence renders such predictions unreliable. The tGP, kriging, and INLA predictions capture the spatial features well, but their failure to identify the partitions results in blurred prediction surfaces in areas where the data splits.
A quantitative comparison is shown in the boxplots in Figure 6, which reflect the previous qualitative interpretations. Here, GP-BART presents substantially lower RMSE than its competitors, particularly for smaller n. We assess uncertainty calibration by examining boxplots of CRPS scores in Figure 7. These results show that the GP-BART model presents the lowest CRPS values among all methods. Thus, its performance in terms of uncertainty quantification is superior to the other models considered.  The MH acceptance rates of the newly proposed moves used for learning the tree structures is another relevant aspect of GP-BART which allows us to study the efficacy of the novel grow-project and change-project moves. Table  1 shows the proportion of new trees that were accepted after the burnin phase using each of the five available moves under GP-BART, for each simulated data set, over all 25 folds in total. These results show that the grow-project and change-project moves present acceptance rates which are comparable to their respective standard counterparts. This is most likely attributable to the diagonal splitting rules used in the generation of each data set. To highlight the effect of the proposed moves and the use of GPs over the terminal nodes, four different, restricted versions of GP-BART are compared: (A) without any projection moves or GPs (i.e., the standard BART model); (B) without GPs, but with the addition of the two new moves; (C) without the new moves, but with GPs; (D) the standard GP-BART with both rotated splitting rules and GPs.
We defer the results for the other sample sizes, which lead to similar conclusions, to Appendix B and consider only the most computationally demanding n = 1000 setting here, for brevity. This comparison is summarised in Figures  8 and 9, in which the letters above are used to distinguish the model versions. As before, results based one randomly chosen repetition of the 5-fold cross-validation are used to construct these plots.  The prediction surface (A) in Figure 8 suggests BART cannot adequately capture different behaviours in the terminal node regions due to the lack of smoothness and non-linearity compared with GP-BART. Panel (B) and panel (C) both compare reasonably well with (D), which highlights the benefits of the rotated split rules and use of GPS, respectively. However, there is an apparent lack of smoothness in the terminal node regions of (B), and visible blurriness in the areas where the data splits in (C). Ultimately, it is evident that combining both innovations in (D) yields the best performance. This conclusion is reinforced by Figure 9, which clearly indicates the superior performance of version (D), the standard GP-BART, which presents the lowest median value according to both presented metrics. The incorporation of rotated splits and GPs in versions (B) and (C), respectively, yield similar RMSE values to (D), but their performance in terms of uncertainty calibration as measured by CRPS differs more substantially. Notably, the standard BART model (A) is unsatisfactory from both points of view.

Friedman data
In this scenario, we consider the Friedman equation [10]: i ∼ Unif(0, 1)∀j = 1, . . . , p and i ∼ N(0, τ −1 ). This equation is used for benchmarking tree-based methods using synthetic data, and has been examined in many other papers, e.g., [7,27]. For these data, we compare GP-BART to its explicitly tree-based competitors, namely BART, SoftBART, and tGP. Though there are no spatial features here, we still anticipate that incorporating GPs and rotated splits will help as there are non-linear smooth interactions in these data.
Here, we specify τ = 100, n = 500, and consider two versions of the same data; firstly with p = 5 and secondly with p = 10 features, of which the first 5 are those from the first scenario. As the Friedman equation uses only 5 covariates to generate the response, the additional five predictors in the second scenario are uninformative noise variables with no effect on y i . Figure  10 shows that GP-BART outperforms the other methods and presents good performance in terms of predictive accuracy and uncertainty calibration, using the RMSE and CRPS metrics as above. Subsequently, Figure 11 shows the same comparison, this time with the additional 5 noise variables.  The latter comparison is also favourable to GP-BART. In particular, these results show that the uninformative variables do not have a detrimental effect on GP-BART's performance. This can be attributed to the mixture of gamma priors assumed for the φ tj parameters automatically diminishing their influence on the kernels of the GPs. Conversely, the detrimental effects of such variables on the RMSE and CRPS values under BART, SoftBART, and tGP are more readily apparent, when one compares Figure 10 and Figure  11. The performance deterioration is especially notable for tGP. Figure 12 demonstrates the effectiveness of the ARD by examining boxplots of the minimum values of φ tj for each variable (in both the p = 5 and p = 10 scenarios) over all trees and all accepted MH proposals in the retained posterior samples across each repetition of 5-fold cross-validation. In the first scenario (p = 5), the two variables which are merely related linearly to the response are shown to be associated with moderately higher values. In the p = 10 scenario, the model selects large values for the 5 extra noise variables which are unrelated to the response, whereas small values are selected for the informative predictors, such that they contribute meaningfully to the GPs.

Applications
In this Section, we appraise the predictive performance of GP-BART compared to BART, SoftBART, tGP, kriging, and INLA on diverse real data sets, as a larger and more challenging test of GP-BART's capabilities. For illustration, we use four public data sets containing spatial features; i.e., with inherent dependence over the observations. These data sets are: 1. The Auckland data set consists of 166 rows describing infant mortality in Auckland, with two spatial covariates and the target variable [4]. 2. The Baltimore data set comprises 221 observations of house sales prices, two spatial features, and 13 other covariates, not all of which are continuous [4]. 3. The Boston data set contains 506 observations of the median values of owner-occupied suburban homes, two spatial features, and 13 other covariates, not all of which are continuous. We model a corrected version [14] of the original data [19]. 4. Swmud is a data set of seabed mud content in the southwest Australia Exclusive Economic Zone with 177 observations of two sets of spatial coordinates and mud content as the target variable [24].
Our implementations of each algorithm follow their respective default settings, including those previously described in Sections 2 and 3 for GP-BART. As before, 5 repetitions of 5-fold cross-validation are used to evaluate performance. Categorical and integer-valued features cannot be accommodated in the GPs under the present parameterisations of GP-BART and tGP. Hence, for the Baltimore and Boston data sets, we include only the continuous covariates here for all models in the comparison. In each case, the strictly spatial continuous features represent the exact coordinates of the instances. The results are summarised in Figure 13 and Figure 14, which show the RMSE and CRPS values over all data sets, respectively. According to Figure  13, GP-BART presents the lowest median RMSE for all but the Baltimore data set. The differences are most pronounced for the Swmud data (especially in the comparison between GP-BART and BART) and the Boston data, for which kriging and INLA perform notably worse than all tree-based methods. Figure 14 shows that the CRPS values produced by GP-BART are similarly favourable when compared with the performance of the other algorithms, with GP-BART exhibiting the lowest or second-lowest median CRPS value for three of the four data sets. Note that boxplots of the CRPS values for kriging are omitted from 14 for the sake of visual clarity, as they are well outside the range of those for the other models in the comparison. The superior calibration of GP-BART is most apparent for the Swmud data set, followed by the Boston data set, for which INLA performs notably worse than all tree-based methods. Jointly considering both the predictive accuracy and the uncertainty calibration, GP-BART was able to consistently yield superior or competitive predictions.  Another aspect of performance evaluation for each model is illustrated in Figure 15, which presents the average RMSE rank for each of the 25 test partitions from the repeated cross-validations. Ranks are defined here such that the model yielding the lowest mean RMSE is given a rank of 1, while the one with the worst prediction performance is given the highest possible rank of 6, for each test partition. From Figure 15, we can see that GP-BART has by far the lowest average RMSE rank for the Swmud data set. For the Auckland data, GP-BART's performance in this regard is also the best followed right after by INLA, where both jointly outperform the other methods. For the Boston data, SoftBART and GP-BART are highly competitive, though GP-BART presents a slightly lower average ranking. Finally, GP-BART's performance on the Baltimore data is notably worse than other methods based on trees and GPs, particularly compared to the standard BART model. Figure 15: RMSE ranks for all six competing models over the four benchmark data sets, averaged over all five repetitions of the 5-fold cross validation. The ranks range from 1 to 6, with lower ranks being associated with lower mean RMSE values. Figure 16 also relies on average ranks, though here using CRPS as the metric of comparison in order to evaluate the uncertainty quantification of the models. As per Figure 15, GP-BART yields the best performance for the Swmud and Auckland data and performs poorly on the Baltimore data. For the Boston data set, SoftBART and tGP outperform all other methods, but GP-BART achieves the next-lowest mean rank. Following its omission from Figure 14, kriging's CRPS performance is by far the worst for each data set.

Discussion
In this paper, we proposed GP-BART as an extension to the standard BART model. Our model uses Gaussian processes (GPs) to make observationspecific predictions at the terminal node level, and is thus able to capture non-linear relations and spatial dependence through the covariance structure of the GPs. In addition, our novel model also allows the use of projected splitting rules to build rotated partitions, which enable more flexibility in the tree representations.
The performance of GP-BART was evaluated over a number of simulated scenarios, where the model outperformed BART, restricted versions of GP-BART itself without the use of GPs and/or novel projected splitting rules, and another unrelated BART extension. Our benchmarking studies also highlighted GP-BART's superior performance relative to some spatial models, namely basic kriging and INLA. Our second simulation setting, using data generated according to the well-known Friedman equation, without explicit spatial components, was also favourable to GP-BART over other tree-based methods. In particular, these results demonstrated GP-BART's insensitivity to the inclusion of noise variables through the use of ARD.
When tested on real applications, using out-of-sample data via 5 repetitions of 5-fold cross-validation, GP-BART displayed competitive predictive capabilities, beating many of the established methods. We also compared the calibration properties of our method using CRPS; again, GP-BART performed as well or better than competing methodologies in almost all cases. Overall, in terms of predictive accuracy and uncertainty quantification, GP-BART consistently showed promising performance from both perspectives.
There are several potential issues remaining with the model, which may also provide opportunities for future research: • Careful choices have been made regarding the specification of prior distributions for the model parameters because the trees and the GPs can compete to explain the variability in the data. We have endeavoured to set sensible default parameters throughout. However, a more substantial study might suggest general rules as to how these parameters might be elicited in light of certain data set properties.
• The model can be computationally challenging to fit for larger data sets, since the calculation of each terminal node's contribution to the overall likelihood involves inverting each associated covariance matrix, though the cost is substantially reduced from O(n 3 ) under a single GP to O(n t ) 3 per node. Marginalising the GP mean parameters also speeds up the algorithm. However, we are also working on a C++ implementation of the model to further reduce the computational burden, which will soon be available. Approximations for the GP -using, for example, the Nyström method [35] or the methods of Quiñonero-Candela et al. [31] -might also prove to be advantageous in future work.
• Though the ARD appears to adequately account for irrelevant variables in the applications considered herein, there is further scope for exploiting variable-selection from the BART component. At present, all continuous predictors used to construct the trees are used in the GPs, which need not be the case. It may be beneficial to restrict the GPs only to the variables used to define splits along the given branch, though this would come with significant additional computational costs.
• In the applications herein, we have focused on the use of GP-BART for spatial data sets, but there is nothing to prohibit the model being used in generic machine learning tasks. However, we have restricted the GPs to be covariance-stationary through our use of anisotropic exponentiated-quadratic kernels, which are governed only by scalar rate and tree-level, variable-specific length parameters. A superior approach may introduce non-stationarity to the autocovariance and hence produce more flexible GP surfaces. Indeed, though the model outperforms its competitors in all simulation experiments and most of the real data sets analysed above, the performance on the Baltimore data set is poor. This is most likely attributable to the underlying exponentiated-quadratic kernel functions used in our parameterisation of the GP components being inappropriate for these data. This reaffirms that investigating alternative kernels could be of great interest for future work, particularly kernels capable of accommodating the noncontinuous features we discarded in our analysis of these data.
We hope to report on these developments as part of our future research plans.
applying the log, and then summing over the terminal nodes, we obtain log π (R t | T t , φ t , ν, τ ) = log C − 1 2 where C is a constant of proportionality. Recalling Λ t = τ −1 µ 1 n t 1 n t + Ω t , this expression can be further simplified with the constant µ t parameters explicitly absorbed into the kernel of the GP. This yields the following distribution for the partial residuals R t | T t , φ t , ν, τ ∼ MVN 0 n t , τ −1 I t + Λ t , which circumvents the need to sample the µ t parameters and encourages better mixing.

B. Performance evaluation for restricted versions of GP-BART
The results of a comparison between different versions of GP-BART for simulated data with n = 1000 are illustrated in Figure 8, showing predicted surfaces, and Figure 9, showing boxplots of the RMSE and CRPS values. For completeness, we provide here the analogous plots for the other sample sizes considered in the simulation study, with predicted surfaces and boxplots for the n = 100 data in Figures and B.2, respectively, and equivalent plots for the n = 500 data in Figures B.3 Figure 9 for the n = 1000 benchmarking experiment in Section 4.1. These versions clearly demonstrate the efficacy of the novel grow-project and change-project moves and the use of GP priors over terminal nodes, in that they show improved performance relative to the standard BART according to both metrics, but incorporating both innovations under GP-BART (D) yields the best performances. Regarding Figures B.1 and  B.3, the predicted surface under GP-BART is the one which is closest to the observed data in each case.     improve on the standard BART (A) in each case, GP-BART remains the superior method from both perspectives at each value of n. Though, the difference between it and its competitors in terms of RMSE becomes less pronounced as n increases, GP-BART remains by far the best from the point of view of calibration. Interestingly, there is no clear tendency for version (B), which adds rotated split rules only, or version (C), which adds GPs only, to be second best. This reaffirms that combining both innovations is necessary to achieve the best performance.