tgp : an R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models

The tgp package for R [25] is a tool for fully Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes with jumps to the limiting linear model. Special cases also implemented include Bayesian linear models, linear CART, and stationary separable and isotropic Gaussian processes. In addition to inference and posterior prediction, the package supports the (sequential) design of experiments under these models paired with several objective criteria. 1-d and 2-d plotting functions, with higher-dimension projection and slice capabilities, are provided for visualization of posterior predictive surfaces.

The outline is as follows. Section 1 introduces the functions and associated regression models implemented by the tgp package, including plotting and visualization methods. The Bayesian mathematical specification of these models is contained in Section 2. In Section 3 the functions and methods implemented in the package are illustrated by example. The appendix covers miscellaneous topics such as how to link with the ATLAS libraries for fast linear algebra routines, compile-time support for Pthreads parallelization, the gathering of parameter traces, the verbosity of screen output, and some details of the implementation.

Motivation
Consider as motivation the Motorcycle Accident Dataset [30]. It is a classic data set used in recent literature [26] to demonstrate the success of nonstationary regression models. The data consist of measurements of the acceleration of the head of a motorcycle rider as a function of time in the first moments after an impact. Many authors have commented on the existence of two, perhaps three, regimes in the data over time, where the characteristics of the mean process and noise level change (i.e., nonstationarity and heteroskedasticity, respectively). It can be interesting to see how various candidate models handle this nuance.
Figure 1 shows a fit of this data using a standard (stationary) Gaussian process (GP; left), and the treed GP model (right). Notice how the stationary GP model is unable to capture the smoothness in the linear section(s), nor the decreased noise level. We say that the standard GP model is stationary because it has a single fixed parameterization throughout the input space. An additive model would be inappropriate for similar reasons. In contrast, the treed GP model is able to model the first linear part, the noisy "whiplash" middle section, and the smooth (possibly linear) final part with higher noise level, thus exhibiting nonstationary modeling behavior and demonstrating an ability to cope with heteroskedasticity.
The remainder of this paper describes the treed GP model in detail, and provides illustrations through example. There are many special cases of the treed GP model, e.g., the linear model (LM), treed LM, stationary GP, etc. These are outlined and demonstrated as well.
1 What is implemented?
The tgp package contains implementations of seven Bayesian multivariate regression models and functions for visualizing posterior predictive surfaces. These models, and the functions which implement them, are outlined in Section 1.1. Also implemented in the package are functions which aid in the sequential design of experiments for tgp-class models, which is what I call adaptive sampling.
Figure 1: Fit of the Motorcycle Accident Dataset using a GP (top) and treed GP model (bottom). The x-axis is time in milliseconds after an impact; the y-axis is acceleration of the helmet of a motorcycle rider measured in "g's" in a simulated impact.
These functions are introduced at the end of Section 2 and a demonstration is given in Section 3.6.

Bayesian regression models
The seven regression models implemented in the package are summarized in Table 1. They include combinations of treed partition models, (limiting) linear models, and Gaussian process models, as indicated in Table 1. The b* functions are intended as the main interface, so little further attention to the tgp master function will be included here. The easiest way to see how the master tgp function implements one of the b* methods is to simply type the name of the function of interest into R. For example, to see the implementation of bgp, type:

> bgp
The output (return value) of tgp and the b* functions is a list object of class "tgp". This is what is meant by a "tgp-class" object. This object retains all of the relevant information necessary to summarize posterior predictive inference, maximum a posteriori (MAP) trees, and statistics for adaptive sampling. Information about its actual contents is contained in the help files for the b* functions. Generic print, plot, and predict methods are defined for tgp-class objects. The plot and predict functions are discussed below. The print function simply provides a list of the names of the fields comprising a tgp-class object.
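As a quick illustration of the interface, the following sketch fits a stationary GP to toy 1-d data and inspects the resulting tgp-class object. The data here are made up for illustration; the exact field names are best checked against the b* help files.

```r
library(tgp)

## toy 1-d data (illustrative only)
X <- seq(0, 20, length = 50)
Z <- sin(pi * X / 5) + rnorm(length(X), sd = 0.1)

fit <- bgp(X = X, Z = Z)  # Bayesian stationary GP regression
class(fit)                # "tgp" -- a tgp-class object
names(fit)                # the fields listed by the print method
```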

Plotting and visualization
The two main functions provided by the tgp package for visualization are plot.tgp, inheriting from the generic plot method, and a function called tgp.trees for graphical visualization of MAP trees.
The plot.tgp function can make plots in 1-d or 2-d. Of course, if the data are 1-d, the plot is in 1-d. If the data are 2-d, or higher, they are 2-d image or perspective plots unless a 1-d projection argument is supplied. Data which are 3-d, or higher, require projection down to 2-d or 1-d, or specification of a 2-d slice. The plot.tgp default is to make a projection onto the first two input variables. Alternate projections are specified as an argument (proj) to the function. Likewise, there is also an argument (slice) which allows one to specify which slice of the posterior predictive data is desired. For models that use treed partitioning (those with a T in the center column of Table 1), the plot.tgp function will overlay the region boundaries of the MAP tree T found during MCMC.
A few notes on 2-d plotting of tgp-class objects:

• 2-d plotting requires interpolation of the data onto a uniform grid. This is supported by the tgp package in two ways: (1) loess smoothing, and (2) the akima package, available from CRAN. The default is loess because it is more stable and does not require installing any further packages. When akima works it makes (in my opinion) smarter interpolations. However, there are two bugs in the akima package, one malign and the other benign, which preclude it from the default position in tgp. The malign bug can cause a segmentation fault, bringing down the entire R session. The benign bug produces NAs when plotting data from a grid. For beautiful 2-d plots of gridded data I suggest exporting the tgp predictive output to a text file and using gnuplot's 2-d plotting features.
• The current version of this package contains no examples, and neither does this document, demonstrating the plotting of data with dimension larger than two. The example provided in Section 3.5 uses 10-d data, but no plotting is required there. tgp methods have been used on data with input dimension as large as 15 [14], and were used in a sequential design and detailed analysis of some proprietary data, with 3-d inputs and 6-d outputs, sampled using a NASA supercomputer [16].
• The plot.tgp function has many more options than are illustrated in Section 3 of this document. Please refer to the help files for more details.
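For instance, with 3-d inputs one might plot a projection or a slice as follows. X3 and Z3 are hypothetical 3-d data, and the exact form of the slice argument should be checked against the plot.tgp help file; this is only a sketch.

```r
## hypothetical: X3 is an n x 3 design matrix, Z3 the response
fit <- btgp(X = X3, Z = Z3)

plot(fit)                    # default: project onto the first two inputs
plot(fit, proj = c(1, 3))    # alternate 2-d projection onto inputs 1 and 3
plot(fit, proj = c(2))       # 1-d projection onto input 2
plot(fit, slice = list(x = 3, z = 0))  # a 2-d slice fixing input 3 at 0
                                       # (argument form: see ?plot.tgp)
```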
The tgp.trees function provides a diagrammatic representation of the MAP trees of each height encountered by the Markov chain during sampling. The function will not plot trees of height one, i.e., trees with no branching or partitioning. Plotting of trees requires the maptree package, which in turn requires the combinat package, both available from CRAN.

Prediction
Prediction, naturally, depends on fitted model parameters θ|data, or Monte Carlo samples from the posterior distribution of θ in a Bayesian analysis. Rather than saving samples from π(θ|data) for later prediction, usually requiring enormous amounts of storage, tgp samples the posterior predictive distribution inline, as samples of θ become available. [Sections 2.1.4 and 2.2.1 outline the prediction equations.] A predict.tgp function is provided should it be necessary to obtain predictions after the MCMC has finished.
The b* functions save the MAP parameterization θ maximizing π(θ|data). More specifically, the "tgp"-class object stores the MAP tree T and corresponding GP (or LLM) parameters θ|T found while sampling from the joint posterior π(θ, T|data). These may be accessed and used, via predict.tgp, to obtain posterior predictive inference through the MAP parameterization. In this way predict.tgp is similar to predict.lm, for example. Samples can also be obtained from the MAP-parameterized predictive distributions via predict.tgp, or a re-initialization of the joint sampling of the posterior and posterior predictive distribution can commence starting from (θ, T).
The output of predict.tgp is also a tgp-class object. Appendix B.3 illustrates how this feature can be useful in the context of passing tgp model fits between collaborators. There are other miscellaneous demonstrations in Section 3.
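A minimal sketch of post-MCMC prediction via the MAP parameterization, analogous to predict.lm. The data X, Z and the locations XXnew are hypothetical placeholders.

```r
## hypothetical data X, Z; fit any b* model
fit <- bgp(X = X, Z = Z)

## new predictive locations, chosen after the MCMC has finished
XXnew <- seq(0, 20, length = 25)

p <- predict(fit, XX = XXnew)  # dispatches to predict.tgp
class(p)                       # also a "tgp"-class object
```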

Speed
Fully Bayesian analyses with MCMC are not the speediest of statistical methods. Nor is inference for GP models, classical or Bayesian. When the underlying relationship between inputs and responses is nonlinear, GPs represent a state-of-the-art phenomenological model with high predictive power. The addition of axis-aligned treed partitioning provides a divide-and-conquer mechanism that can not only reduce the computational burden relative to the base GP model, but can also facilitate the efficient modeling of nonstationarity and heteroskedasticity in the data. This is in stark contrast to other recent approaches to nonstationary spatial models (e.g., via deformations [8,28], or process convolutions [19,12,24]) which can require orders of magnitude more effort relative to stationary GPs.
Great care has been taken to make the implementation of Bayesian inference for GP models as efficient as possible [see Appendix A]. However, inference for non-treed GPs can be computationally intense. Several features are implemented by the package which can help speed things up a bit. Direct support for ATLAS [32] is provided for fast linear algebra. Details on linking this package with ATLAS are contained in Appendix C.1. Parallelization of prediction and inference is supported by a producer/consumer model implemented with Pthreads. Appendix C.2 shows how to activate this feature, as it is not turned on by default. An argument called linburn is made available in the b* functions for treed models (those with a T in Table 1). When linburn = TRUE, the Markov chain is initialized with a run of the Bayesian treed linear model [5] before burn-in, in order to pre-partition the input space using linear models. Finally, thinning of the posterior predictive samples obtained by the Markov chain can also help speed things up. This is facilitated by the E-part of the BTE argument to b* functions.
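The speed-related options above might be combined as in the following sketch. The specific BTE values are illustrative, not recommendations, and X, Z are placeholder data.

```r
## treed GP LLM with linear-model pre-partitioning and heavier thinning
fit <- btgpllm(X = X, Z = Z,
               linburn = TRUE,           # burn in with the treed LM first
               BTE = c(2000, 7000, 10))  # Burn-in, Total, Every:
                                         # keep every 10th predictive sample
```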

Sequential design of experiments
Sequential design of experiments, a.k.a. adaptive sampling, is not implemented by any single function in the tgp package. Nevertheless, options and functions are provided in order to facilitate the automation of adaptive sampling with tgp-class models. A detailed example is included in Section 3.6.
Arguments to b* functions, and tgp, which aid in adaptive sampling include Ds2x and improv. Both are booleans, i.e., should be set to TRUE or FALSE (the default for both is FALSE). TRUE booleans cause the tgp-class output list to contain vectors of similar names with statistics that can be used toward adaptive sampling. When Ds2x = TRUE, the ∆σ²(x) statistic is computed at each x ∈ XX, in accordance with the ALC (Active Learning-Cohn) algorithm [6]. Likewise, when improv = TRUE, statistics are computed in order to assess the expected improvement (EI) for each x ∈ XX about the global minimum [20]. The ALM (Active Learning-MacKay) algorithm [21] is implemented by default in terms of the difference in predictive quantiles at the inputs XX, which can be accessed via the ZZ.q output field. Details on the ALM, ALC, and EI algorithms are provided in Section 2.
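Gathering all three adaptive sampling statistics at candidate locations XX might look like the following sketch (X, Z are placeholder data; field names should be confirmed against the b* help files).

```r
XX <- seq(0, 20, length = 99)   # candidate locations
fit <- btgp(X = X, Z = Z, XX = XX,
            Ds2x = TRUE,        # gather ALC statistics
            improv = TRUE)      # gather EI statistics

## fit$ZZ.q   -- ALM: width of predictive quantiles at XX (on by default)
## fit$Ds2x   -- ALC: Delta-sigma^2(x) statistics
## fit$improv -- EI statistics about the global minimum
```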
Calculation of EI statistics was considered "beta" functionality while this document was being prepared. At that time it had not been adequately tested, and its implementation changed substantially in subsequent versions of the package. For updates see the follow-on vignette [17], or vignette("tgp2"). That document also discusses sensitivity analysis, handling of categorical inputs, and importance tempering.
The functions included in the package which explicitly aid in the sequential design of experiments are tgp.design and dopt.gp. They are both intended to produce sequential D-optimal candidate designs XX at which one or more of the adaptive sampling methods (ALM, ALC, EI) can gather statistics. The dopt.gp function generates D-optimal candidates for a stationary GP. The tgp.design function extracts the MAP tree from a tgp-class object and uses dopt.gp on each region of the MAP partition in order to get treed sequential D-optimal candidates.
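A sketch of generating candidates with these functions, for a 1-d problem with existing design X and a fitted tgp-class object fit (both placeholders); lhs is the Latin hypercube function shipped with tgp, and argument order follows the help files.

```r
rect <- rbind(c(0, 20))            # bounding rectangle of the input domain
Xcand <- lhs(500, rect)            # a large LH pool of candidates

out1 <- dopt.gp(50, X = X, Xcand)  # 50 D-optimal candidates, stationary GP
out2 <- tgp.design(50, Xcand, fit) # 50 treed D-optimal candidates from the
                                   # MAP partition of the fitted object
## see ?dopt.gp and ?tgp.design for the structure of the return values
```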

Methods and Models
This section provides a quick overview of the statistical models and methods implemented by the tgp package. Stationary Gaussian processes (GPs), GPs with jumps to the limiting linear model (LLM; a.k.a. GP LLM), treed partitioning for nonstationary models, and sequential design of experiments (a.k.a. adaptive sampling) concepts for these models are all briefly discussed. Appropriate references are provided for the details, including the original paper on Bayesian treed Gaussian process models [14], and an application paper on adaptively designing supercomputer experiments [16].
On a first pass through this document, it might make sense to skip this section and go straight on to the examples in Section 3.

Stationary Gaussian processes
Below is a hierarchical generative model for a stationary GP with linear trend, for data D = {X, Z} consisting of n pairs of m_X covariates and a single response variable, {(x_{i1}, ..., x_{i m_X}), z_i}_{i=1}^{n}.
X is a design matrix with m_X columns. An intercept term is added with F = (1, X), which has m ≡ m_X + 1 columns, and W is an m × m matrix. N, IG, and W denote the (multivariate) Normal, Inverse-Gamma, and Wishart distributions, respectively. The constants µ, B, V, ρ, α_σ, q_σ, α_τ, q_τ are treated as known.
The GP correlation structure K is chosen either from the isotropic power family, or the separable power family, with a fixed power p_0 (see below), but unknown (random) range and nugget parameters. Correlation functions used in the tgp package take the form K(x_j, x_k) = K*(x_j, x_k) + g δ_{j,k}, where δ_{·,·} is the Kronecker delta function, g is the nugget, and K* is a true correlation representative from a parametric family. The isotropic Matérn family is also implemented in the current version as "beta" functionality.
All parameters in (1) can be sampled using Gibbs steps, except for the covariance structure and nugget parameters, and their hyperparameters, which can be sampled via Metropolis-Hastings [14].

The nugget
The g term in the correlation function K(·, ·) is referred to as the nugget in the geostatistics literature [22,7] and sometimes as jitter in the machine learning literature [23]. It must always be positive (g > 0), and serves two purposes. Primarily, it provides a mechanism for introducing measurement error into the stochastic process. It arises when considering a model of the form z(x) = m(x, β) + ε(x) + η(x), where m(·, ·) is the underlying (usually linear) mean process, ε(·) is a spatial process whose underlying correlation is governed by K*, and η(·) represents i.i.d. Gaussian noise. Secondarily, though perhaps of equal practical importance, the nugget (or jitter) prevents K from becoming numerically singular. Notational convenience and conceptual congruence motivate referring to K as a correlation matrix, even though the nugget term (g) forces K(x_i, x_i) = 1 + g > 1.
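The numerical role of the nugget is easy to demonstrate directly: a dense Gaussian correlation matrix on closely spaced inputs is nearly singular, and a tiny g on the diagonal restores its conditioning. The range value 0.5 below is arbitrary.

```r
x <- seq(0, 1, length = 50)
K0 <- exp(-outer(x, x, "-")^2 / 0.5)  # g = 0: near-singular correlation matrix
Kg <- K0 + diag(1e-6, length(x))      # add a small nugget g to the diagonal

rcond(K0)  # reciprocal condition number: effectively zero
rcond(Kg)  # markedly improved
```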

Exponential Power family
Correlation functions in the isotropic power family are stationary, which means that correlations are measured identically throughout the input domain, and isotropic in that correlations K*(x_j, x_k) depend only on a function of the Euclidean distance between x_j and x_k: K*(x_j, x_k) = exp{ −||x_j − x_k||^{p_0} / d }, where d > 0 is referred to as the width or range parameter. The power 0 < p_0 ≤ 2 determines the smoothness of the underlying process. A typical default choice is the Gaussian p_0 = 2, which gives an infinitely differentiable process.
A straightforward enhancement to the isotropic power family is to employ a unique range parameter d_i in each dimension (i = 1, ..., m_X). The resulting separable correlation function is still stationary, but no longer isotropic.
The isotropic power family is recovered as a special case (when all of the d_i are equal). With the separable power family, one can model correlations in some input variables as stronger than others. However, with added flexibility come added costs. When the true underlying correlation structure is isotropic, estimating the extra parameters of the separable model represents a sort of overkill.
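The two families can be sketched side by side for the Gaussian case p_0 = 2. The function names below are mine, not part of tgp.

```r
## isotropic: one range d shared by all inputs
iso.cor <- function(xj, xk, d, p0 = 2)
  exp(-sum(abs(xj - xk)^p0) / d)

## separable: one range d[i] per input dimension
sep.cor <- function(xj, xk, d, p0 = 2)
  exp(-sum(abs(xj - xk)^p0 / d))

xj <- c(0.1, 0.9); xk <- c(0.4, 0.2)
iso.cor(xj, xk, d = 0.5)
sep.cor(xj, xk, d = c(0.5, 5))  # correlation decays more slowly in input 2
```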

Matérn Family
Another popular set of correlation functions is the Matérn family, owing to its many nice properties [31,24]. Correlations in this family are isotropic, and have the form K*(r) ∝ (r/d)^ν K_ν(r/d), where K_ν is a modified Bessel function of the second kind [1]. This family of correlation functions is obtained from spectral densities of the form f(ω) = ϕ(α² + ω²)^{−ν−1/2}. Since the resulting process can be shown to be ⌈ν⌉ − 1 times differentiable, ν can be thought of as a smoothness parameter. The ability to specify smoothness is a significant feature of the Matérn family, especially in comparison to the power exponential family, which is either nowhere differentiable (0 < p_0 < 2) or infinitely differentiable (p_0 = 2).
Separable parameterizations of the Matérn family also exist, but the current version of tgp supports only the isotropic parameterization, for fixed ν.Future versions will allow ν to be estimated, and support both isotropic and separable parameterizations.
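A sketch of one common isotropic Matérn parameterization, using base R's modified Bessel function. The scaling convention here is an assumption for illustration; tgp's internal convention may differ.

```r
## rho(r) = 2^(1-nu) / gamma(nu) * (r/d)^nu * K_nu(r/d)
matern <- function(r, d = 1, nu = 3/2) {
  t <- r / d
  ifelse(t == 0, 1, 2^(1 - nu) / gamma(nu) * t^nu * besselK(t, nu))
}

matern(c(0, 0.1, 1, 3))  # decays from 1 toward 0 with distance
```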

Prediction and Adaptive Sampling
The predicted value of z(x) is normally distributed with mean ẑ(x) = f⊤(x)β̃ + k⊤(x)K⁻¹(Z − Fβ̃) and variance σ̂(x)² = σ²[K(x, x) − k⊤(x)K⁻¹k(x)], where β̃ is the posterior mean estimate of β, f⊤(x) = (1, x⊤), and k(x) is an n-vector with k_j(x) = K(x, x_j), for all x_j ∈ X. Notice that σ̂(x)² does not directly depend on the observed responses Z. These equations are often called the kriging equations [22]. The ALM algorithm [21] is implemented with MCMC by computing the norm (or width) of predictive quantiles obtained by sampling from the Normal distribution given above. The ALC algorithm [6] computes the reduction in variance obtained when a candidate location x ∈ X̃ is added into the data, averaged over a reference set Ỹ, which is easily computed using MCMC methods [16]. In the tgp package, the reference set is taken to be the same as the candidate set, i.e., Ỹ = X̃.
The Efficient Global Optimization (EGO) algorithm [20] is based on a statistic which captures the expected improvement (EI) in the model's ability to predict the spatial location of the global minimum. If f_min is the current minimum, e.g., f_min = min{z_1, ..., z_n}, then the improvement at the point x can reasonably be encoded as max{f_min − Z(x), 0}, whose expectation can be shown to work out to EI(x) = (f_min − ẑ)Φ((f_min − ẑ)/σ̂) + σ̂ϕ((f_min − ẑ)/σ̂), where ẑ and σ̂ = √σ̂² are taken from the equations for the posterior predictive distribution (6). Φ and ϕ are the standard Normal cumulative distribution and probability density functions, respectively.
The tgp package computes the expectation in (9) via MCMC samples from the improvement max{f_min − Z(x), 0} at locations x ∈ X. However, the method uses min_{i=1,...,n} Z(x_i), a sample from the first order statistic of the posterior predictive distribution at the inputs x ∈ X, in place of f_min. An exception is when the argument pred.n=FALSE is provided, instructing tgp not to sample from the posterior predictive distribution at the input locations X. In this case, the original closed-form EI formula (10) is used.
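The closed-form EI expression (10) is simple to code directly. The function below is a standalone sketch, not the package's internal implementation:

```r
## EI at a point with predictive mean zhat and sd s, given current min fmin
ei <- function(zhat, s, fmin) {
  u <- (fmin - zhat) / s
  (fmin - zhat) * pnorm(u) + s * dnorm(u)
}

ei(zhat = 0.0, s = 0.1, fmin = -0.1)  # little expected improvement
ei(zhat = 0.0, s = 1.0, fmin = -0.1)  # more, due to higher uncertainty
```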

GPs and Limiting linear models
A special limiting case of the Gaussian process model is the standard linear model. Replacing the top (likelihood) line in the hierarchical model (1) with Z ∼ N(Fβ, σ²(1 + g)I), where I is the n × n identity matrix, gives a parameterization of a linear model. From a phenomenological perspective, GP regression is more flexible than standard linear regression in that it can capture nonlinearities in the interaction between covariates (x) and responses (z). From a modeling perspective, the GP can be more than just overkill for linear data. Parsimony and over-fitting considerations are just the tip of the iceberg. It is also unnecessarily computationally expensive, as well as numerically unstable. Specifically, it requires the inversion of a large covariance matrix, an operation whose computing cost grows with the cube of the sample size. Moreover, large finite d parameters can be problematic from a numerical perspective because, unless g is also large, the resulting covariance matrix can be numerically singular when the off-diagonal elements of K are nearly one. Bayesians can take advantage of the limiting linear model (LLM) by constructing a prior for the "mixture" of the GP with its LLM [15]. The key idea is an augmentation of the parameter space by m_X indicators b = {b_i}_{i=1}^{m_X} ∈ {0, 1}^{m_X}. The boolean b_i is intended to select either the GP (b_i = 1) or its LLM for the i-th dimension. The actual range parameter used by the correlation function is multiplied component-wise by b, i.e., b_i d_i.
To encode the preference that GPs with larger range parameters be more likely to "jump" to the LLM, the prior on b_i is specified as a function of the range parameter d_i.

(Figure 2 caption fragment: there is truncation in the left-most bin, which rises to about 6.3.)
Probability mass functions which increase as a function of d_i, with 0 < γ and 0 ≤ θ_1 ≤ θ_2 < 1, can encode such a preference by calling for the exclusion of dimensions i with large d_i when constructing K*. Thus b_i determines whether the GP or the LLM is in charge of the marginal process in the i-th dimension. Accordingly, θ_1 and θ_2 represent minimum and maximum probabilities of jumping to the LLM, while γ governs the rate at which p(b_i = 0|d_i) grows to θ_2 as d_i increases. Figure 2 plots p(b_i = 0|d_i) for (γ, θ_1, θ_2) = (10, 0.2, 0.95), superimposed on a convenient p(d_i) which is taken to be a mixture of Gamma distributions, representing a population of GP parameterizations for wavy surfaces (small d) and a separate population of those which are quite smooth or approximately linear. The θ_2 parameter is taken to be strictly less than one so as not to preclude a GP which models a genuinely nonlinear surface using an uncommonly large range setting.
The implied prior probability of the full m_X-dimensional LLM is the product of the p(b_i = 0|d_i) over i = 1, ..., m_X. Notice that the resulting process is still a GP if any of the booleans b_i are one. The primary computational advantage associated with the LLM is foregone unless all of the b_i's are zero. However, the intermediate result offers increased numerical stability and represents a unique transitionary model lying somewhere between the GP and the LLM. It allows for the implementation of a semiparametric stochastic process like Z(x) = βf(x) + ε(x), representing a piecemeal spatial extension of a simple linear model. The first part (βf(x)) of the process is linear in some known function of the full set of covariates x = {x_i}_{i=1}^{m_X}, and ε(·) is a spatial random process (e.g., a GP) which acts on a subset of the covariates x′. Such models are commonplace in the statistics community [9]. Traditionally, x′ is determined and fixed a priori. The separable boolean prior (11) implements an adaptively semiparametric process where the subset x′ = {x_i : b_i = 1, i = 1, ..., m_X} is given a prior distribution, instead of being fixed.

Prediction and Adaptive Sampling under LLM
Prediction under the limiting GP model is a simplification of (6) when it is known that K = (1 + g)I. It can be shown [15] that the predicted value of z at x is normally distributed with mean ẑ(x) and a variance that is much cheaper to compute. Applying the ALC algorithm under the LLM also offers computational savings. Starting with the predictive variance given in (6), the expected reduction in variance under the LM [16] is similarly preferred over (8). The statistic for expected improvement (EI; about the minimum) is the same under the LLM as (10) for the GP. Of course, it helps to use the linear predictive equations instead of the kriging ones for ẑ(x) and σ̂²(x).

Figure 3: An example tree T with two splits, resulting in three regions, shown in a diagram (left) and pictorially (right). The notation X[:, u] < s represents a subsetting of the design matrix X by selecting the rows whose u-th column is less than s, i.e., rows {i : x_iu < s}, so that X_1 contains the rows I_1 of X satisfying the first split. The responses are subsetted similarly, so that Z_1 contains the I_1 elements of Z.

Treed partitioning
Nonstationary models are obtained by treed partitioning and inferring a separate model within each region of the partition. Treed partitioning is accomplished by making (recursive) binary splits on the value of a single variable, so that region boundaries are parallel to coordinate axes. Partitioning is recursive, so each new partition is a sub-partition of a previous one. Since variables may be revisited, there is no loss of generality by using binary splits, as multiple splits on the same variable are equivalent to a non-binary split. Figure 3 shows an example tree. In this example, region D_1 contains x's whose u_1 coordinate is less than s_1 and whose u_2 coordinate is less than s_2. Like D_1, D_2 has x's whose coordinate u_1 is less than s_1, but differs from D_1 in that the u_2 coordinate must be bigger than or equal to s_2. Finally, D_3 contains the rest of the x's, differing from those in D_1 and D_2 because the u_1 coordinate of its x's is greater than or equal to s_1. The corresponding response values (z) accompany the x's of each region.
These sorts of models are often referred to as Classification and Regression Trees (CART) [2]. CART has become popular because of its ease of use, clear interpretation, and ability to provide a good fit in many cases. The Bayesian approach is straightforward to apply to tree models, provided that one can specify a meaningful prior for the size of the tree. The trees implemented in the tgp package follow Chipman et al. [4], who specify the prior through a tree-generating process. Starting with a null tree (all data in a single partition), the tree, T, is probabilistically split recursively, with each partition, η, being split with probability p_split(η, T) = a(1 + q_η)^{−b}, where q_η is the depth of η in T and a and b are parameters chosen to give an appropriate size and spread to the distribution of trees.
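The split probability is a simple two-parameter function of depth; a quick sketch (the parameter values are illustrative, not the package defaults):

```r
## probability that a node at depth q splits: a * (1 + q)^(-b)
psplit <- function(q, a = 0.5, b = 2) a * (1 + q)^(-b)

round(psplit(0:4), 3)  # 0.500 0.125 0.056 0.031 0.020
## decreasing in depth, discouraging very deep trees
```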
Extending the work of Chipman et al. [5], the tgp package implements a stationary GP with linear trend, or GP LLM, independently within each of the regions depicted by a tree T [14]. Integrating out dependence on T is accomplished by reversible-jump MCMC (RJ-MCMC) via tree operations grow, prune, change, and swap [4]. To keep things simple, proposals for new parameters (via an increase in the number of partitions, through a grow) are drawn from their priors, thus eliminating the Jacobian term usually present in RJ-MCMC. New splits are chosen uniformly from the set of marginalized input locations X. The swap operation is augmented with a rotate option to improve mixing of the Markov chain [14].
There are many advantages to partitioning the input space into regions, and fitting separate GPs (or GP LLMs) within each region. Partitioning allows for the modeling of nonstationary behavior, and can ameliorate some of the computational demands by fitting models to less data. Finally, fully Bayesian model averaging yields a uniquely efficient nonstationary, nonparametric, or semiparametric (in the case of the GP LLM) regression tool. The most general Bayesian treed GP LLM model can facilitate a model comparison between its special cases (LM, CART, treed LM, GP, treed GP, treed GP LLM) through the samples obtained from the posterior distribution.

(Treed) sequential D-optimal design
In the statistics community, sequential data solicitation goes under the general heading of design of experiments. Depending on a choice of utility, different algorithms for obtaining optimal designs can be derived. Choosing the Kullback-Leibler distance between the posterior and prior distributions as the utility leads to what are called D-optimal designs. For GPs with correlation matrix K, this is equivalent to maximizing det(K). Choosing quadratic loss leads to what are called A-optimal designs. An excellent review of Bayesian approaches to the design of experiments is provided by Chaloner & Verdinelli [3]. Other approaches used by the statistics community include space-filling designs, e.g., max-min distance and Latin hypercube (LH) designs [27]. The FIELDS package [10] implements space-filling designs alongside kriging and thin plate spline models.
A hybrid approach to designing experiments employs active learning techniques. The idea is to choose a set of candidate input configurations X (say, a D-optimal or LH design) and a rule for determining which x ∈ X to add into the design next. The ALM algorithm has been shown to approximate maximum expected information designs by choosing the x with the largest predictive variance [21]. The ALC algorithm selects the x maximizing the reduction in squared error averaged over the input space [6]. Seo et al. [29] provide a comparison between ALC and ALM using standard GPs. The EI [20] algorithm can be used to find global minima.
Choosing candidate configurations X (XX in the tgp package), at which to gather ALM, ALC, or EI statistics, is a significant component in the hybrid approach to experimental design. Candidates which are well-spaced relative to themselves, and relative to already sampled configurations, are clearly preferred. Toward this end, a sequential D-optimal design is a good first choice, but it has a number of drawbacks. D-optimal designs require a known parameterization, and are thus not well-suited to MCMC inference. They may not choose candidates in the "interesting" part of the input space, because sampling is already high there. They are ill-suited to partition models, wherein "closeness" may not be measured homogeneously across the input space. Finally, they are computationally costly, requiring many repeated determinant calculations for (possibly) large covariance matrices.
One possible solution to both the computational and nonstationary modeling issues is to use treed sequential D-optimal design [16], where separate sequential D-optimal designs are computed in each of the partitions depicted by the maximum a posteriori (MAP) tree T. The number of candidates selected from each region can be proportional to the volume of, or to the number of grid locations in, the region. MAP parameters θ|T, or "neutral" or "exploration encouraging" ones, can be used to create the candidate design, a common practice [27]. Small range parameters, for learning about the wiggliness of the response, and a modest nugget parameter, for numerical stability, tend to work well together.
Finding a local maximum is generally sufficient to get well-spaced candidates. The dopt.gp function uses a stochastic ascent algorithm to find local maxima without calculating too many determinants. This works well with ALM and ALC. However, it is less than ideal for EI, as will be illustrated in Section 3.6. Adaptive sampling from EI (with tgp) is still an open area of research.

Examples using tgp
The following subsections take the reader through a series of examples based, mostly, on synthetic data. At least two different b* models are fit for each set of data, offering comparisons and contrasts. Duplicating these examples in your own R session is highly recommended. The Stangle function can help extract executable R code from this document. For example, the code for the exponential data of Section 3.3 can be extracted with one command:

> Stangle(vignette("exp", package="tgp")$file)

This will write a file called "exp.R". Additionally, each of the subsections that follow is available as an R demo. Try demo(package="tgp") for a listing of available demos. To invoke the demo for the exponential data of Section 3.3, try demo(exp, package="tgp"). This is equivalent to source("exp.R"), because the demos were created by running Stangle on the source files of this document. Each subsection (and each subsection of the appendix) starts by seeding the random number generator with set.seed(0). This is done to make the results and analyses reproducible within this document, and in demo form. I recommend you try these examples with different seeds and see what happens. Usually the results will be similar, but sometimes (especially when the data (X, Z) are generated randomly) they may be quite different.
Other successful uses of the methods in this package include applications to the Boston housing data [18,14], and designing an experiment for a reusable NASA launch vehicle [13,16] called the Langley glide-back booster (LGBB).

1-d Linear data
Consider data sampled from a linear model.
The following R code takes a sample {X, Z} of size N = 50 from (15). It also chooses N′ = 99 evenly spaced predictive locations XX.
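Since equation (15) is not reproduced here, the following base-R sketch uses a hypothetical stand-in linear model (intercept 1, slope 2, noise sd 0.25) just to show the shape of the inputs that the b* functions expect.

```r
# Hypothetical stand-in for (15): Z = 1 + 2x + N(0, 0.25^2).
set.seed(0)
N <- 50
X  <- matrix(seq(0, 1, length = N), ncol = 1)   # inputs
Z  <- 1 + 2 * X[, 1] + rnorm(N, sd = 0.25)      # noisy linear responses
XX <- matrix(seq(0, 1, length = 99), ncol = 1)  # N' = 99 predictive locations
# a fit would then look like: blm(X=X, Z=Z, XX=XX)  (requires the tgp package)
```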
The generic plot method can be used to visualize the fitted posterior predictive surface (with option layout='surf') in terms of means and credible intervals. Figure 4 shows how to do it, and what you get. The default option layout='both' shows both a predictive surface and an error (or uncertainty) plot, side by side. The error plot can be obtained alone via layout='as'.
Examples of these layouts appear later.
If, say, you were unsure about the apparent linearity of this data, you might try a GP LLM (using bgpllm) and let a more flexible model speak as to the linearity of the process.

1-d Synthetic Sine Data
Consider 1-dimensional simulated data which is partly a mixture of sines and cosines, and partly linear.
Figure 7 shows the resulting posterior predictive surface (top) and trees (bottom). The MAP partition (T) is also drawn onto the surface plot (top) in the form of vertical lines. The treed LM captures the smoothness of the linear region just fine, but comes up short in the sinusoidal region, doing the best it can with piecewise linear models.
The ideal model for this data is the Bayesian treed GP because it can be both smooth and wiggly.
Finally, speedups can be obtained if the GP is allowed to jump to the LLM [14], since half of the response surface is very smooth, or linear. This is not shown here since the results are very similar to those above, obtained by replacing btgp with btgpllm. Each of the models fit in this section is a special case of the treed GP LLM, so a model comparison is facilitated by fitting this more general model. The example in the next subsection offers such a comparison for 2-d data. A follow-up in Appendix B.1 shows how to use parameter traces to extract the posterior probability of linearity in regions of the input space.

Synthetic 2-d Exponential Data
The next example involves a two-dimensional input space. The true response is given by (17); a small amount of Gaussian noise (with sd = 0.001) is added. Besides its dimensionality, a key difference between this data set and the last one is that it is not defined using step functions; this smooth function does not have any artificial breaks between regions. The tgp package provides a function for data subsampled from a grid of inputs and outputs described by (17) which concentrates inputs (X) more heavily in the first quadrant, where the response is more interesting. Predictive locations (XX) are the remaining grid locations.

> exp2d.data <- exp2d.rand()
> X <- exp2d.data$X; Z <- exp2d.data$Z
> XX <- exp2d.data$XX
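For readers without the package at hand, a rough stand-in for this kind of data can be built directly. The functional form below (z = x1 * exp(-x1^2 - x2^2)) and the [-2,6] grid are assumptions for illustration, not necessarily the exact (17).

```r
# Assumed stand-in for exponential data of the type in (17), on an 11x11 grid.
set.seed(0)
f <- function(x1, x2) x1 * exp(-x1^2 - x2^2)
grid1 <- seq(-2, 6, length = 11)
Xgrid <- expand.grid(x1 = grid1, x2 = grid1)
Zgrid <- f(Xgrid$x1, Xgrid$x2) + rnorm(nrow(Xgrid), sd = 0.001)
```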
The treed LM is clearly just as inappropriate for this data as it was for the sinusoidal data in the previous section. However, a stationary GP fits this data just fine. After all, the process is quite well behaved. In two dimensions one has a choice between the isotropic and separable correlation functions. Separable is the default in the tgp package. For illustrative purposes here, I shall use the isotropic power family.

> exp.bgp <- bgp(X=X, Z=Z, XX=XX, corr="exp", verb=0)
Progress indicators are suppressed with verb=0. Figure 9 shows the resulting posterior predictive surface under the GP in terms of means (left) and variances (right) in the default layout. The sampled locations (X) are shown as dots on the right image plot. Predictive locations (XX) are circles. Predictive uncertainty for the stationary GP model is highest where sampling is lowest, even though the process is very uninteresting there.
A treed GP seems more appropriate for this data. It can separate out the large uninteresting part of the input space from the interesting part. The result is speedier inference and region-specific estimates of predictive uncertainty.

> exp.btgp <- btgp(X=X, Z=Z, XX=XX, corr="exp", verb=0)
Figure 10 shows the resulting posterior predictive surface (top) and trees (bottom). Typical runs of the treed GP on this data find two, and if lucky three, partitions. As might be expected, jumping to the LLM for the uninteresting, zero-response part of the input space can yield even further speedups [14]. Also, Chipman et al. recommend restarting the Markov chain a few times in order to better explore the marginal posterior for T [5]. This can be important for higher dimensional inputs requiring deeper trees. The tgp default is R = 1, i.e., one chain with no restarts. Here two chains (one restart) are obtained using R = 2.

Motorcycle Accident Data
The Motorcycle Accident Dataset [30] is a classic nonstationary data set used in recent literature [26] to demonstrate the success of nonstationary models. The data consists of measurements of the acceleration of the head of a motorcycle rider as a function of time in the first moments after an impact. In addition to being nonstationary, the data has input-dependent noise (heteroskedasticity), which makes it useful for illustrating how the treed GP model handles this nuance. There are at least two, perhaps three, regions where the response exhibits different behavior, both in terms of the correlation structure and the noise level.
The data is included as part of the MASS library in R.

> library(MASS)
> X <- data.frame(times=mcycle[,1])
> Z <- data.frame(accel=mcycle[,2])
Figure 13 shows how a stationary GP is able to capture the nonlinearity in the response but fails to capture the input dependent noise and increased smoothness (perhaps linearity) in parts of the input space.
> plot(moto.bgp, main='GP,', layout='surf')

A Bayesian Linear CART model is able to capture the input-dependent noise but fails to capture the waviness of the "whiplash" center segment of the response.
A treed GP model seems appropriate because it can model input-dependent smoothness and noise. A treed GP LLM is probably most appropriate, since the left-hand part of the input space is likely linear. One might further hypothesize that the right-hand region is also linear, perhaps with the same mean as the left-hand region, only with higher noise. The b* functions can force an i.i.d. hierarchical linear model by setting bprior="b0". The tgp default bprior="bflat" implies an improper prior on the regression coefficients β. It essentially forces W = ∞, thus eliminating the need to specify priors on β0 and W−1 in (1). This was chosen as the default because it works well in many examples, and leads to a simpler overall model and a faster implementation. However, the Motorcycle data is an exception. Moreover, when the response data is very noisy (i.e., low signal-to-noise ratio), tgp can be expected to partition heavily under the bprior="bflat" prior. In such cases, one of the other proper priors, like the full hierarchical bprior="b0" or bprior="bmzt", might be preferred.

> plot(moto.btlm, main='Bayesian CART,', layout='surf')
An anonymous reviewer pointed out a shortcoming of the treed GP model on this data. The sharp spike in predictive variance near the first regime shift suggests that the symmetric Gaussian noise model may be inappropriate. A log Gaussian process might offer an improvement, at least locally. Running the treed GP MCMC for longer will eventually result in the finding of a partition near time=17, just after the first regime change.

> par(mfrow=c(1,2))
> plot(moto.btgpllm, main='treed GP LLM,', layout='surf')
> plot(moto.btgpllm.p, center='km', layout='surf')

The variance is still poorly modeled in this region. Since it is isolated by the tree, it could potentially be fit with a different noise model.

Friedman data
This Friedman data set is the first of a suite used to illustrate MARS (Multivariate Adaptive Regression Splines) [11]. There are 10 covariates in the data (x = {x1, x2, ..., x10}). The function that describes the responses (Z), observed with standard Normal noise, has the mean given in (18), but depends only on {x1, ..., x5}, thus combining nonlinear, linear, and irrelevant effects. Comparisons are made on this data to results provided for several other models in recent literature. Chipman et al. [5] used this data to compare their treed LM algorithm to four other methods of varying parameterization: linear regression, greedy tree, MARS, and neural networks. The statistic they use for comparison is root mean-square error (RMSE), where ẑi is the model-predicted response for input xi. The x's are randomly distributed on the unit interval. Input data, responses, and predictive locations of size N = 200 and N′ = 1000, respectively, can be obtained by a function included in the tgp package. This example compares Bayesian treed LMs with the Bayesian GP LLM (not treed), following the RMSE experiments of Chipman et al. It helps to scale the responses so that they have a mean of zero and a range of one. First, fit the Bayesian treed LM, and obtain the RMSE.
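The RMSE statistic and the mean-zero, range-one rescaling can be sketched in base R. The m0r1 form below is my reading of that transformation, shown for illustration rather than as tgp's exact internal code.

```r
# RMSE between held-out responses z and model predictions zhat,
# plus a mean-zero / range-one rescaling (assumed form of tgp's m0r1).
rmse <- function(z, zhat) sqrt(mean((z - zhat)^2))
m0r1 <- function(z) (z - mean(z)) / diff(range(z))

z <- c(2, 4, 6, 8)
s <- m0r1(z)   # rescaled responses: mean zero, range one
```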
Parameter traces need to be gathered in order to judge the ability of the GP LLM model to identify linear and irrelevant effects.

> XX1 <- matrix(rep(0,10), nrow=1)
> fr.bgpllm.tr <- bgpllm(X=X, Z=Z, XX=XX1, pred.n=FALSE, trace=TRUE,
+     m0r1=FALSE, verb=0)

Here, m0r1=FALSE has been specified instead so that the β estimates provided below will be on the original scale. A summary of the parameter traces shows that the Markov chain had the following (average) configuration for the booleans.

> trace <- fr.bgpllm.tr$trace$XX[[1]]
> apply(trace[,27:36], 2, mean)
b1 b2 b3 b4 b5 b6 b7 b8 b9 b10

Therefore the GP LLM model correctly identified that only the first three input variables interact nonlinearly with the response. This agrees with the dimension-wise estimate of the total area of the input domain under the LLM (out of a total of 10 input variables).
> mean(fr.bgpllm.tr$trace$linarea$ba)
[1] 7

A similar summary of the parameter traces for β shows that the GP LLM correctly identified the linear regression coefficients associated with the fourth and fifth input covariates (from (18)).
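The boolean summaries above amount to averaging indicator traces. A simulated sketch (the trace matrix here is fabricated for illustration, not taken from tgp) shows the computation:

```r
# Simulated boolean traces b (1 = GP/nonlinear, 0 = under the LLM):
# dimensions 1-3 mostly nonlinear, dimensions 4-10 mostly linear.
set.seed(0)
nS <- 1000   # number of (kept) MCMC samples
b <- cbind(matrix(rbinom(3 * nS, 1, 0.97), nS, 3),
           matrix(rbinom(7 * nS, 1, 0.03), nS, 7))
p_gp  <- colMeans(b)    # P(dimension interacts nonlinearly)
p_llm <- 1 - p_gp       # P(dimension is under the LLM)
```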

Adaptive Sampling
In this section, sequential design of experiments, a.k.a. adaptive sampling, is demonstrated on the exponential data of Section 3.3. Gathering the data again: in contrast with the data from Section 3.3, which was based on a grid, the code generates a randomly subsampled D-optimal design X from LH candidates, and random responses Z. As before, design configurations are more densely packed in the interesting region. Candidates XX are drawn from a large LH sample.
Given some data {X, Z}, the first step in sequential design using tgp is to fit a treed GP LLM model to the data, without prediction, in order to infer the MAP tree T. Figure 17 shows the sampled XX locations (circles) amongst the input locations X (dots) and MAP partition (T). Notice how the candidates XX are spaced out relative to themselves, and relative to the inputs X, unless they are near partition boundaries. The placing of configurations near region boundaries is a symptom particular to D-optimal designs. This is desirable for experiments with tgp models, as model uncertainty is usually high there [3].

> exp1 <- btgpllm(X=X, Z=Z, pred.n=FALSE)
> plot(exp1$X, pch=19, cex=0.5)

Now, the idea is to fit the treed GP LLM model again in order to assess uncertainty in the predictive surface at those new candidate design points. The following code gathers all three adaptive sampling statistics: ALM, ALC, and EI.

> exp.as <- btgpllm(X=X, Z=Z, XX=XX, corr="exp", improv=TRUE,
+     Ds2x=TRUE, R=5, verb=0)

Figure 18 shows the posterior predictive estimates of the adaptive sampling statistics. The error surface, on the left, summarizes posterior predictive uncertainty by a norm of quantiles.
> par(mfrow=c(1,3), bty="n")
> plot(exp.as, main="tgpllm,", layout="as", as="alm")
> plot(exp.as, main="tgpllm,", layout="as", as="alc")
> plot(exp.as, main="tgpllm,", layout="as", as="improv")

In accordance with the ALM algorithm, candidate locations XX with the largest predictive error would be sampled (added into the design) next. These are most likely to be in the interesting region, i.e., the first quadrant. However, these results depend heavily on the clumping of the original design in the uninteresting areas, and on the estimate of T. Adaptive sampling via the ALC or EI (or both) algorithms proceeds similarly, following the surfaces shown in the center and right panels of Figure 18.
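The ALM rule itself is simple: pick the candidate with the widest predictive uncertainty. A toy base-R sketch follows, with simulated predictive draws and a 5%-95% quantile width standing in for tgp's norm-of-quantiles summary; none of this is tgp output.

```r
# Toy ALM: choose the candidate location with the widest 90% interval.
set.seed(0)
XXc <- seq(0, 1, length = 20)                               # candidates
draws <- sapply(XXc, function(x) rnorm(500, sd = 0.1 + x))  # noisier rightward
q <- apply(draws, 2, quantile, probs = c(0.05, 0.95))
alm <- q[2, ] - q[1, ]                                      # interval widths
next_x <- XXc[which.max(alm)]                               # ALM pick
```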

A Implementation notes
The treed GP model is coded in a mixture of C and C++: C++ for the tree data structure (T) and C for the GP at each leaf of T. The code has been tested on Unix (Solaris, Linux, FreeBSD, OSX) and Windows (2000, XP) platforms.
It is useful to first translate and re-scale the input data (X) so that it lies in an mX-dimensional unit cube. This makes it easier to construct prior distributions for the width parameters to the correlation function K(·, ·). Proposals for all parameters which require MH sampling are taken from a uniform "sliding window" centered around the location of the last accepted setting. For example, a proposal for a new nugget parameter gν to the correlation function K(·, ·) in region rν would be drawn from such a window. Calculating the corresponding forward and backward proposal probabilities for the MH acceptance ratio is straightforward.
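A minimal sketch of such a sliding-window proposal, assuming a window of the form (3g/4, 4g/3), which is an illustration and not necessarily the exact window tgp uses:

```r
# Sketch of a uniform sliding-window proposal for a positive parameter
# such as the nugget g; the (3g/4, 4g/3) window is an assumption.
propose <- function(g) runif(1, 3 * g / 4, 4 * g / 3)
q_dens  <- function(to, from) dunif(to, 3 * from / 4, 4 * from / 3)

set.seed(0)
g <- 0.1
gstar <- propose(g)
# forward/backward correction entering the MH acceptance ratio:
ratio_correction <- q_dens(g, gstar) / q_dens(gstar, g)
```

Because the window scales with the current value, the proposal is asymmetric, which is why the q-density ratio must appear in the acceptance probability.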
For more details about the MCMC algorithm and proposals, etc., please see the original technical report on Bayesian treed Gaussian process models [14].

B Interfaces and features
The following subsections describe some of the ancillary features of the tgp package, such as the gathering and summarizing of MCMC parameter traces, the progress meter, and an example of how to use the predict.tgp function in a collaborative setting.

B.1 Parameter traces
Traces of (almost) all parameters of the tgp model can be collected by supplying trace=TRUE to the b* functions. In the current version, traces for the linear prior correlation matrix (W) are not provided. I shall illustrate the gathering and analyzing of traces through example. But first, a few notes and cautions.
Models which involve treed partitioning may have more than one base model (GP or LM). The process governing a particular input x depends on the coordinates of x. As such, tgp records region-specific traces of parameters to GP (and linear) models at the locations enumerated in the XX argument. Even traces of single-parameter Markov chains can require hefty amounts of storage, so recording traces at each of the XX locations can be an enormous memory hog. A related warning will be given if the product of |XX|, (BTE[2]-BTE[1])/BTE[3], and R is beyond a threshold. The easiest way to keep the storage requirements for traces down is to control the size of XX and the thinning level BTE[3]. Finally, traces for most of the parameters are stored in output files. The contents of the trace files are read into R and stored as data.frame objects, and the files are removed. The existence of partially written trace files in the current working directory (CWD) while the C code is executing means that no more than one tgp run (with trace=TRUE) should be active in the CWD at one time.
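The storage arithmetic behind that warning is straightforward. The threshold itself is internal to tgp, but the per-parameter sample count is roughly:

```r
# Approximate number of stored trace samples per traced parameter:
# one per XX location, per kept-and-thinned MCMC round, per repeat.
trace_samples <- function(n_XX, BTE, R) n_XX * (BTE[2] - BTE[1]) / BTE[3] * R
n_stored <- trace_samples(n_XX = 3, BTE = c(2000, 7000, 2), R = 10)
```

With three XX locations, 5000 post-burn-in rounds thinned by 2, and 10 repeats, that is 75,000 samples per traced parameter, which illustrates why keeping |XX| small matters.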
Consider again the exponential data. For illustrative purposes I chose XX locations (where traces are gathered) to be (1) in the interior of the interesting region, (2) on/near the plausible intersection of partition boundaries, and (3) in the interior of the flat region. The hierarchical prior bprior="b0" is used to leverage a (prior) belief that most of the input domain is uninteresting.
We now fit a treed GP LLM and gather traces, and also gather EI and ALC statistics for the purposes of illustration.Prediction at the input locations X is turned off to be thrifty.
> out <- btgpllm(X=X, Z=Z, XX=XX, corr="exp", bprior="b0",
+     pred.n=FALSE, Ds2x=TRUE, R=10,
+     trace=TRUE, verb=0)

Figure 19 shows a dump of out$trace, which is a "tgptraces"-class object. It depicts the full set of parameter traces broken down into the elements of a list: $XX with GP/LLM parameter traces for each XX location (the parameters are listed); $hier with traces for (non-input-dependent) hierarchical parameters (listed); $linarea recording proportions of the input space under the LLM; $parts with the boundaries of all partitions visited; $post containing (log) posterior probabilities; and $preds containing traces of samples from the posterior predictive distribution and adaptive sampling statistics. Plots of traces are useful for assessing the mixing of the Markov chain. For example, Figure 20 plots traces of the range parameter (d) for each of the 3 predictive locations XX. It is easy to see which of the locations are in the same partition as others, and which have smaller range parameters than others.
The mean area under the LLM can be calculated from the $linarea trace. This means that the expected proportion of the input domain under the full LLM is 0.501. Figure 21 shows a histogram of areas under the LLM. The clumps near 0, 0.25, 0.5, and 0.75 can be thought of as representing quadrants (none, one, two, and three) under the LLM. Similarly, we can calculate the probability that each of the XX locations is governed by the LLM. The final column of that summary represents the probability that the corresponding XX location is under the LLM (which is equal to 1-b).
Traces of posterior predictive and adaptive sampling statistics are contained in the $preds field. For example, Figure 22 shows samples of the ALC statistic ∆σ²(x). We can see from the trace that the statistic is generally lowest for XX[

B.2 Explaining the progress meter
The progress meter shows the state of the MCMC as it iterates through the desired number of rounds of burn-in (BTE[1]) and sampling (BTE[2]-BTE[1]), for the requested number of repeats (R-1). The verbosity of progress meter print statements is controlled by the verb argument to the b* functions. Providing verb=0 silences all non-warning (or error) statements. To suppress warnings, try enclosing commands within suppressWarnings(...), or globally set options(warn=-1). See the help file (?options) for more global warning settings.
The default verbosity setting (verb=1) shows all grows and prunes, and a summary of d- (range) parameters for each partition every 1000 rounds. Higher verbosity arguments will show more tree operations, e.g., change and swap. Setting verb=2 will cause an echo of the tgp model parameters and their starting values, but is otherwise the same as verb=1. The maximum, verb=4, shows all successful tree operations. Here is an example grow statement.

**GROW** @depth 2: [0,0.05], n=(10,29)

The **GROW** statement indicates the depth of the split leaf node; the splitting dimension u and location v are shown between square brackets [u,v], followed by the sizes of the two new leaves. The second part of the progress print shows the d- (range) parameters of the separable correlation function. For 2 partitions there are two sets of square brackets. Inside the square brackets are the mX (2 in this case) range parameters for the separable correlation function. Whenever the LLM governs one of the input dimensions a zero will appear. I.e., the placement of 0/0.626 indicates the LLM is active in the 2nd dimension of the 2nd partition; 0.626 is the d- (range) parameter that would have been used if the LLM were inactive. Whenever all dimensions are under the LLM, the d-parameter print is simply [0]. This also happens when forcing the LLM (i.e., for blm and btlm), where [0] appears for each partition. These prints will look slightly different if the isotropic instead of separable correlation is used, since there are not as many range parameters.

B.3 Collaboration with predict.tgp
In this section I revisit the motorcycle accident data in order to demonstrate how the predict.tgp function can be helpful in collaborative uses of tgp. Consider a fit of the motorcycle data, and suppose that we infer the model parameters only (obtaining no samples from the posterior predictive distribution). The "tgp"-class output object can be saved to a file using the R-internal save function.

> library(MASS)
> out <- btgpllm(X=mcycle[,1], Z=mcycle[,2], bprior="b0",
+     pred.n=FALSE, verb=0)
> save(out, file="out.Rsave")
> out <- NULL

Note that there is nothing to plot here because there is no predictive data. (out <- NULL is set for illustrative purposes.) Now imagine e-mailing the "out.Rsave" file to a collaborator who wishes to use your fitted tgp model. S/he could first load in the "tgp"-class object we just saved, design a new set of predictive locations XX, and obtain kriging estimates from the MAP model.

> load("out.Rsave")
> XX <- seq(2.4, 56.7, length=200)
> out.kp <- predict(out, XX=XX, pred.n=FALSE)

Another option would be to sample from the posterior predictive distribution of the MAP model.

> out.p <- predict(out, XX=XX, pred.n=FALSE, BTE=c(0,1000,1))
This holds the parameterization of the tgp model fixed at the MAP, and samples from the GP or LM posterior predictive distributions at the leaves of the tree.
Finally, the MAP parameterization can be used as a jumping-off point for more sampling from the joint posterior and posterior predictive distribution.

> out2 <- predict(out, XX, pred.n=FALSE, BTE=c(0,2000,2), MAP=FALSE)
> plot(out.kp, center="km", as="ks2")

Since the return value of a predict.tgp call is also a "tgp"-class object, the process can be applied iteratively. That is, out2 can also be passed to predict.tgp. Figure 23 plots the posterior predictive surfaces for each of the three calls to predict.tgp above. The kriging surfaces are smooth within regions of the partition, but the process is discontinuous across partition boundaries. The middle surface is simply a Monte Carlo sample summarization of the kriging one above it. The final surface summarizes samples from the posterior predictive distribution when obtained jointly with samples from T|θ and θ|T. Though these summaries are still "noisy" they depict a process with smoother transitions across partition boundaries than ones conditioned only on the MAP parameterization.
Finally, the predict.tgp function can also sample from the ALC statistic and calculate expected improvements (EI) at the XX locations. While the function was designed with prediction in mind, it is actually far more general. It allows a continuation of MCMC sampling where the b* function left off (when MAP=FALSE), with a possibly new set of predictive locations XX. The intended use of this function is to obtain quick kriging-style predictions for a previously fit MAP estimate (contained in a "tgp"-class object) on a new set of predictive locations XX. However, it can also be used simply to extend the search for an MAP model when MAP=FALSE, pred.n=FALSE, and XX=NULL.

C Configuration and performance optimization
In what follows I describe customizations and enhancements that can be made to tgp at compile time in order to take advantage of custom computing architectures.The compilation of tgp with a linear algebra library different from the one used to compile R (e.g., ATLAS), and the configuration and compilation of tgp with parallelization is described in detail.

C.1 Linking to ATLAS
ATLAS [32] is supported as an alternative to standard BLAS and LAPACK for fast, automatically tuned, linear algebra routines. If you know that R has already been linked to tuned linear algebra libraries (e.g., on OSX), then compiling with ATLAS as described below is unnecessary; just install tgp as usual. As an alternative to linking tgp to ATLAS directly, one could re-compile all of R, linking it to ATLAS or some other platform-specific BLAS/LAPACK, e.g., Intel's Math Kernel Library or AMD's Core Math Library, as described in: http://cran.r-project.org/doc/manuals/R-admin.html (look for the section titled "Linear Algebra"). While this is arguably the best solution, since all of R benefits, the task can prove challenging to accomplish and may require administrator (root) privileges. Linking tgp with ATLAS directly is described here.
GP models implemented in tgp can get a huge benefit from tuned linear algebra libraries, since the MCMC requires many large matrix multiplications.

C.2 Parallelization with Pthreads

1. Add -DPARALLEL to PKG_CXXFLAGS of src/Makevars.
2. You may need to add -pthread to PKG_LIBS of src/Makevars, or whatever is needed by your compiler in order to correctly link code with Pthreads.
The biggest improvement in the parallel version, over the serial, is observed when calculating ALC statistics, which require O(n2^2) time for n2 predictive locations, or when calculating ALM (default) or EI statistics on predictive locations whose number (n2) is at least an order of magnitude larger (n2 ≫ n1) than the number of input locations (n1).
Parallel sampling of the posterior of θ|T for each of the {θ ν } R ν=1 is also possible.However, the speed-up in this second case is less impressive, and so is not supported by the current version of the tgp package.

Figure 2 :
Figure 2: Prior distribution for the boolean (b) superimposed on p(d). There is truncation in the left-most bin, which rises to about 6.3.

Figure 4 :
Figure 4: Posterior predictive distribution using blm on synthetic linear data: mean and 90% credible interval. The actual generating lines are shown as blue-dotted.

Figure 5 :
Figure 5: Posterior predictive distribution using bgpllm on synthetic linear data: mean and 90% credible interval. The actual generating lines are shown as blue-dotted.

Figure 6 :
Figure 6: Posterior predictive distribution using bgp on synthetic sinusoidal data: mean and 90% pointwise credible interval. The true mean is overlayed with a dashed line.

Figure 7 :
Figure 7: Top: posterior predictive distribution using btlm on synthetic sinusoidal data: mean and 90% pointwise credible interval, and MAP partition (T). The true mean is overlayed with a dashed line. Bottom: MAP trees for each height encountered in the Markov chain, showing σ² and the number of observations n at each leaf.

Figure 8 :
Figure 8: Posterior predictive distribution using btgp on synthetic sinusoidal data: mean and 90% pointwise credible interval, and MAP partition (T). The true mean is overlayed with a dashed line.

Figure 9 :
Figure 9: Left: posterior predictive mean using bgp on synthetic exponential data; right: image plot of posterior predictive variance with data locations X (dots) and predictive locations XX (circles).

Figure 10 :
Figure 10: Top-left: posterior predictive mean using btgp on synthetic exponential data; top-right: image plot of posterior predictive variance with data locations X (dots) and predictive locations XX (circles). Bottom: MAP trees of each height encountered in the Markov chain with σ² and the number of observations n at the leaves.

Figure 13 :
Figure 13: Posterior predictive distribution using bgp on the motorcycle accident data: mean and 90% credible interval

Figure 14 :
Figure 14: Posterior predictive distribution using btlm on the motorcycle accident data: mean and 90% credible interval

Figure 15 :
Figure 15: Posterior predictive distribution using the treed GP LLM on the motorcycle accident data. The left-hand panes show the mean and 90% credible interval (top) and the quantile-norm (90%-5%) showing input-dependent noise (bottom). The right-hand panes show similar kriging surfaces for the MAP parameterization.

Figure 16 :
Figure 16: MAP trees of each height encountered in the Markov chain for the exponential data, showing σ² and the number of observations n at the leaves. T is the one with the maximum log(p) above.

Figure 17 :
Figure 17: Treed D-optimal candidate locations XX (circles), input locations X (dots), and MAP tree T

Figure 18 :
Figure 18: Image plots of adaptive sampling statistics and MAP trees T. Left: ALM adaptive sampling image for (only) candidate locations XX (circles); center: ALC; right: EI.

Figure 20 :
Figure 20: Traces of the (log of the) first range parameter for each of the three XX locations

Figure 21 :
Figure 21: Histogram of proportions of the area of the input domain under the LLM

Figure 23 :
Figure 23: Predictive surfaces (left) and error/variance plots (right) resulting from three different uses of the predict.tgp function: MAP kriging (top), sampling from the MAP (middle), sampling from the joint posterior and posterior predictive starting from the MAP (bottom).

Table 1 :
Combinations of T, LM/LLM, & GP in Bayesian regression models implemented by the tgp package (see the center column of the table). The details of model specification and inference are contained in Section 2. Each is a fully Bayesian regression model, and in the table they are ordered by some notion of "flexibility". These b* functions, as I call them, are wrappers around the master tgp function, which is an interface to the core C code.