Published by De Gruyter, April 29, 2016

Optimal Spatial Prediction Using Ensemble Machine Learning

  • Molly Margaret Davies and Mark J. van der Laan

Abstract

Spatial prediction is an important problem in many scientific disciplines. Super Learner is an ensemble prediction approach related to stacked generalization that uses cross-validation to search for the optimal predictor amongst all convex combinations of a heterogeneous candidate set. It has been applied to non-spatial data, where theoretical results demonstrate it will perform asymptotically at least as well as the best candidate under consideration. We review these optimality properties and discuss the assumptions required in order for them to hold for spatial prediction problems. We present results of a simulation study confirming Super Learner works well in practice under a variety of sample sizes, sampling designs, and data-generating functions. We also apply Super Learner to a real world dataset.

1 Introduction

Optimal prediction of a spatially indexed variable is a crucial task in many scientific disciplines. For example, environmental health applications concerning air pollution often involve predicting the spatial distribution of pollutants of interest, and many agricultural studies rely heavily on interpolated maps of various soil properties.

Numerous algorithmic approaches to spatial prediction have been proposed (see Cressie [1] and Schabenberger and Gotway [2] for reviews), but selecting the best approach for a given data set remains a difficult statistical problem. One particularly challenging aspect of spatial prediction is that location is often used as a surrogate for large sets of unmeasured spatially indexed covariates. In such instances, effective prediction algorithms capable of capturing local variation must make strong, mostly untestable assumptions about the underlying spatial structure of the sampled surface and can be prone to overfitting. Ensemble predictors that combine the output of multiple predictors can be a useful approach in these contexts, allowing one to consider multiple aggressive predictors. There have been some recent examples of the use of ensemble approaches in the spatial and spatiotemporal literature. For example, Zaier et al. [3] used ensembles of artificial neural networks to estimate the ice thickness of lakes and Chen and Wang [4] used stacked generalization to combine support vector machines classifying land-cover types in hyperspectral imagery. Ensembling techniques have also been used to make spatially indexed risk maps. For example, Rossi et al. [5] used logistic regression to combine a library of four base learners trained on a subset of the observed data to obtain landslide susceptibility forecasts for the central Umbrian region of Italy. Kleiber et al. [6] have developed a Bayesian model averaging technique for obtaining locally calibrated probabilistic precipitation forecasts by combining output from multiple deterministic models.

The Super Learner prediction algorithm is an ensemble approach that combines a user-supplied library of heterogeneous candidate learners in such a way as to minimize ν-fold cross-validated risk [7]. It is a generalization of the stacking algorithm first introduced by Wolpert [8] within the context of neural networks and later adapted by Breiman [9] to the context of variable subset regression. LeBlanc and Tibshirani [10] discuss stacking and its relationship to the model-mix algorithm of Stone [11] and the predictive sample-reuse method of Geisser [12]. The library on which Super Learner trains can include parametric and nonparametric models as well as mathematical models and other ensemble learners. These learners are then combined in an optimal way in the sense that the Super Learner predictor will perform asymptotically as well as or better than any single prediction algorithm in the library under consideration. Super Learner has been used successfully in nonspatial prediction (see for example Polley et al. [13]). In this paper, we review its optimality properties and discuss the assumptions necessary for these optimality properties to hold within the context of spatial prediction. We also present the results of a simulation study, demonstrating that Super Learner works well in practice under a variety of spatial sampling schemes and data-generating distributions. In addition, we apply Super Learner to a real world dataset, predicting water acidity for a set of 112 lakes in the Southeastern United States. We show Super Learner is a practical, data-driven, theoretically supported way to build an optimal spatial prediction algorithm from a large, heterogeneous set of predictors, protecting against both model misspecification and over-fitting.

2 Problem formulation

Consider a random spatial process indexed by location over a fixed, continuous, $d$-dimensional domain, $\{Y(s): s \in D \subset \mathbb{R}^d\}$. For a particular set of distinct sampling points $\{S_1,\ldots,S_n\} \subset D$, we observe $\{(S_i, Y_i): i = 1,\ldots,n\}$, where $Y_i = Y(S_i) + \epsilon_i$ and $\epsilon_i$ represents measurement error for the $i$th observation. For all $i$, we assume $E[Y_i \mid S_i = s] = Y(s)$. Our objective is to predict $Y(s)$ for unobserved locations $s \in D$. Thus, our parameter of interest is the spatial process itself. We do not make any assumptions about the functional form of the spatial process. We do, however, assume that one of the following is true: for all $i$, either

  1. $(S_i, Y_i)$ are independently and identically distributed (i.i.d.), or

  2. $(S_i, Y_i)$ are independent but not identically distributed, or

  3. $Y_i$ are independent given $S_1,\ldots,S_n$, and $E[Y_i \mid S_1,\ldots,S_n] = E[Y_i \mid S_i] = Y(S_i)$. This corresponds to a fixed design.

Each of these sets of assumptions implies that any measurement error is mean zero conditional on $S_i$, or in the case of a fixed design, conditional on $S_1,\ldots,S_n$. It is important to note that $S$ could consist of both location and some additional covariates $W$, i.e. $S = (X, W)$, where $X$ refers to location. In such cases, it may be that measurement error is mean zero conditional on location and covariates, but not on location alone.

While these are reasonable assumptions for many spatial prediction problems, they are nontrivial and may not always be appropriate. For instance, instrumentation and calibration error within sensor networks can result in spatially structured measurement error that is not mean zero given $S_1,\ldots,S_n$. There has been an effort on the part of researchers to develop ways to adapt the cross-validation procedure so as to minimize the effects of this kind of measurement error when choosing parameters such as the bandwidth in local linear regression or smoothing parameters for splines. Interested readers should consult Opsomer et al. [14] and Francisco-Fernandez and Opsomer [15] for overviews.

3 The Super Learner algorithm

Suppose we have observed $n$ copies of the random variable $O$ with true data-generating distribution $P_0 \in \mathcal{M}$, where the statistical model $\mathcal{M}$ contains all possible data-generating distributions for $O$. The empirical distribution for our sample is denoted $P_n$. Define a parameter $\Psi: \mathcal{M} \to \boldsymbol{\Psi} \equiv \{\Psi[P]: P \in \mathcal{M}\}$ in terms of a risk function $R$ as follows:
$$\Psi[P] = \operatorname*{argmin}_{\psi \in \boldsymbol{\Psi}} R(\psi, P).$$
In this paper, we will limit our discussion to so-called linear risk functions, where $R(\psi, P) = P L(\psi) = \int L(\psi)(o)\, dP(o)$ for some loss function $L$. For a discussion of nonlinear risk functions, see van der Laan and Dudoit [16]. We write our parameter of interest as $\psi_0 = \Psi[P_0] = \operatorname*{argmin}_{\psi} R(\psi, P_0)$, a function of the true data-generating distribution $P_0$. For many spatial prediction applications, the Mean Squared Error (MSE) is an appropriate choice for the risk function $R$, but this needn't necessarily be the case.

Define a library of $J$ base learners of the parameter of interest $\psi_0$, denoted $\{\hat{\Psi}_j: P_n \mapsto \hat{\Psi}_j[P_n]\}_{j=1}^J$. We make no restrictions on the functional form of the base learners. For example, within the context of spatial prediction, a library could consist of various Kriging and smoothing spline algorithms, Bayesian hierarchical models, mathematical models, machine learning algorithms, and other ensemble algorithms. We make a minimal assumption about the size of the library: it must be at most polynomial in sample size. Given this library of base learners, we consider a family of combining algorithms $\{\hat{\Psi}_\alpha = f(\{\hat{\Psi}_j: j\}, \alpha): \alpha\}$ indexed by a Euclidean vector $\alpha$ for some function $f$. One possible choice of combining family is the family of linear combinations, $\hat{\Psi}_\alpha = \sum_{j=1}^J \alpha(j)\, \hat{\Psi}_j$. If it is known that $\psi_0 \in [0,1]$, one might instead consider the logistic family, $\log\bigl[\hat{\Psi}_\alpha / (1 - \hat{\Psi}_\alpha)\bigr] = \sum_{j=1}^J \alpha(j) \log\bigl[\hat{\Psi}_j / (1 - \hat{\Psi}_j)\bigr]$. In either of these families, one can also constrain the values $\alpha$ can take. In this paper, we constrain ourselves to convex combinations, i.e. for all $j$, $\alpha(j) \ge 0$ and $\sum_j \alpha(j) = 1$.

Let $\{B_n\}$ be a collection of length-$n$ binary vectors that define a random partition of the observed data into a training set $\{O_i: B_n(i) = 0\}$ and a validation set $\{O_i: B_n(i) = 1\}$. The empirical probability distributions for the training and validation sets are denoted $P^0_{n,B_n}$ and $P^1_{n,B_n}$, respectively. The estimated risk of a particular estimator $\hat{\Psi}: P_n \mapsto \hat{\Psi}[P_n]$ obtained via cross-validation is defined as

$$E_{B_n} R\bigl(\hat{\Psi}[P^0_{n,B_n}],\, P^1_{n,B_n}\bigr) = E_{B_n} P^1_{n,B_n} L\bigl(\hat{\Psi}[P^0_{n,B_n}]\bigr) = E_{B_n}\!\int L\bigl(\hat{\Psi}[P^0_{n,B_n}]\bigr)(y)\, dP^1_{n,B_n}(y).$$

Given a particular class of candidate estimators indexed by $\alpha$, the cross-validation selector selects the candidate that minimizes the cross-validated risk under the empirical distribution $P_n$,

$$\alpha_n \equiv \operatorname*{argmin}_{\alpha} E_{B_n} R\bigl(\hat{\Psi}_\alpha[P^0_{n,B_n}],\, P^1_{n,B_n}\bigr).$$

The Super Learner estimate of $\psi_0$ is denoted $\hat{\Psi}_{\alpha_n}[P_n]$.
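As a point of reference, the SuperLearner R package implements this estimator. The sketch below is illustrative only: it assumes hypothetical data frames obs (outcome y and projected coordinates x1, x2) and grid (prediction locations) and a deliberately small base-learner library; the non-negative least squares method rescales the fitted weights so that they form the convex combination described above.

```r
library(SuperLearner)

## A minimal sketch, assuming hypothetical data frames `obs` (columns y, x1, x2)
## and `grid` (columns x1, x2); the base-learner library shown is illustrative only.
sl_lib <- c("SL.mean", "SL.glm", "SL.gam", "SL.randomForest")

fit <- SuperLearner(Y = obs$y,
                    X = obs[, c("x1", "x2")],
                    SL.library = sl_lib,
                    method = "method.NNLS",     # non-negative weights, normalized to sum to one
                    cvControl = list(V = 10))   # number of cross-validation folds

fit$coef                                        # estimated combining weights alpha
preds <- predict(fit, newdata = grid[, c("x1", "x2")])$pred
```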

3.1 Key theoretical results

Super Learner’s aggressive use of cross-validation is informed by a series of theoretical results originally presented in van der Laan and Dudoit [16] and expanded upon in van der Vaart et al. [17]. We provide a summary of these results below. For details and proofs, the reader is referred to these papers.

First, we define a benchmark procedure called the oracle selector, which selects the candidate estimator that minimizes the cross-validated risk under the true data-generating distribution $P_0$. We denote the oracle selector for estimators based on cross-validation training sets of size $n(1-p)$, where $p$ is the proportion of observations in the validation set, as

$$\tilde{\alpha}_n \equiv \operatorname*{argmin}_{\alpha} E_{B_n} R\bigl(\hat{\Psi}_\alpha[P^0_{n,B_n}],\, P_0\bigr).$$

van der Laan and Dudoit [16] present an oracle inequality for the cross-validation selector $\alpha_n$ in the case of random design regression. Let $L(\cdot)$ be a uniformly bounded loss function with

$$M_1 \equiv \sup_{\psi, O}\,\bigl|L(\psi)(O) - L(\psi_0)(O)\bigr| < \infty.$$

Let $d_n(\psi, \psi_0) = P_0\bigl[L(\psi) - L(\psi_0)\bigr]$ be a loss-function-based risk dissimilarity between an arbitrary predictor $\psi$ and the parameter of interest $\psi_0$, where the risk dissimilarity $d_n(\cdot)$ is quadratic in the difference between $\psi$ and $\psi_0$, i.e. $P_0\bigl[L(\psi) - L(\psi_0)\bigr]^2 \le M_2\, P_0\bigl[L(\psi) - L(\psi_0)\bigr]$. Suppose the cross-validation selector $\alpha_n$ defined above is a minimizer over a grid of $K_n$ different $\alpha$-indexed candidate estimators. Then for any real-valued $\delta > 0$,

$$E\, d_n\bigl(\hat{\Psi}_{\alpha_n}[P^0_{n,B_n}],\, \psi_0\bigr) \le (1+2\delta)\, E \min_\alpha E_{B_n} d_n\bigl(\hat{\Psi}_\alpha[P^0_{n,B_n}],\, \psi_0\bigr) + C(M_1, M_2, \delta)\,\frac{\log K_n}{n}, \tag{1}$$

where $C(\cdot)$ is a constant defined in van der Vaart et al. [17] (see also Appendix A for a definition within the context of fixed regression). Thus if the proportion of observations in the validation set, $p$, goes to zero as $n \to \infty$, and

$$\frac{(1/n)\log n}{E \min_\alpha E_{B_n} d_n\bigl(\hat{\Psi}_\alpha[P^0_{n,B_n}],\, \psi_0\bigr)} \longrightarrow 0 \quad \text{as } n \to \infty,$$

it follows that $\hat{\Psi}_{\alpha_n}$, the estimator selected by the cross-validation selector, is asymptotically equivalent to the estimator selected by the oracle, $\hat{\Psi}_{\tilde{\alpha}_n}$, when applied to training samples of size $n(1-p)$, in the sense that

$$\frac{E_{B_n} d_n\bigl(\hat{\Psi}_{\alpha_n}[P^0_{n,B_n}],\, \psi_0\bigr)}{E_{B_n} d_n\bigl(\hat{\Psi}_{\tilde{\alpha}_n}[P^0_{n,B_n}],\, \psi_0\bigr)} \longrightarrow 1 \quad \text{as } n \to \infty.$$

The oracle inequality as presented in eq. (1) shows us that if none of the base learners in the library is a correctly specified parametric model, and therefore none converges at a parametric rate, the cross-validation selector performs as well in terms of expected risk dissimilarity from the truth as the oracle selector, up to a typically second-order term bounded by $(\log K_n)/n$. If one of the base learners is a correctly specified parametric model, and thus achieves a parametric rate of convergence, the cross-validation selector converges (with respect to expected risk dissimilarity) at an almost parametric rate of $(\log K_n)/n$.

For the special case where $Y_i = Y(S_i)$ (i.e., no measurement error) and the dimension of $S$ is two, the cross-validation selector performs asymptotically as well as the oracle selector up to a constant times $(\log K_n)/n$. When $Y_i = Y(S_i)$ and the dimension of $S$, $d$, is greater than two, the rates of convergence of the base learners will be $n^{1/d}$. This is slower than $n^{1/2}$, the rate for a correctly specified parametric model, so the asymptotic equivalence of the cross-validation selector with the oracle selector applies.

The original work of van der Laan and Dudoit [16] used a random regression formulation. Spatial prediction problems where we have assumed either (2) or (3) in Section 2 above require a fixed design regression formulation. We provide a proof of the oracle inequality for the fixed design regression case in Appendix A.

The key message is that Super Learner is a data-driven, theoretically supported way to build the best possible prediction algorithm from a large, heterogeneous set of predictors. It will perform asymptotically as well as or better than the best candidate prediction algorithm under consideration. Expanding the search space to include all convex combinations of the candidates can be an important advantage in spatial prediction problems, where location is often used as a surrogate for unmeasured spatially indexed covariates. Super Learner allows one to consider sufficiently complex, flexible functions while providing protection against overfitting.

4 Cross-validation and spatial data

The theoretical results outlined above depend on the training and validation sets being independent. When this is not the case, there are generally no developed theoretical guarantees of the asymptotic performance of any cross-validation procedure [18]. Bernstein’s inequality, which van der Laan and Dudoit [16] use in developing their proof of the oracle inequality, has been extended to accommodate certain weak dependency structures, so it may be that there are ways to justify certain optimality properties of ν-fold cross-validation in these cases. There have also been some extensions to potentially useful fundamental theorems that accommodate other specific dependency structures. Lumley [19] proved an empirical process limit theorem for sparsely correlated data which can be extended to the multidimensional case. Jiang [20] provided probability bounds for uniform deviations in data with certain kinds of exponentially decaying one-dimensional dependence, although it is unclear how to extend these results to multidimensional dependency structures where sampling may be irregular. Neither of these extensions is immediately applicable to the general spatial case, where sampling may or may not be regular and the extent of spatial correlation cannot necessarily be assumed to be sparse. There has been some attention in the spatial literature to the use of cross-validation within the context of Kriging and selecting the best estimates for the parameters in a covariance function, most of it urging cautious and exploratory use [1, 21]. Todini [22] has investigated methods to provide accurate estimates of model-based Kriging error when the covariance structure has been selected via leave-one-out cross-validation, although this remains an open problem.

Recall from Section 2 above that our parameter of interest is the spatial process $Y(s)$ and we have assumed $E[Y \mid S = s] = Y(s)$. Even if $Y(s)$ is a spatially dependent stochastic process such as a Gaussian random field, the true parameter of interest in most cases is not the full stochastic process, but rather the particular realization from which we have sampled. Conditioning on this realization removes all randomness associated with the stochastic process, and any remaining randomness comes from the sampling design and measurement error. So long as the data conform to one of the statistical models outlined in Section 2, the optimality properties outlined above will apply.

5 Simulation study

We applied the Super Learner prediction algorithm to six data sets with known data-generating distributions simulated on a grid of $128 \times 128 = 16{,}384$ points in $[0,1]^2 \subset \mathbb{R}^2$. Each spatial process was simulated once; hence samples of stochastic processes were taken from a common realization. All simulated processes were scaled to $[-4, 4]$ before sampling. See Figure 1 for an illustration.

The function f1() is a mean zero stationary Gaussian random field (GRF) with Matérn covariance function [23]

$$C(h, \theta) = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{h}{\phi}\right)^{\nu} K_\nu\!\left(\frac{h}{\phi}\right) + \tau^2, \qquad \theta = \bigl(\sigma^2 = 5,\ \phi = 0.5,\ \nu = 0.5,\ \tau^2 = 0\bigr),$$

where $h$ is a distance magnitude between two spatial locations, $\sigma^2$ is a scaling parameter, $\phi > 0$ is a range parameter influencing the spatial extent of the covariance function, and $\tau^2$ is a parameter capturing micro-scale variation and/or measurement error. $K_\nu(\cdot)$ is a modified Bessel function of the third kind of order $\nu$, and $\nu > 0$ parametrizes the smoothness of the spatial covariation. Learners were given spatial location as covariates.
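For reference, the Matérn covariance defined above can be evaluated directly in R using the base besselK function; the helper below is a minimal sketch (the function name is ours), with defaults matching the $\theta$ used for $f_1$.

```r
## Minimal sketch of the Matern covariance C(h, theta) defined above.
## Defaults follow theta = (sigma2 = 5, phi = 0.5, nu = 0.5, tau2 = 0).
matern_cov <- function(h, sigma2 = 5, phi = 0.5, nu = 0.5, tau2 = 0) {
  ifelse(h == 0,
         sigma2 + tau2,  # at zero distance the correlation is 1, so C(0) = sigma2 + tau2
         sigma2 * (2^(1 - nu) / gamma(nu)) * (h / phi)^nu * besselK(h / phi, nu) + tau2)
}

matern_cov(c(0, 0.1, 0.5, 1))  # covariance at a few distances
```

With $\nu = 0.5$, as used for $f_1$, this reduces to the exponential covariance $\sigma^2 \exp(-h/\phi)$.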

The function $f_2(\cdot)$ is a smooth sinusoidal surface used as a test function in both Huang and Chen [24] and Gu [25], $f_2(s) = 1 + 3\sin\bigl(2\pi[s_1 - s_2] - \pi\bigr)$. Learners were given spatial location as covariates.

The function $f_3(\cdot)$ is a weighted nonlinear function of a spatiotemporal cyclone GRF and an exponential decay function of distances to a set of randomly chosen points in $[-0.5, 1.5]^2 \subset \mathbb{R}^2$. In addition to spatial location, learners were given the distance to the nearest point as a covariate.

Figure 1: The six spatial processes used in the simulation study. All surfaces were simulated once on the domain $[0, 1]^2$. Process values for all surfaces were scaled to $[-4, 4] \subset \mathbb{R}$.

The function $f_4(\cdot)$ is defined by the piecewise function

$$f_4(s, w) = \bigl\{|s_1 - s_2| + w\bigr\}\, I(s_1 < s_2) + \bigl\{3 s_1 \sin\bigl(5\pi[s_1 - s_2]\bigr) + w\bigr\}\, I(s_1 \ge s_2),$$

where w is Beta distributed with non-centrality parameter 3 and shape parameters 4 and 1.5. Learners were given spatial location and w as covariates.

The function $f_5(\cdot)$ is a sum of several surfaces on $[0,1]^2 \subset \mathbb{R}^2$: a nonlinear function of a random partition of $[0, 1]^2$; a piecewise smooth function; and $w_2 \sim \mathrm{Uniform}(-1, 1)$. Learners were given spatial location, partition membership ($w_1$), and $w_2$ as covariates.

The function f6() is a weighted sum of a spatiotemporal GRF with five time-points, a distance decay function of a random set of points in [0, 1]2, and a beta-distributed random variable with non-centrality parameter 0 and shape parameters both equal to 0.5. Learners were given spatial location, the five GRFs and the beta-distributed random variable as covariates.

Table 1:

A list of R packages used to build the Super Learner library for spatial prediction.

| Algorithm class | R library | Reference(s) |
| --- | --- | --- |
| DSA | DSA | Neugebauer and Bullard [27] |
| GAM | GAM | Hastie [28] |
| GP | kernlab | Karatzoglou et al. [29] |
| GBM | GBM | Ridgeway [30] |
| GLMnet | glmnet | Friedman et al. [31] |
| KNNreg | FNN | Li [32] |
| Kriging | geoR | PJ and Ribeiro [33], PJ Ribeiro and Diggle [34] |
| Polymars | polspline | Kooperberg [35] |
| Random Forest | randomForest | Liaw and Wiener [36] |
| SVM | kernlab | Karatzoglou et al. [29] |
| TPS | fields | Furrer et al. [37] |

5.1 Spatial prediction library

The library provided to Super Learner consisted of either 83 (number of covariates = 2) or 85 (number of covariates > 2) base learners from 13 general classes of prediction algorithms. We provide a brief description of each and list the parameter values used in the libraries. All algorithms were implemented in R [26]. The names of the R packages used are listed in Table 1.

Deletion/Substitution/Addition (DSA) performs data-adaptive polynomial regression using $\nu$-fold cross-validation and the $L_2$ loss [38]. Both the number of folds in the algorithm's internal cross-validation and the maximum number of terms allowed in the model (excluding the intercept) were fixed at five. The maximum order of interactions was $k \in \{3, 4\}$, and the maximum sum of powers of any single term in the model was $p \in \{5, 10\}$.

Generalized Additive Models (GAM) assume the data are generated by a model of the form $E[Y \mid X_1,\ldots,X_p] = \alpha + \sum_{i=1}^p \phi_i(X_i)$, where $Y$ is the outcome, $(X_1,\ldots,X_p)$ are covariates, and each $\phi_i(\cdot)$ is a smooth nonparametric function [39]. In this simulation study, the $\phi_i(\cdot)$ are cubic smoothing spline functions parametrized by the desired equivalent number of degrees of freedom, $\mathrm{df} \in \{2, 3, 4, 5, 6\}$. To achieve a uniformly bounded loss function, predicted values were truncated to the range of the sampled data, plus or minus one.

Table 2:

Kernels implemented in the simulation library. $\langle x, x' \rangle$ is an inner product.

| Kernel | Function $k(x, x')$ | Parameter values |
| --- | --- | --- |
| Bessel | $J_{\nu+1}(\sigma\|x - x'\|)\,/\,(\|x - x'\|)^{-d(\nu+1)}$ | $J_{\nu+1}$ is a Bessel function of the 1st kind; $(\sigma, \nu, d) \in \{1\} \times \{0.5, 1, 2\} \times \{2\}$ |
| Radial Basis Function (RBF) | $\exp(-\sigma\|x - x'\|^2)$ | Inverse kernel width $\sigma$ estimated from the data |
| Linear | $\langle x, x' \rangle$ | None |
| Polynomial | $(\alpha \langle x, x' \rangle + c)^d$ | $(\sigma, \alpha, d) \in \{1, 3\} \times \{0.001, 0.1, 1\} \times \{1\}$ |
| Hyperbolic tangent | $\tanh(\alpha \langle x, x' \rangle + c)$ | $(\alpha, c) \in \{0.005, 0.002, 0.01\} \times \{0.25, 1\}$ |

Gaussian Processes (GP) assume the observed data are normally distributed with a covariance structure that can be represented as a kernel matrix [40]. Various implementations of the Bessel, Gaussian radial basis, linear and polynomial kernels were used. See Table 2 for details about the kernel functions and parameter values. Predicted values were truncated to the range of the observed data, plus or minus one, to achieve a uniformly bounded loss function.

Generalized Boosted Modeling (GBM) combines regression trees, which model the relationship between an outcome and predictors by recursive binary splits, and boosting, an adaptive method for combining many weak predictors into a single prediction ensemble [41]. The GBM predictor can be thought of as an additive regression model fitted in a forward stage-wise fashion, where each term in the model is a simple tree. We used the following parameter values: number of trees = 10,000; shrinkage parameter $\lambda = 0.001$; bag fraction (subsampling rate) = 0.5; minimum number of observations in the terminal nodes of each tree = 10; interaction depth $d \in \{1, 2, 3, 4, 5, 6\}$, where an interaction depth of $d$ implies a model with up to $d$-way interactions.
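For concreteness, a single base learner of this class can be fit with the gbm package listed in Table 1 roughly as follows; the data frames train and test and the particular interaction depth shown are placeholders, not the study's actual code.

```r
library(gbm)

## Sketch of one GBM base learner using the parameter values listed above;
## `train` and `test` are hypothetical data frames with outcome y and covariates.
fit <- gbm(y ~ ., data = train, distribution = "gaussian",
           n.trees = 10000,          # number of trees
           shrinkage = 0.001,        # learning rate lambda
           bag.fraction = 0.5,       # subsampling rate
           n.minobsinnode = 10,      # minimum observations per terminal node
           interaction.depth = 3)    # one of d in {1, ..., 6}

preds <- predict(fit, newdata = test, n.trees = 10000)
```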

GLMnet is a GLM fitted via penalized maximum likelihood with elastic-net mixing parameter $\alpha \in \{1/4, 1/2, 3/4\}$ [31].

K-Nearest Neighbor Regression (KNNreg) assumes the unobserved spatial process at a prediction point $s$ can be well approximated by an average of the observed spatial process values at the $k$ nearest sampled locations to $s$, with $k \in \{1, 5, 10, 20\}$. When $k = 1$ and $S$ consists of spatial locations only, this is essentially equivalent to Thiessen polygons.

Kriging is perhaps the most commonly used spatial prediction approach. A general formulation of the spatial model assumed by Kriging can be written as $Y(s) = \mu(s) + \delta(s)$, $\delta(s) \sim N(0, C(\theta))$. The first term represents the large-scale mean trend, assumed to be deterministic and continuous. The second term is a Gaussian random function with mean zero and positive semi-definite covariance function $C(\theta)$ satisfying a stationarity assumption. The Kriging predictor is given as a linear combination of the observed data, $\hat{\Psi}(s) = \sum_{i=1}^n w_i(s)\, Y(s_i)$. The weights $\{w_i\}_{i=1}^n$ are chosen so that $\mathrm{Var}\bigl[\hat{\Psi}(s) - Y(s)\bigr]$ is minimized, subject to the constraint that the predictions are unbiased. Thus, given a parametric covariance function with known parameters $\theta$ and a known mean structure, a Kriging predictor computes the best linear unbiased predictor of $Y(s)$. For the Kriging base learners, the parametric covariance function was assumed to be spherical,

$$C(h, \theta) = \tau^2 + \sigma^2\left\{1 - \frac{2}{\pi}\left[\sin^{-1}\!\left(\frac{h}{\phi}\right) + \frac{h}{\phi}\sqrt{1 - \left(\frac{h}{\phi}\right)^{2}}\,\right]\right\} I(h < \phi).$$

The nugget $\tau^2$, scale $\sigma^2$, and range $\phi$ were estimated using Restricted Maximum Likelihood (for details about REML, see for example Gelfand et al. [42], chapter 4, pp 48–49). The trend was assumed to be one of the following: constant (traditional Ordinary Kriging, OK); a first-order polynomial of the locations (traditional Universal Kriging, UK); a weighted linear combination of non-location covariates only (if any); a weighted linear combination of both locations and non-location covariates (if any). All libraries contained the first and second Kriging algorithms. Libraries for simulated processes with additional covariates contained the third and fourth algorithms as well.
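As an illustration, one such Kriging base learner can be fit with the geoR package from Table 1 via REML and then used for prediction; coords, y, and pred_grid below are hypothetical inputs, and only the constant-trend (ordinary Kriging) case is shown.

```r
library(geoR)

## Minimal ordinary-Kriging sketch: REML covariance fit, then prediction.
## `coords` (n x 2 locations), `y` (observations), and `pred_grid` are hypothetical.
gd <- as.geodata(cbind(coords, y))                    # columns 1:2 = coords, 3 = data
ml <- likfit(gd, trend = "cte", cov.model = "spherical",
             ini.cov.pars = c(1, 0.5),                # starting values for (sigma2, phi)
             lik.method = "REML")
kc <- krige.conv(gd, locations = pred_grid,
                 krige = krige.control(obj.model = ml))
head(kc$predict)                                      # Kriging predictions
```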

Multivariate adaptive polynomial spline regression (Polymars) is an adaptive regression procedure using piecewise linear splines to model the spatial process, and is parametrized by the maximum model size $m = \min\bigl\{6 n^{1/3},\ n/4,\ 100\bigr\}$, where $n$ is the sample size [43].

The Random Forest algorithm proposed by Breiman [44] is an ensemble approach that averages together the predictions of many regression trees constructed by drawing B bootstrap samples and for each sample, growing an unpruned regression tree where at each node, the best split among a subset of q randomly selected covariates is chosen. In our implementation, B was set to 1000, the minimum size of the terminal nodes was 5, and the number of randomly sampled variables at each split was p, where p was the number of covariates.

The library contained a number of Support Vector Machines (SVM), each implementing one of two types of regression (epsilon regression, $\epsilon = 0.1$; or nu regression, $\nu = 0.2$), and one of five kernels: Bessel, Gaussian radial basis, linear, polynomial, and hyperbolic tangent. The kernels are described in Table 2. Predicted values were truncated to the range of the observed data, plus or minus one, to ensure a bounded loss, and the cost of constraint violation was fixed at 1.

Thin-plate splines (TPS) is another common approach to spatial prediction. The observed data are presumed to be generated by a deterministic process $Y(s) = g(s)$, where $g(\cdot)$ is an $m$-times differentiable deterministic function with $m > d/2$ and $\dim(s) = d$. The estimator of $g(\cdot)$ is the minimizer of a penalized sum of squares,

$$\hat{g} = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{i=1}^n \bigl(Y_i - g(s_i)\bigr)^2 + \lambda J_m(g), \tag{2}$$

with $d$-dimensional roughness penalty

$$J_m(g) = \int_{\mathbb{R}^d} \sum_{(\upsilon_1,\ldots,\upsilon_d)} \binom{m}{\upsilon_1,\ldots,\upsilon_d} \left(\frac{\partial^m g(s)}{\partial s_1^{\upsilon_1} \cdots \partial s_d^{\upsilon_d}}\right)^{2} ds,$$

where the sum is taken over all nonnegative integers $(\upsilon_1,\ldots,\upsilon_d)$ such that $\sum_{i=1}^d \upsilon_i = m$ [45]. The tuning parameter $\lambda \in [0, \infty)$ in (2) controls the permitted degree of roughness for $\hat{g}$. As $\lambda$ tends to zero, the predicted surface approaches one that exactly interpolates the observed data. Larger values of $\lambda$ allow the roughness penalty term to dominate, and as $\lambda$ approaches infinity, $\hat{g}$ tends toward a multivariate least squares estimator. In our library, the smoothing parameter was either fixed to $\lambda \in \{0, 0.0001, 0.001, 0.01, 0.1\}$ or estimated data-adaptively using Generalized Cross-Validation (GCV) (see Craven and Wahba [46] for a description of the GCV procedure). Predicted values were truncated to the range of the observed data, plus or minus one, to ensure a bounded loss.
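These TPS learners correspond closely to the Tps function in the fields package listed in Table 1; a minimal sketch, assuming hypothetical objects coords (an n × 2 matrix of locations), y (observations), and grid (prediction locations):

```r
library(fields)

## Minimal sketch of the two kinds of TPS base learners described above;
## `coords`, `y`, and `grid` are hypothetical inputs, not from the paper.
fit_gcv   <- Tps(coords, y)                  # smoothing parameter chosen by GCV
fit_fixed <- Tps(coords, y, lambda = 0.001)  # one of the fixed lambda values
preds     <- predict(fit_gcv, grid)
```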

The library also contained a main terms Generalized Linear Model (GLM) and a simple empirical mean function.

5.2 Simulation procedure

Our simulation study examined the effect of sample size ($n \in \{64, 100, 529\}$), signal-to-noise ratio (SNR), and sampling scheme. SNR was defined as the ratio of the sample variance of the spatial process to the variance of additive zero-mean normally distributed noise representing measurement error. Processes were simulated with either no added noise or with noise added to achieve an SNR of 4. We examined three sampling schemes: simple random sampling (SRS), random regular sampling (RRS), and stratified sampling (SS). Random regular samples were regularly spaced subsets of the 16,384-point grid with the initial point selected at random. Stratified random samples were taken by first dividing the domain $[0, 1]^2$ into $n$ equal-area bins and then randomly selecting a single point from each bin.

The following procedure was repeated 100 times for each combination of spatial process, sample size, SNR level, and sampling design, giving a total of 10,800 simulations:

  1. Sample $n$ locations and any associated covariates and process values from the grid of $16{,}384$ points in $[0,1]^2 \subset \mathbb{R}^2$ according to one of the three sampling designs described above.

  2. For those simulations with SNR = 4, draw $n$ i.i.d. samples of the random variable $\epsilon \sim N(0, \sigma_\epsilon^2)$ and add them to the $n$ sampled process values $\{Y_1,\ldots,Y_n\}$, where $\sigma_\epsilon^2$ has been calculated to achieve an SNR of 4.

  3. Pass the sampled values to Super Learner, along with a library of base learners on which to train. The number of folds $\nu$ used in the cross-validation procedure depended on $n$: if $n = 64$, then $\nu = 64$; if $n = 100$, then $\nu = 20$; if $n = 529$, then $\nu = 10$. Super Learner uses cross-validation and the $L_2$ loss function to estimate the risk of each candidate predictor and returns an estimate of the optimal convex combination of the predictions made by all base learners according to their cross-validated risk.

  4. For each base learner in the library and for the trained Super Learner, predict the spatial process under consideration at all unsampled points. Calculate mean squared errors (MSEs) and then divide these by the variance of the spatial process. We refer to this measure of performance as the Fraction of Variance Unexplained (FVU). This makes it reasonable to compare prediction performances across different spatial processes.
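The noise calibration in step 2 and the FVU summary in step 4 can be sketched with two small helper functions (names and implementation are ours, shown only to make the calculations explicit):

```r
## Step 2: add mean-zero Gaussian noise calibrated to a target signal-to-noise ratio.
add_noise <- function(y, snr = 4) {
  sigma2_eps <- var(y) / snr                    # noise variance giving the desired SNR
  y + rnorm(length(y), mean = 0, sd = sqrt(sigma2_eps))
}

## Step 4: Fraction of Variance Unexplained = MSE / variance of the true process.
fvu <- function(y_true, y_hat) mean((y_true - y_hat)^2) / var(y_true)
```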

Table 3:

Average FVUs (standard deviations in parentheses) from the simulation study for each algorithm class. FVUs were calculated from predictions made on all unsampled points at each iteration. Algorithms are ordered according to overall performance.

| Algorithm Class | Overall | n = 64 | n = 100 | n = 529 | SNR: none | SNR: 4 | SRS | RRS | SS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Super Learner | 0.24 | 0.40 | 0.25 | 0.07 | 0.22 | 0.27 | 0.26 | 0.25 | 0.22 |
| | (0.22) | (0.26) | (0.15) | (0.06) | (0.22) | (0.22) | (0.23) | (0.24) | (0.18) |
| TPS (GCV) | 0.42 | 0.58 | 0.44 | 0.24 | 0.40 | 0.45 | 0.46 | 0.41 | 0.40 |
| | (0.36) | (0.39) | (0.35) | (0.25) | (0.37) | (0.35) | (0.38) | (0.37) | (0.34) |
| Krige (UK) | 0.44 | 0.59 | 0.51 | 0.21 | 0.42 | 0.46 | 0.42 | 0.53 | 0.36 |
| | (0.30) | (0.28) | (0.27) | (0.20) | (0.31) | (0.29) | (0.30) | (0.31) | (0.28) |
| Random Forest | 0.45 | 0.56 | 0.49 | 0.29 | 0.44 | 0.46 | 0.48 | 0.42 | 0.45 |
| | (0.26) | (0.24) | (0.25) | (0.21) | (0.26) | (0.26) | (0.27) | (0.24) | (0.26) |
| Krige (OK) | 0.45 | 0.62 | 0.53 | 0.21 | 0.43 | 0.47 | 0.41 | 0.59 | 0.36 |
| | (0.32) | (0.29) | (0.28) | (0.20) | (0.32) | (0.31) | (0.29) | (0.33) | (0.28) |
| KNNreg | 0.50 | 0.67 | 0.56 | 0.27 | 0.47 | 0.53 | 0.53 | 0.48 | 0.49 |
| | (0.34) | (0.34) | (0.33) | (0.21) | (0.35) | (0.33) | (0.35) | (0.33) | (0.34) |
| TPS | 0.53 | 0.64 | 0.56 | 0.37 | 0.49 | 0.56 | 0.58 | 0.49 | 0.52 |
| | (0.37) | (0.40) | (0.37) | (0.30) | (0.38) | (0.37) | (0.40) | (0.35) | (0.37) |
| GBM | 0.54 | 0.69 | 0.57 | 0.36 | 0.53 | 0.55 | 0.55 | 0.54 | 0.54 |
| | (0.30) | (0.26) | (0.25) | (0.29) | (0.30) | (0.30) | (0.30) | (0.30) | (0.30) |
| DSA | 0.61 | 0.68 | 0.62 | 0.54 | 0.60 | 0.63 | 0.64 | 0.60 | 0.60 |
| | (0.28) | (0.31) | (0.26) | (0.23) | (0.26) | (0.29) | (0.31) | (0.26) | (0.26) |
| GAM | 0.65 | 0.70 | 0.65 | 0.60 | 0.64 | 0.66 | 0.68 | 0.63 | 0.64 |
| | (0.30) | (0.31) | (0.30) | (0.29) | (0.30) | (0.31) | (0.32) | (0.29) | (0.30) |
| GLMnet | 0.69 | 0.71 | 0.69 | 0.67 | 0.69 | 0.69 | 0.70 | 0.69 | 0.69 |
| | (0.25) | (0.24) | (0.25) | (0.24) | (0.25) | (0.25) | (0.25) | (0.24) | (0.24) |
| GLM | 0.69 | 0.71 | 0.69 | 0.67 | 0.69 | 0.69 | 0.70 | 0.68 | 0.69 |
| | (0.25) | (0.25) | (0.25) | (0.24) | (0.25) | (0.25) | (0.25) | (0.24) | (0.24) |
| Polymars | 0.73 | 0.84 | 0.78 | 0.56 | 0.71 | 0.74 | 0.76 | 0.70 | 0.71 |
| | (0.36) | (0.40) | (0.33) | (0.29) | (0.34) | (0.38) | (0.40) | (0.34) | (0.34) |
| SVM | 0.76 | 0.83 | 0.80 | 0.66 | 0.76 | 0.77 | 0.78 | 0.76 | 0.76 |
| | (0.30) | (0.30) | (0.31) | (0.27) | (0.30) | (0.30) | (0.31) | (0.30) | (0.30) |
| GP | 0.77 | 0.89 | 0.80 | 0.61 | 0.74 | 0.80 | 0.80 | 0.76 | 0.76 |
| | (0.67) | (0.68) | (0.60) | (0.69) | (0.62) | (0.71) | (0.67) | (0.68) | (0.66) |
| Mean | 1.01 | 1.01 | 1.01 | 1.00 | 1.01 | 1.01 | 1.01 | 1.00 | 1.00 |
| | (0.01) | (0.02) | (0.01) | (0.00) | (0.01) | (0.01) | (0.02) | (0.01) | (0.01) |

5.3 Simulation results

Table 5 in Appendix B lists the average performance of each individual base learner in the library, and Table 3 summarizes prediction performance for each algorithm class in the library and for Super Learner itself. Super Learner was clearly the best predictor overall when comparing across broad classes, with an average FVU of 0.24 (SD = 0.22). The next best performing algorithm class was thin-plate splines using GCV to choose the roughness penalty, with an average FVU of 0.42 (SD = 0.36). Universal Kriging (FVU = 0.44), Random Forest (FVU = 0.45), and Ordinary Kriging (FVU = 0.45) all performed similarly, which was slightly less well than TPS (GCV). Super Learner was also the best performer across noise conditions, sampling designs, and sample sizes, with performance improving markedly as sample size increased.

Table 4 breaks algorithm-class performance down by simulated surface. $f_1$ was a mean-zero GRF, something we would expect both Kriging and thin-plate spline algorithms to predict well. TPS (GCV) and Super Learner were the best performers, with nearly identical average FVUs of 0.11 (SD = 0.06). The other TPS algorithms and Universal Kriging fared slightly less well, with an average FVU of 0.15. Ordinary Kriging had an average FVU of 0.26, which was actually greater than the average FVUs for Random Forest (0.16), K-nearest neighbors regression (0.19), GBM (0.22), GAM (0.24), and DSA (0.25).

$f_2$ was a simple sinusoidal surface, another functional form where we would expect thin-plate splines to excel, provided the samples properly captured the periodicity of the process. TPS (GCV) had the best overall performance, with an average FVU of 0.07 (SD = 0.09). Super Learner performed only slightly less well, with an average FVU of 0.09 (SD = 0.11). The other TPS algorithms (0.23), Ordinary Kriging (0.24), and Universal Kriging (0.25) performed substantially less well on average.

$f_3$ was a relatively complex function involving a “cyclone” Gaussian random field and a distance decay function of randomly selected points. Once again, the average performances of TPS (GCV) and Super Learner were nearly identical (FVU = 0.30, SD = 0.11).

$f_4$ was a smooth, heterogeneous process. TPS (GCV) (average FVU = 0.42), Super Learner (0.43), Ordinary Kriging (0.45), and Universal Kriging (0.46) all performed similarly.

$f_5$ was a clustered, rough surface we would expect to be well suited to K-nearest neighbors, GBM, and Random Forest. In fact, all three of these algorithm classes had nearly identical performances, with an average FVU of 0.47. Super Learner, however, had an average FVU of 0.22 (SD = 0.14), which was dramatically better than any of the other algorithm classes. The Ordinary (average FVU = 0.70) and Universal (0.68) Kriging algorithms had similar average performances to GAM (0.62), GLM (0.67), GLMnet (0.67), and DSA (0.68). Not surprisingly, TPS (GCV) and TPS with fixed $\lambda$ did poorly, with average FVUs of 0.91 and 1.01, respectively.

Table 4:

Average FVU (standard deviation in parentheses) by spatial process.

| Algorithm Class | f1 | f2 | f3 | f4 | f5 | f6 |
| --- | --- | --- | --- | --- | --- | --- |
| Super Learner | 0.11 | 0.09 | 0.30 | 0.43 | 0.22 | 0.31 |
| | (0.06) | (0.11) | (0.11) | (0.36) | (0.14) | (0.19) |
| TPS (GCV) | 0.11 | 0.07 | 0.30 | 0.42 | 0.91 | 0.72 |
| | (0.06) | (0.09) | (0.11) | (0.36) | (0.17) | (0.23) |
| Krige (UK) | 0.15 | 0.25 | 0.37 | 0.46 | 0.68 | 0.47 |
| | (0.11) | (0.33) | (0.20) | (0.32) | (0.23) | (0.28) |
| Random Forest | 0.16 | 0.31 | 0.41 | 0.89 | 0.47 | 0.46 |
| | (0.06) | (0.18) | (0.12) | (0.15) | (0.14) | (0.09) |
| Krige (OK) | 0.26 | 0.24 | 0.39 | 0.45 | 0.70 | 0.47 |
| | (0.31) | (0.33) | (0.24) | (0.32) | (0.23) | (0.28) |
| KNNreg | 0.19 | 0.29 | 0.44 | 0.92 | 0.47 | 0.70 |
| | (0.10) | (0.26) | (0.16) | (0.29) | (0.34) | (0.19) |
| TPS | 0.15 | 0.23 | 0.38 | 0.60 | 1.01 | 0.78 |
| | (0.07) | (0.24) | (0.14) | (0.35) | (0.23) | (0.19) |
| GBM | 0.22 | 0.65 | 0.49 | 0.97 | 0.47 | 0.46 |
| | (0.07) | (0.36) | (0.13) | (0.08) | (0.24) | (0.08) |
| DSA | 0.25 | 0.72 | 0.53 | 1.03 | 0.68 | 0.48 |
| | (0.05) | (0.25) | (0.08) | (0.15) | (0.11) | (0.08) |
| GAM | 0.24 | 1.05 | 0.49 | 1.02 | 0.62 | 0.49 |
| | (0.02) | (0.08) | (0.04) | (0.09) | (0.08) | (0.12) |
| GLMnet | 0.37 | 1.01 | 0.67 | 0.99 | 0.67 | 0.44 |
| | (0.01) | (0.02) | (0.03) | (0.03) | (0.03) | (0.03) |
| GLM | 0.37 | 1.02 | 0.67 | 0.98 | 0.67 | 0.44 |
| | (0.01) | (0.03) | (0.02) | (0.03) | (0.03) | (0.03) |
| Polymars | 0.28 | 0.94 | 0.60 | 1.11 | 0.78 | 0.64 |
| | (0.10) | (0.30) | (0.19) | (0.25) | (0.20) | (0.34) |
| SVM | 0.49 | 0.87 | 0.71 | 1.05 | 0.80 | 0.66 |
| | (0.28) | (0.27) | (0.20) | (0.15) | (0.19) | (0.33) |
| GP | 0.28 | 0.64 | 0.57 | 1.31 | 1.01 | 0.81 |
| | (0.10) | (0.42) | (0.20) | (0.61) | (0.63) | (1.02) |
| Mean | 1.00 | 1.01 | 1.01 | 1.00 | 1.01 | 1.01 |
| | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |

$f_6$ was a somewhat rough surface constructed from a Gaussian random field and point-source distance decay functions. As expected, Kriging with trend $w_1,\ldots,w_6$ had the best performance on average, with an FVU of 0.25 (SD = 0.14), closely followed by Kriging with trend $s, w_1,\ldots,w_6$ (average FVU = 0.26, SD = 0.15). Super Learner had the next best average performance, with an average FVU of 0.31 (SD = 0.19). GLM, GLMnet, GBM, Random Forest, the Ordinary and Universal Kriging algorithms, and DSA all performed slightly less well and similarly to one another, with average FVUs from 0.44 to 0.48. TPS (GCV) and TPS with fixed $\lambda$ were at a disadvantage given the roughness of the surface, with average FVUs of 0.72 and 0.78, respectively.

These simulation results clearly illustrate some of the chief advantages of Super Learner as a spatial predictor. For surfaces that were perfectly suited for one or more base learners in the library, Super Learner either performed almost as well as the best base learner, or it outperformed its library. For more complex, rougher surfaces, Super Learner performed significantly better than any single base learner in the library. It had the best overall performance even at the smallest sample size, and appeared to be relatively insensitive to sampling strategy.

6 Practical example: predicting lake acidity

We applied Super Learner to a lake acidity data set previously analyzed by Gu [25] and Huang and Chen [24]. Increases in water acidity are known to have a deleterious effect on lake ecology. Having an accurate estimate of the spatial distribution of lake acidity is an essential first step toward crafting effective regulatory interventions to control it. The data were sampled by the U.S. Environmental Protection Agency during the Fall of 1984 in the Blue Ridge region of the Southeastern United States [47], and consist of longitudes and latitudes (in degrees), calcium ion concentrations (in milligrams per liter), and pH values. The EPA used a systematic stratified sampling design which we treated as fixed here. Because only one sample per lake was collected, we assume some measurement error that is independent of lake pH, calcium ion concentration, and spatial location. The data are freely available in the R package gss [48]. We used the same nearly equal area projection as Gu [25] and Huang and Chen [24],

$$x_1 = \cos\!\left(\frac{\pi x_{\mathrm{lat}}}{180}\right)\sin\!\left(\frac{\pi (x_{\mathrm{lon}} - \bar{x}_{\mathrm{lon}})}{180}\right), \qquad x_2 = \sin\!\left(\frac{\pi (x_{\mathrm{lat}} - \bar{x}_{\mathrm{lat}})}{180}\right),$$

where $\bar{x}_{\mathrm{lat}}$ and $\bar{x}_{\mathrm{lon}}$ are the midpoints of the latitude and longitude ranges, respectively.
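A minimal R sketch of this projection, with function and argument names of our own choosing:

```r
## Nearly equal-area projection used above; lat/lon are vectors of degrees.
project_xy <- function(lat, lon) {
  lat_mid <- mean(range(lat))                  # midpoint of the latitude range
  lon_mid <- mean(range(lon))                  # midpoint of the longitude range
  x1 <- cos(pi * lat / 180) * sin(pi * (lon - lon_mid) / 180)
  x2 <- sin(pi * (lat - lat_mid) / 180)
  cbind(x1 = x1, x2 = x2)
}
```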

Let $x_i = (x_{i,1}, x_{i,2})$ denote the $i$th sampling location; $w_i$ denote the calcium ion concentration observed at the $i$th sampling location; and $Y_i$ be the pH value observed at the $i$th sampling location. We assume that $E[Y_i \mid S_i = s] = Y(s)$, where $S_i = (x_i, w_i)$. Our objective is to learn the lake pH spatial process from the data.

The library used to predict lake acidity was similar in composition to the simulation library described in Subsection 5.1, with some important differences. We reduced the number of parameterizations for some of the algorithm classes in the library. We used one DSA learner, which used 10-fold cross-validation and considered polynomials of up to five terms ($m = 5$), each term being at most a two-way interaction ($k = 2$) with a maximum sum of powers $p = 3$. We used a reduced number of parameterizations of the GAM, GBM, TPS, GP, and SVM learners as well. We also included screening algorithms that allowed us to train learners on specific subsets of covariates: $x$, $w$, $\log w$, $(x, w)$, and $(x, \log w)$. We considered the $L_2$ loss function, and the predictions from all base learners were truncated to the observed pH range in order to ensure a uniformly bounded loss.
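In the SuperLearner R package, covariate subsets of this kind are specified by pairing learners with screening functions in SL.library; the sketch below shows one illustrative screener restricted to hypothetical location columns x1 and x2, not the screeners actually used in the analysis.

```r
## Illustrative screening wrapper: a screener returns a logical vector indicating
## which columns of X a paired learner is allowed to use.
screen.location <- function(Y, X, ...) {
  colnames(X) %in% c("x1", "x2")
}

## Pair learners with screeners (or use all covariates by default).
sl_lib <- list(c("SL.gam", "screen.location"),   # GAM on location only
               "SL.randomForest",                # all covariates
               "SL.mean")
```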

Table 6 in Appendix B provides a detailed list of the library and shows performance results for each base learner as well as Super Learner itself. Figure 2 provides graphical representations of Super Learner's pH predictions. The Super Learner predictions are a slightly smoothed version of the observed data, with attenuated predictions for the highest and lowest observations. We also cross-validated Super Learner (V = 5) to get an unbiased estimate of its risk. Super Learner had nearly identical performance in terms of cross-validated risk to GAM (4 degrees of freedom, using x and log w) and both versions of Random Forest. The GBM predictors and OK (using log w) had slightly smaller cross-validated risks.

Many of the algorithms in the library performed better when given log w as opposed to w, but for those algorithms like GBM and Random Forest that were not attempting to fit some kind of polynomial trend, logging the calcium ion concentration made little difference in performance. This is not surprising, given the relationship between the activities of Ca²⁺ and H⁺. As expected, most algorithms had cross-validated risk estimates that were worse than their empirical risk estimates calculated from predictions made after training on the full data set. The Kriging algorithms, for instance, were all exact interpolators when trained on the full data, and thus had estimated empirical MSEs of 0, whereas their MSEs estimated via cross-validation ranged from 0.07 (FVU = 0.46) to 0.11 (FVU = 0.72). The Gaussian processes with RBF kernel had the most pronounced differences between the two risk estimates. For example, GP (RBF) trained on the covariates (x, w) had an empirical MSE of 0.01 (FVU = 0.08) and a cross-validated MSE of 0.22 (FVU = 1.46).

One should use caution in general when interpreting the weights returned by the Super Learner algorithm. It can be tempting to use them to try to draw conclusions about the underlying data generating process. However, these weights are not necessarily stable, and we would expect them to differ given a different sample, or a different partitioning during cross-validation. Performance guarantees are with respect to the ensemble, and not the ranking of individual algorithms whose predictions are being averaged.

Figure 2: (a) A map of Super Learner’s pH predictions, and (b) a plot of Super Learner’s predictions as a function of the observed data. Super Learner mildly attenuated the pH values at either end of the range, but otherwise provided a fairly close fit to the data.

7 Conclusion/discussion

In this article, we have demonstrated the use of an ensemble learner for spatial prediction that uses cross-validation to optimally combine the predictions from multiple, heterogeneous base learners. We have reviewed important theoretical results giving performance bounds that imply Super Learner will perform asymptotically at least as well as the best candidate in the library. We discussed the assumptions required for these optimality properties to hold. These assumptions are reasonable for many measurement error scenarios and commonly implemented spatial sampling designs, including various forms of stratified and random regular sampling.

Our simulations and practical data analysis used comparatively modest sample sizes. With the increased availability of massive environmental sensor networks and extremely large remotely sensed imagery, scalability of a spatial prediction method is often an important practical concern. The cross-validation step of Super Learner is a so-called ’embarrassingly parallel’ problem, and there are implementations of Super Learner that parallelize this step (see for instance the function CV.SuperLearner in the SuperLearner R package). There are also a number of versions of the Super Learner algorithm under active development that are intended specifically to cope with very large data sets. One is called h2oEnsemble [49], and is a stand-alone R package that makes use of a Java-based machine learning platform that can be used for massive cluster computing. Another version of Super Learner has been modified to work with streaming data [50].

In this paper, we have not addressed dependent sampling designs, where sampling at one point changes the probability of sampling at another point. This is an important area for future research. We also limited our scope to the case where measurement error is at least conditionally mean-zero. Spatially structured measurement error that is not conditionally mean zero is a common problem in many spatial prediction applications, and there have been a number of attempts to alter the cross-validation procedure to accommodate it [51, 15]. These proposed techniques generally require one to estimate the error correlation structure from the data or to know it a priori. How well these algorithms perform if the correlation extent is substantially underestimated is unknown. Ideally, it would be best to have a stronger theoretical understanding of how the degree of dependence between training and validation sets affects cross-validated risk estimates both asymptotically and in finite samples. This is an important future area for research.

APPENDIX A. Oracle Inequality for Independent, Nonidentical Experiments and Quadratic Loss

Let $O^n = (O_1,\ldots,O_n) \sim P_0^n$ be a vector of independent, nonidentical observations, where each $O_i = (X_i, Y_i)$ consists of two components: a $d$-dimensional covariate vector $X_i \in \mathbb{R}^d$, and a univariate outcome $Y_i \in \mathbb{R}$. We associate with each $O_i$ an index $s_i \in \mathcal{S}$. The true unknown data-generating distribution for each $O_i$ is denoted $P_{0,O_i}(O) = P_{0,O|S}(O \mid s_i) \in \{P_{0,O|s}: s \in \mathcal{S}\}$. Let $P_{s,O}$ be the joint distribution of $(S, O)$, defined by a degenerate marginal distribution of $S$, $I(S = s)$, and the conditional distribution of $O$ given $S = s$, $P_{O|s}$. We can formulate $O^n$ as $n$ independent draws $(S_i, O_i) \sim P_{0,(s_i,O_i)}$, $i = 1,\ldots,n$, with empirical probability distribution $P_n$. Let $\mathcal{M}$ be a set of possible probability distributions of $P_{O|S}$. Define a parameter $\Psi: \mathcal{M} \to \boldsymbol{\Psi}$, and let $\psi_0 = \Psi(P_{0,O|S})$ be the true value of that parameter. Let $B_n \in \{0,1\}^n$ be a random vector indicating splits into a training sample, $\{i: B_n(i) = 0\}$, and a validation sample, $\{i: B_n(i) = 1\}$. Let $p = \frac{1}{n}\sum_{i=1}^n B_n(i)$ be the proportion of observations in the validation sample, and let $P^0_{n,B_n}$ and $P^1_{n,B_n}$ be the empirical distributions of the training and validation samples, respectively. Define an average joint distribution $\bar{P}^1_{0,B_n} = (np)^{-1}\sum_{i: B_n(i) = 1} P_{0,(s_i,O_i)}$. Let $L(\psi)(S, O)$ be a loss function such that for all $i$, $P_{0,(s_i,O_i)} L(\psi_0) = \min_{\psi \in \boldsymbol{\Psi}} P_{0,(s_i,O_i)} L(\psi)$. Let $\{\hat{\Psi}_k(P_n): k = 1,\ldots,K_n\}$ be a set of $K_n$ estimators of $\psi_0$. Assume $P\bigl(\hat{\Psi}_k(P_n) \in \boldsymbol{\Psi}\bigr) = 1$ for all $k = 1,\ldots,K_n$. We write the true cross-validated risk of $\psi_0$ as $\tilde{\Theta}_{\mathrm{opt}} = E_{B_n}\bigl[\bar{P}^1_{0,B_n} L(\psi_0)\bigr]$. We denote the true conditional cross-validated risk of any estimator $\hat{\Psi}_k$ as

$$\tilde{\Theta}_{n(1-p)}(k) \equiv E_{B_n} \bar{P}^1_{0,B_n} L\bigl(\hat{\Psi}_k[P^0_{n,B_n}]\bigr) = E_{B_n} \frac{1}{np}\sum_{i: B_n(i)=1} P_{0,(O_i|s_i)} L\bigl(\hat{\Psi}_k[P^0_{n,B_n}]\bigr)(s_i, O_i),$$

and a benchmark (oracle) selector as $\tilde{k}_{n(1-p)} = \operatorname*{argmin}_k \tilde{\Theta}_{n(1-p)}(k)$.

We denote the cross-validated risk of any estimator Ψˆk as

$$\hat{\Theta}_{n(1-p)}(k) \equiv E_{B_n} P^1_{n,B_n} L\bigl(\hat{\Psi}_k[P^0_{n,B_n}]\bigr) = E_{B_n} \frac{1}{np}\sum_{i: B_n(i)=1} L\bigl(\hat{\Psi}_k[P^0_{n,B_n}]\bigr)(s_i, O_i),$$

and the cross-validation selector as $k_n = \operatorname*{argmin}_k \hat{\Theta}_{n(1-p)}(k)$. Finally, we define a loss-based dissimilarity $d_n(\psi, \psi_0) \equiv E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\psi) - L(\psi_0)\bigr\}$.

Assumptions

  1. There exists a real-valued $M_1 < \infty$ such that $\sup_{\psi \in \boldsymbol{\Psi}}\bigl\{\sup_{i, s_i, O_i} |L(\psi)(s_i, O_i) - L(\psi_0)(s_i, O_i)|\bigr\} \le M_1$, where the supremum over $O_i$ is taken over the support of the distribution $P_{0,O_i|s_i}$ of $O_i$.

  2. There exists a real-valued $M_2 < \infty$ such that

$$\sup_{i,\, \psi \in \boldsymbol{\Psi}} \frac{\mathrm{Var}_{P_{0,(s_i,O_i)}}\bigl[L(\psi) - L(\psi_0)\bigr](S, O)}{E_{P_{0,(s_i,O_i)}}\bigl[L(\psi) - L(\psi_0)\bigr](S, O)} \le M_2.$$
Definitions

We define the following constants:

$$\bar{M}_1 = 2 M_1, \qquad C(M_1, M_2, \delta) \equiv 2(1+\delta)^2\left(\frac{\bar{M}_1}{3} + \frac{M_2}{\delta}\right).$$

Finite sample result. For any $\delta > 0$, we have

$$E\, d_n\bigl(\hat{\Psi}_{k_n}[P^0_{n,B_n}],\, \psi_0\bigr) \le (1+2\delta)\, E\, d_n\bigl(\hat{\Psi}_{\tilde{k}_{n(1-p)}}[P^0_{n,B_n}],\, \psi_0\bigr) + 2\, C(M_1, M_2, \delta)\,\frac{1 + \log K_n}{np}. \tag{3}$$
Asymptotic implications. Equation (3) has the following asymptotic implications:

$$\text{If } \frac{\log K_n}{np}\Big/ E\Bigl[\tilde{\Theta}_{n(1-p)}\bigl(\tilde{k}_{n(1-p)}\bigr) - \tilde{\Theta}_{\mathrm{opt}}\Bigr] \to 0, \text{ then } \frac{E\Bigl[\tilde{\Theta}_{n(1-p)}(k_n) - \tilde{\Theta}_{\mathrm{opt}}\Bigr]}{E\Bigl[\tilde{\Theta}_{n(1-p)}\bigl(\tilde{k}_{n(1-p)}\bigr) - \tilde{\Theta}_{\mathrm{opt}}\Bigr]} \to 1; \tag{4}$$

$$\text{if } \frac{\log K_n}{np}\Big/ \Bigl[\tilde{\Theta}_{n(1-p)}\bigl(\tilde{k}_{n(1-p)}\bigr) - \tilde{\Theta}_{\mathrm{opt}}\Bigr] \to_P 0, \text{ then } \frac{\tilde{\Theta}_{n(1-p)}(k_n) - \tilde{\Theta}_{\mathrm{opt}}}{\tilde{\Theta}_{n(1-p)}\bigl(\tilde{k}_{n(1-p)}\bigr) - \tilde{\Theta}_{\mathrm{opt}}} \to_P 1.$$

Equation (4) follows from the fact that, given a sequence of random variables $X_1, X_2, \ldots$ and a positive function $g(n)$, $E|X_n| = O\{g(n)\}$ implies $X_n = O_P\{g(n)\}$. This is a direct consequence of Markov's inequality.

Proof of theorem

We have

(5) $0 \le \tilde{\Theta}_{n(1-p)}(k_n) - \tilde{\Theta}_{\mathrm{opt}} = E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\} - (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\} + (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(6) $\quad \le E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\} - (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\} + (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{\tilde{k}_{n(1-p)}}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(7) $\quad = E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(8) $\qquad - (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{k_n}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(9) $\qquad + (1+\delta)\, E_{B_n} P^1_{n,B_n}\bigl\{L(\hat{\Psi}_{\tilde{k}_{n(1-p)}}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(10) $\qquad - (1+2\delta)\, E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_{\tilde{k}_{n(1-p)}}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$
(11) $\qquad + (1+2\delta)\, E_{B_n} \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_{\tilde{k}_{n(1-p)}}[P^0_{n,B_n}]) - L(\psi_0)\bigr\}$

Equation (5) follows from the definition of $\tilde{\Theta}_{\mathrm{opt}}$. Equation (6) follows from the definition of the cross-validation selector $k_n$, which satisfies $\hat{\Theta}_{n(1-p)}(k_n) \le \hat{\Theta}_{n(1-p)}(k)$ for all $k$. Let $R_{n,k_n}$ represent the first two terms in the last expression, (7) and (8). Let $T_{n,\tilde{k}_{n(1-p)}}$ represent the second two terms of the last expression, (9) and (10). The last term, (11), is the benchmark and can be written as $(1+2\delta)\bigl(\tilde{\Theta}_{n(1-p)}(\tilde{k}_{n(1-p)}) - \tilde{\Theta}_{\mathrm{opt}}\bigr)$. Hence,

$$0 \le \tilde{\Theta}_{n(1-p)}(k_n) - \tilde{\Theta}_{\mathrm{opt}} \le (1+2\delta)\bigl(\tilde{\Theta}_{n(1-p)}(\tilde{k}_{n(1-p)}) - \tilde{\Theta}_{\mathrm{opt}}\bigr) + R_{n,k_n} + T_{n,\tilde{k}_{n(1-p)}}. \tag{12}$$

We now show that $E R_{n,k_n} + E T_{n,\tilde{k}_{n(1-p)}} \le 2\, C(M_1, M_2, \delta)(1 + \log K_n)/(np)$. We introduce the following notation:

$$\begin{aligned}
\hat{H}_k &\equiv P^1_{n,B_n}\bigl\{L(\hat{\Psi}_k[P^0_{n,B_n}]) - L(\psi_0)\bigr\}, \\
\tilde{H}_k &\equiv \bar{P}^1_{0,B_n}\bigl\{L(\hat{\Psi}_k[P^0_{n,B_n}]) - L(\psi_0)\bigr\}, \\
R_{n,k}(B_n) &\equiv (1+\delta)\bigl(\tilde{H}_k - \hat{H}_k\bigr) - \delta \tilde{H}_k, \\
T_{n,k}(B_n) &\equiv (1+\delta)\bigl(\hat{H}_k - \tilde{H}_k\bigr) - \delta \tilde{H}_k.
\end{aligned}$$

Note that $R_{n,k} = E_{B_n}[R_{n,k}(B_n)]$, $T_{n,k} = E_{B_n}[T_{n,k}(B_n)]$, and that by the definition of $\psi_0$, $\tilde{H}_k \ge 0$ for all $k$. Note also that given an arbitrary $k \in \{1, \ldots, K_n\}$,

$$P\bigl(R_{n,k_n}(B_n) > s \mid P^0_{n,B_n}, B_n\bigr) = P\left(\bigl|\tilde{H}_{k_n} - \hat{H}_{k_n}\bigr| > \frac{s + \delta\tilde{H}_{k_n}}{1+\delta} \,\Big|\, P^0_{n,B_n}, B_n\right) \le K_n \max_k P\left(\bigl|\tilde{H}_k - \hat{H}_k\bigr| > \frac{s + \delta\tilde{H}_k}{1+\delta} \,\Big|\, P^0_{n,B_n}, B_n\right).$$

Similarly for Tn,k˜n(1p)(Bn),

$$P\bigl(T_{n,\tilde{k}_{n(1-p)}}(B_n) > s \mid P^0_{n,B_n}, B_n\bigr) \le K_n \max_k P\left(\bigl|\hat{H}_k - \tilde{H}_k\bigr| > \frac{s + \delta\tilde{H}_k}{1+\delta} \,\Big|\, P^0_{n,B_n}, B_n\right).$$

Conditional on $P^0_{n,B_n}$ and $B_n$, consider the $np$ random variables, for those $i$ with $B_n(i) = 1$, $Z_{k,i} \equiv \bigl\{L(\hat{\Psi}_k[P^0_{n,B_n}]) - L(\psi_0)\bigr\}(s_i, O_i)$. We can rewrite $\hat{H}_k$ and $\tilde{H}_k$ in terms of the $Z_{k,i}$,

$$\hat{H}_k = \frac{1}{np}\sum_{i=1}^{np} Z_{k,i}, \qquad \tilde{H}_k = \frac{1}{np}\sum_{i=1}^{np} E\bigl[Z_{k,i} \mid P^0_{n,B_n}, B_n\bigr].$$

Then $\tilde{H}_k - \hat{H}_k$ is the mean of $np$ centered (mean-zero) random variables. By assumption A1 above, the random variables $Z_{k,i}$ are bounded, with $|Z_{k,i}| \le M_1$ a.s. By assumption A2, we also have

$$\sigma^2_{k,i} \equiv \mathrm{Var}\bigl[Z_{k,i} \mid P^0_{n,B_n}, B_n\bigr] \le M_2\, E\bigl[Z_{k,i} \mid P^0_{n,B_n}, B_n\bigr],$$

which implies

$$\sigma^2_k \equiv \frac{1}{np}\sum_{i=1}^{np}\sigma^2_{k,i} \le M_2\, \frac{1}{np}\sum_{i=1}^{np} E\bigl[Z_{k,i} \mid P^0_{n,B_n}, B_n\bigr] = M_2 \tilde{H}_k.$$

We will apply Bernstein's inequality to the centered empirical mean $\tilde{H}_k - \hat{H}_k$ and obtain a tail probability bounded by $\exp\{-npq/c\}$, where $c$ is a finite, real-valued constant. This will show that the risk dissimilarities converge at a rate of $(\log K_n)/(np)$. We state Bernstein's inequality for ease of reference. A proof is given in Lemma A.2 on page 594 in Györfi, Kohler, and Walk [52].

Lemma 1

Bernstein’s inequality.

Let $Z_i$, $i = 1,\ldots,n$, be independent, real-valued random variables such that $Z_i \in [a, b]$ with probability one. Let $0 < \sum_{i=1}^n \mathrm{Var}(Z_i)/n \le \sigma^2$. Then, for all $\epsilon > 0$,

$$P\left(\frac{1}{n}\sum_{i=1}^n (Z_i - E Z_i) > \epsilon\right) \le \exp\left(-\frac{n\epsilon^2}{2\bigl(\sigma^2 + \epsilon(b-a)/3\bigr)}\right).$$

This implies

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n (Z_i - E Z_i)\right| > \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2\bigl(\sigma^2 + \epsilon(b-a)/3\bigr)}\right).$$

By Bernstein’s lemma, for q>0,

$$\begin{aligned}
P\bigl(R_{n,k}(B_n) > q \mid P^0_{n,B_n}, B_n\bigr) &= P\left(\bigl|\tilde{H}_k - \hat{H}_k\bigr| > \frac{1}{1+\delta}\Bigl(q + \delta\tilde{H}_k\Bigr) \,\Big|\, P^0_{n,B_n}, B_n\right) \\
&\le P\left(\bigl|\tilde{H}_k - \hat{H}_k\bigr| > \frac{1}{1+\delta}\Bigl(q + \delta\frac{\sigma_k^2}{M_2}\Bigr) \,\Big|\, P^0_{n,B_n}, B_n\right) \\
&\le \exp\left(-\frac{np}{2(1+\delta)^2}\,\frac{\bigl(q + \delta\sigma_k^2/M_2\bigr)^2}{\sigma_k^2 + \frac{\bar{M}_1}{3(1+\delta)}\bigl(q + \delta\sigma_k^2/M_2\bigr)}\right).
\end{aligned}$$

Note that

$$\frac{\bigl(q + \delta\sigma_k^2/M_2\bigr)^2}{\sigma_k^2 + \frac{\bar{M}_1}{3(1+\delta)}\bigl(q + \delta\sigma_k^2/M_2\bigr)} \ge \frac{\bigl(q + \delta\sigma_k^2/M_2\bigr)^2}{\bigl(q + \delta\sigma_k^2/M_2\bigr)\Bigl(\frac{M_2}{\delta} + \frac{\bar{M}_1}{3}\Bigr)} \ge \frac{q}{\frac{M_2}{\delta} + \frac{\bar{M}_1}{3}},$$

since $\sigma_k^2 \le (M_2/\delta)\bigl(q + \delta\sigma_k^2/M_2\bigr)$ and $1/(1+\delta) \le 1$.

This shows that for $q > 0$,

$$P\bigl[R_{n,k_n}(B_n) > q \mid P^0_{n,B_n}, B_n\bigr] \le K_n \exp\bigl\{-npq / C(M_1, M_2, \delta)\bigr\}.$$

In particular, this provides us with a bound for the marginal probability of $R_{n,k_n}(B_n)$,

$$P\bigl[R_{n,k_n}(B_n) > q\bigr] \le K_n \exp\bigl\{-npq / C(M_1, M_2, \delta)\bigr\}.$$

As in the proof of theorem 1 in van der Laan, Dudoit, and Keles [53] and Dudoit and van der Laan [54], for each $u > 0$, we have

$$E\, R_{n,k_n} \le u + \int_u^{\infty} K_n \exp\bigl\{-npq / C(M_1, M_2, \delta)\bigr\}\, dq.$$

The minimum is attained at $u_n = C(M_1, M_2, \delta)\log K_n/(np)$ and is given by $C(M_1, M_2, \delta)(\log K_n + 1)/(np)$. Thus $E R_{n,k_n} \le C(M_1, M_2, \delta)(1 + \log K_n)/(np)$. The same applies to $E T_{n,\tilde{k}_{n(1-p)}}$.

Taking the expected values of the quantities in (12) yields the following finite sample result:

$$0 \le E\bigl[\tilde{\Theta}_{n(1-p)}(k_n) - \tilde{\Theta}_{\mathrm{opt}}\bigr] \le (1+2\delta)\, E\bigl[\tilde{\Theta}_{n(1-p)}(\tilde{k}_{n(1-p)}) - \tilde{\Theta}_{\mathrm{opt}}\bigr] + 2\, C(M_1, M_2, \delta)\,\frac{1 + \log K_n}{np}.$$

This completes the proof. ■

APPENDIX B. Tables

Table 5:

Simulation results for full library. For each algorithm, the average Fraction of Variance Unexplained (Avg FVU; standard deviation in parentheses) is the FVU averaged over all spatial processes, sample sizes, sampling designs, and noise conditions. At each iteration, MSEs were calculated using all unsampled locations. Note that of the eight Kriging algorithms, only two were used to predict all spatial processes.

| Algorithm | Parameters | Avg FVU |
| --- | --- | --- |
| Super Learner | | 0.24 (0.22) |
| DSA (v, m, k, p) | (5, 5, 3, 10) | 0.62 (0.29) |
| | (5, 5, 3, 5) | 0.61 (0.26) |
| | (5, 5, 4, 10) | 0.62 (0.28) |
| | (5, 5, 4, 5) | 0.61 (0.26) |
| GAM (degree) | 2 | 0.65 (0.28) |
| | 3 | 0.64 (0.29) |
| | 4 | 0.65 (0.30) |
| | 5 | 0.65 (0.31) |
| | 6 | 0.66 (0.32) |
| GBM (degree) | 1 | 0.64 (0.28) |
| | 2 | 0.56 (0.28) |
| | 3 | 0.53 (0.30) |
| | 4 | 0.52 (0.30) |
| | 5 | 0.51 (0.31) |
| | 6 | 0.50 (0.31) |
| GLM | | 0.69 (0.25) |
| GLMnet (α) | 0.25 | 0.69 (0.25) |
| | 0.5 | 0.69 (0.25) |
| | 0.75 | 0.69 (0.25) |
| GP (Bessel) | (1, 0.5, 2) | 1.12 (1.52) |
| | (1, 1, 2) | 0.81 (0.81) |
| | (1, 2, 2) | 0.74 (0.67) |
| GP (linear) | | 0.69 (0.25) |
| GP (poly) | (1, 0.001, 1) | 0.69 (0.25) |
| | (1, 0.01, 1) | 0.69 (0.25) |
| | (1, 0.1, 1) | 0.69 (0.25) |
| | (1, 1, 1) | 0.69 (0.25) |
| | (3, 0.001, 1) | 0.66 (0.27) |
| | (3, 0.01, 1) | 0.65 (0.29) |
| | (3, 0.1, 1) | 0.83 (0.65) |
| | (3, 1, 1) | 0.84 (0.68) |
| GP (RBF) | | 0.92 (0.90) |
| KNNreg (k) | 1 | 0.53 (0.38) |
| | 5 | 0.40 (0.31) |
| | 10 | 0.48 (0.32) |
| | 20 | 0.60 (0.33) |
| Kriging (trend) | s | 0.46 (0.34) |
| f3, f4 only | s, w1 | 0.41 (0.26) |
| f5 only | s, w1, w2 | 0.51 (0.15) |
| f6 only | s, w1, …, w6 | 0.26 (0.15) |
| | none | 0.49 (0.35) |
| f3, f4 only | w1 | 0.42 (0.28) |
| f5 only | w1, w2 | 0.52 (0.16) |
| f6 only | w1, …, w6 | 0.25 (0.14) |
| Mean | | 1.01 (0.01) |
| Polymars | | 0.73 (0.36) |
| Random Forest | | 0.45 (0.26) |
| SVM (Bessel; eps) | (1, 1, 1) | 0.65 (0.27) |
| | (1, 1, 2) | 0.57 (0.28) |
| | (1, 2, 1) | 0.66 (0.27) |
| | (1, 2, 2) | 0.59 (0.27) |
| SVM (Bessel; nu) | (1, 1, 1) | 0.67 (0.27) |
| | (1, 1, 2) | 0.62 (0.27) |
| | (1, 2, 1) | 0.68 (0.27) |
| | (1, 2, 2) | 0.63 (0.27) |
| SVM (linear; eps) | | 0.71 (0.25) |
| SVM (linear; nu) | | 0.72 (0.28) |
| SVM (poly; eps) | (1, 0.001, 1) | 0.92 (0.12) |
| | (1, 0.1, 1) | 0.71 (0.25) |
| | (1, 1, 1) | 0.71 (0.25) |
| | (3, 0.001, 1) | 0.85 (0.17) |
| | (3, 0.1, 1) | 0.64 (0.28) |
| | (3, 1, 1) | 0.83 (0.61) |
| SVM (poly; nu) | (1, 0.001, 1) | 0.97 (0.08) |
| | (1, 0.1, 1) | 0.71 (0.25) |
| | (1, 1, 1) | 0.72 (0.28) |
| | (3, 0.001, 1) | 0.91 (0.14) |
| | (3, 0.1, 1) | 0.66 (0.29) |
| | (3, 1, 1) | 0.92 (0.71) |
| SVM (RBF; eps) | | 0.48 (0.34) |
| SVM (RBF; nu) | | 0.50 (0.32) |
| SVM (tanh; eps) | (0.01, 0.25) | 0.76 (0.21) |
| | (0.01, 1) | 0.82 (0.18) |
| | (0.005, 0.25) | 0.81 (0.19) |
| | (0.005, 1) | 0.87 (0.16) |
| | (0.002, 0.25) | 0.88 (0.15) |
| | (0.002, 1) | 0.93 (0.11) |
| SVM (tanh; nu) | (0.01, 0.25) | 0.82 (0.19) |
| | (0.01, 1) | 0.88 (0.16) |
| | (0.005, 0.25) | 0.88 (0.16) |
| | (0.005, 1) | 0.93 (0.12) |
| | (0.002, 0.25) | 0.94 (0.11) |
| | (0.002, 1) | 0.98 (0.07) |
| TPS (λ) | (GCV) | 0.42 (0.36) |
| | 0 | 0.52 (0.45) |
| | 0.0001 | 0.44 (0.39) |
| | 0.001 | 0.44 (0.35) |
| | 0.01 | 0.54 (0.33) |
| | 0.1 | 0.69 (0.28) |
Table 6:

Lake acidity results for full library. S denotes the variable subset each algorithm was given. Risks were estimated via cross-validation (CV) or on the full dataset (Full). β are the convex weights assigned to each algorithm in the Super Learner predictor.

| Algorithm | S | CV: MSE (FVU) | Full: MSE (FVU) | β |
| --- | --- | --- | --- | --- |
| Super Learner | | 0.08 (0.50) | 0.00 (0.03) | |
| DSA | | | | |
| | x, w | 0.13 (0.85) | 0.12 (0.80) | 0 |
| | x, w | 0.09 (0.57) | 0.07 (0.46) | 0 |
| GAM (degree) | | | | |
| 2 | x, w | 0.09 (0.58) | 0.07 (0.49) | 0 |
| | x, w | 0.08 (0.51) | 0.07 (0.45) | 0 |
| 3 | x, w | 0.08 (0.54) | 0.07 (0.43) | 0 |
| | x, w | 0.08 (0.51) | 0.06 (0.41) | 0 |
| 4 | x, w | 0.08 (0.53) | 0.06 (0.40) | 0 |
| | x, w | 0.08 (0.50) | 0.06 (0.39) | 0 |
| GBM (degree) | | | | |
| 2 | x, w | 0.07 (0.49) | 0.05 (0.32) | 0 |
| | x, w | 0.07 (0.49) | 0.05 (0.32) | 0 |
| 4 | x, w | 0.08 (0.50) | 0.04 (0.29) | 0 |
| | x, w | 0.07 (0.49) | 0.05 (0.32) | 0 |
| 6 | x, w | 0.07 (0.49) | 0.04 (0.28) | 0.12 |
| | x, w | 0.07 (0.49) | 0.04 (0.29) | 0 |
| GLM | | | | |
| | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| GLMnet (α) | | | | |
| 0.25 | x, w | 0.12 (0.79) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.55) | 0.08 (0.50) | 0 |
| 0.5 | x, w | 0.11 (0.73) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| 0.75 | x, w | 0.11 (0.75) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| GP (Bessel) | | | | |
| (1, 0.5, 2) | x, w | 0.13 (0.83) | 0.04 (0.25) | 0 |
| | x, w | 0.16 (1.03) | 0.03 (0.21) | 0 |
| (1, 1, 2) | x, w | 0.14 (0.90) | 0.04 (0.27) | 0 |
| | x, w | 0.15 (0.97) | 0.03 (0.22) | 0 |
| (1, 2, 2) | x, w | 0.16 (1.08) | 0.04 (0.29) | 0 |
| | x, w | 0.17 (1.10) | 0.04 (0.25) | 0 |
| GP (linear) | | | | |
| | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| GP (RBF) | | | | |
| | x, w | 0.22 (1.45) | 0.02 (0.16) | 0 |
| | x, w | 0.22 (1.46) | 0.01 (0.08) | 0 |
| GP (poly.) | | | | |
| (1, 0.001, 1) | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| (1, 0.01, 1) | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| (1, 0.1, 1) | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| (1, 1, 1) | x, w | 0.11 (0.74) | 0.10 (0.67) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.50) | 0 |
| KNNreg (k) | | | | |
| 1 | x | 0.17 (1.12) | 0.00 (0.00) | 0.02 |
| | x, w | 0.11 (0.73) | 0.00 (0.00) | 0.08 |
| 5 | x | 0.12 (0.76) | 0.08 (0.52) | 0 |
| | x, w | 0.08 (0.55) | 0.06 (0.38) | 0.04 |
| 10 | x | 0.11 (0.73) | 0.09 (0.61) | 0 |
| | x, w | 0.08 (0.53) | 0.07 (0.43) | 0 |
| 20 | x | 0.11 (0.72) | 0.10 (0.66) | 0.03 |
| | x, w | 0.09 (0.56) | 0.08 (0.50) | 0 |
| Kriging | | | | |
| (OK) | | 0.11 (0.71) | 0.00 (0.00) | 0 |
| | w | 0.09 (0.56) | 0.00 (0.00) | 0 |
| | w | 0.07 (0.46) | 0.00 (0.00) | 0.58 |
| (UK) | x | 0.11 (0.72) | 0.00 (0.00) | 0 |
| | x, w | 0.09 (0.60) | 0.00 (0.00) | 0 |
| | x, w | 0.08 (0.51) | 0.00 (0.00) | 0 |
| Mean | | 0.15 (1.00) | 0.15 (1.00) | 0 |
| Polymars | | | | |
| | x, w | 0.10 (0.63) | 0.04 (0.27) | 0 |
| | x, w | 0.08 (0.56) | 0.05 (0.36) | 0 |
| RF | | | | |
| | x, w | 0.08 (0.50) | 0.02 (0.11) | 0.06 |
| | x, w | 0.08 (0.50) | 0.02 (0.12) | 0 |
| SVM (Bessel; eps) | | | | |
| (1, 1, 2) | x, w | 0.09 (0.57) | 0.06 (0.43) | 0 |
| | x, w | 0.08 (0.55) | 0.06 (0.42) | 0 |
| (1, 2, 1) | x, w | 0.09 (0.56) | 0.08 (0.51) | 0 |
| | x, w | 0.08 (0.52) | 0.07 (0.46) | 0 |
| (1, 2, 2) | x, w | 0.09 (0.56) | 0.07 (0.45) | 0 |
| | x, w | 0.08 (0.55) | 0.07 (0.44) | 0 |
| SVM (Bessel; nu) | | | | |
| (1, 1, 2) | x, w | 0.09 (0.57) | 0.07 (0.48) | 0 |
| | x, w | 0.09 (0.61) | 0.07 (0.46) | 0 |
| (1, 2, 1) | x, w | 0.1 (0.64) | 0.08 (0.56) | 0 |
| | x, w | 0.08 (0.56) | 0.07 (0.48) | 0 |
| (1, 2, 2) | x, w | 0.09 (0.59) | 0.07 (0.49) | 0 |
| | x, w | 0.09 (0.58) | 0.07 (0.47) | 0 |
| SVM (poly, eps) | | | | |
| (1, 0.001, 1) | x, w | 0.15 (0.97) | 0.14 (0.96) | 0 |
| | x, w | 0.14 (0.94) | 0.14 (0.92) | 0 |
| (1, 0.1, 1) | x, w | 0.12 (0.78) | 0.10 (0.69) | 0 |
| | x, w | 0.08 (0.52) | 0.08 (0.50) | 0 |
| (1, 1, 1) | x, w | 0.12 (0.78) | 0.11 (0.69) | 0 |
| | x, w | 0.08 (0.53) | 0.08 (0.50) | 0.08 |
| (3, 0.001, 1) | x, w | 0.14 (0.92) | 0.13 (0.89) | 0 |
| | x, w | 0.12 (0.81) | 0.12 (0.78) | 0 |
| SVM (poly, nu) | | | | |
| (1, 0.001, 1) | x, w | 0.15 (0.98) | 0.15 (0.97) | 0 |
| | x, w | 0.15 (0.97) | 0.14 (0.95) | 0 |
| (1, 0.1, 1) | x, w | 0.11 (0.73) | 0.11 (0.70) | 0 |
| | x, w | 0.08 (0.55) | 0.08 (0.53) | 0 |
| (1, 1, 1) | x, w | 0.11 (0.74) | 0.11 (0.70) | 0 |
| | x, w | 0.08 (0.54) | 0.08 (0.52) | 0 |
| (3, 0.001, 1) | x, w | 0.14 (0.94) | 0.14 (0.92) | 0 |
| | x, w | 0.13 (0.89) | 0.13 (0.87) | 0 |
| TPS (GCV) | x | 0.11 (0.71) | 0.08 (0.53) | 0 |

References

1. Cressie N. Statistics for spatial data, revised ed. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley and Sons, Inc., 1993. doi:10.1002/9781119115151

2. Schabenberger O, Gotway C. Statistical methods for spatial data analysis. Texts in Statistical Science. Boca Raton: Chapman & Hall/CRC, 2005.

3. Zaier I, Shu C, Ouarda T, Seidou O, Chebana F. Estimation of ice thickness on lakes using artificial neural network ensembles. J Hydrol 2010;383:330–40. doi:10.1016/j.jhydrol.2010.01.006

4. Chen J, Wang C. Using stacked generalization to combine SVMs in magnitude and shape feature spaces for classification of hyperspectral data. IEEE Trans Geosci Remote Sensing 2009;47:2193–205. doi:10.1109/TGRS.2008.2010491

5. Rossi M, Guzzetti F, Reichenbach P, Mondini AC, Peruccacci S. Optimal landslide susceptibility zonation based on multiple forecasts. Geomorphology 2010;114:129–42. doi:10.1016/j.geomorph.2009.06.020

6. Kleiber W, Raftery A, Gneiting T. Geostatistical model averaging for locally calibrated probabilistic quantitative precipitation forecasting. J Am Stat Assoc 2011;106:1291–303. doi:10.1198/jasa.2011.ap10433

7. Polley E, van der Laan M. Super learner in prediction. U.C. Berkeley Division of Biostatistics Working Paper Series, 2010.

8. Wolpert D. Stacked generalization. Neural Networks 1992;5:241–59. doi:10.1016/S0893-6080(05)80023-1

9. Breiman L. Stacked regressions. Machine Learning 1996;24:49–64. doi:10.1007/BF00117832

10. LeBlanc M, Tibshirani R. Combining estimates in regression and classification. J Am Stat Assoc 1996;91:1641–50. doi:10.1080/01621459.1996.10476733

11. Stone M. Cross-validatory choice and assessment of statistical procedures. J R Stat Soc Ser B 1974;36:111–47.

12. Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc 1975;70:320–8. doi:10.1080/01621459.1975.10479865

13. Polley E, Rose S, van der Laan M. Targeted learning: causal inference for observational and experimental data. New York: Springer, chapter 3: Super Learning, 43–65, 2011.

14. Opsomer J, Wang Y, Yang Y. Nonparametric regression with correlated errors. Stat Sci 2001;16:134–53. doi:10.1214/ss/1009213287

15. Francisco-Fernandez M, Opsomer J. Smoothing parameter selection methods for nonparametric regression with spatially correlated errors. Can J Stat 2005;33:279–95. doi:10.1002/cjs.5550330208

16. van der Laan M, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series, 2003.

17. van der Vaart A, Dudoit S, van der Laan M. Oracle inequalities for multi-fold cross validation. Stat Decisions 2006;24:351–71. doi:10.1524/stnd.2006.24.3.351

18. Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv 2010;4:44–79. doi:10.1214/09-SS054

19. Lumley T. An empirical process limit theorem for sparsely correlated data. UW Biostatistics Working Paper Series, 2005.

20. Jiang W. On uniform deviations of general empirical risks with unboundedness, dependence, and high dimensionality. J Mach Learning Res 2009;10:977–96.

21. Davis B. Uses and abuses of cross-validation in geostatistics. Math Geol 1987;19:241–8. doi:10.1007/BF00897749

22. Todini E. Influence of parameter estimation on uncertainty in Kriging: Part 1 – theoretical development. Hydrol Earth Syst Sci 2001;5:215–23. doi:10.5194/hess-5-215-2001

23. Matérn B. Spatial variation, 2nd ed. New York: Springer, 1986. doi:10.1007/978-1-4615-7892-5

24. Huang H, Chen C. Optimal geostatistical model selection. J Am Stat Assoc 2007;102:1009–24. doi:10.1198/016214507000000491

25. Gu C. Smoothing spline ANOVA models. New York: Springer, 2002. doi:10.1007/978-1-4757-3683-0

26. R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: http://www.R-project.org/, 2012.

27. Neugebauer R, Bullard J. DSA: Deletion/Substitution/Addition algorithm. Available at: http://www.stat.berkeley.edu/laan/Software/, R package version 3.1.4, 2010.

28. Hastie T. gam: generalized additive models. Available at: http://CRAN.R-project.org/package=gam, R package version 1.04.1, 2011.

29. Karatzoglou A, Smola A, Hornik K, Zeileis A. kernlab – an S4 package for kernel methods in R. J Stat Software 2004;11:1–20. Available at: http://www.jstatsoft.org/v11/i09/. doi:10.18637/jss.v011.i09

30. Ridgeway G. gbm: generalized boosted regression models. Available at: http://CRAN.R-project.org/package=gbm, R package version 1.6-3.1, 2010.

31. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Software 2010;33. Available at: http://www.jstatsoft.org/v33/i01/. doi:10.18637/jss.v033.i01

32. Li S. FNN: fast nearest neighbor search algorithms and applications. Available at: http://CRAN.R-project.org/package=FNN, R package version 0.6-3, 2012.

33. Diggle PJ, Ribeiro PJ Jr. Model-based geostatistics. New York: Springer, 2007.

34. Ribeiro PJ Jr, Diggle PJ. geoR: a package for geostatistical analysis. R News 2001;1:14–18. Available at: http://CRAN.R-project.org/doc/Rnews/.

35. Kooperberg C. polspline: polynomial spline routines. Available at: http://CRAN.R-project.org/package=polspline, R package version 1.1.5, 2010.

36. Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2:18–22. Available at: http://CRAN.R-project.org/doc/Rnews/.

37. Furrer R, Nychka D, Sain S. fields: tools for spatial data. Available at: http://CRAN.R-project.org/package=fields, R package version 6.6.1, 2011.

38. Sinisi S, van der Laan M. The deletion/substitution/addition algorithm in loss function based estimation. Stat Appl Genet Mol Biol 2004;3:Article 18. doi:10.2202/1544-6115.1069

39. Hastie T. Statistical models in S. Wadsworth and Brooks/Cole, chapter 7: Generalized additive models, 1991.

40. Williams C. Learning in graphical models. Cambridge, MA: The MIT Press, chapter: Prediction with Gaussian processes: from linear regression to linear prediction and beyond, 599–621, 1999. doi:10.1007/978-94-011-5014-9_23

41. Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29:1189–232. doi:10.1214/aos/1013203451

42. Gelfand A, Diggle P, Fuentes M, Guttorp P, editors. Handbook of spatial statistics. Boca Raton: CRC Press, 2010. doi:10.1201/9781420072884

43. Stone C, Hansen M, Kooperberg C, Truong Y. The use of polynomial splines and their tensor products in extended linear modeling (with discussion). Ann Stat 1997;25:1371–470.

44. Breiman L. Random forests. Machine Learning 2001;45:5–32. doi:10.1023/A:1010933404324

45. Green P, Silverman B. Nonparametric regression and generalized linear models. Number 58 in Monographs on Statistics and Applied Probability. Boca Raton: Chapman & Hall/CRC, 1994. doi:10.1007/978-1-4899-4473-3

46. Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math 1979;31:377–403. doi:10.1007/BF01404567

47. Eilers J, Landers D, Brakke D. Chemical and physical characteristics of lakes in the southeastern United States. Environ Sci Technol 1988;22:172–7. doi:10.1021/es00167a006

48. Gu C. gss: general smoothing splines. Available at: http://CRAN.R-project.org/package=gss, R package version 2.0-11, 2012.

49. LeDell E. h2oEnsemble. Available at: http://www.stat.berkeley.edu/~ledell/software.html, 2015.

50. Lendle S. OnlineSuperLearner. Available at: https://github.com/lendle/OnlineSuperLearner.jl, 2015.

51. Carmack P, Schucany W, Spence J, Gunst R, Lin Q, Haley R. Far casting cross-validation. J Comput Graph Stat 2009;18:879–93. doi:10.1198/jcgs.2009.07034

52. Györfi L, Kohler M, Walk H. A distribution-free theory of nonparametric regression. Springer Series in Statistics. New York: Springer, 2002. doi:10.1007/b97848

53. van der Laan M, Dudoit S, Keles S. Asymptotic optimality of likelihood based cross-validation. Stat Appl Genet Mol Biol 2004;3:Article 4. doi:10.2202/1544-6115.1036

54. Dudoit S, van der Laan M. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat Methodol 2005;2:131–54. doi:10.1016/j.stamet.2005.02.003

Published Online: 2016-4-29
Published in Print: 2016-5-1

©2016 by De Gruyter
