Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

Random forest and similar machine learning techniques are already used to generate spatial predictions, but the spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still present in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) in which buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can produce predictions as accurate and unbiased as different versions of kriging. Advantages of RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Disadvantages of RFsp are the exponentially growing computational intensity as calibration data and covariates increase, and the high sensitivity of predictions to input data quality.
The key to the success of the RFsp framework might be the training data quality: especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in the data) and the quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with a smaller number of points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.


Installing and loading packages
To run this tutorial it is recommended to install ranger (Wright & Ziegler, 2017) directly from GitHub:

```r
devtools::install_github("imbs-hl/ranger")
```

Quantile regression random forest and derivation of standard errors using Jackknifing are available from ranger version >0.9.4. A number of other packages are also used throughout the tutorial. We also load a number of local functions prepared for the purpose of this tutorial:

```r
source('./RF_vs_kriging/R/RFsp_functions.R')
```
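Since several features used below require ranger >0.9.4, it can be useful to check the installed version before running the examples. A minimal base-R sketch (the `installed` value is hard-coded here for illustration; in practice you would use `packageVersion("ranger")`):

```r
# Minimum ranger version mentioned above for quantile RF and Jackknife SEs
min.version <- numeric_version("0.9.4")

# installed <- packageVersion("ranger")   # use this once ranger is installed
installed <- numeric_version("0.10.0")    # hypothetical example value

# numeric_version() compares multi-part version strings component-wise,
# so "0.10.0" correctly ranks above "0.9.4"
installed >= min.version
```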

Data sets in use
This tutorial uses several data sets that are either available from the R packages listed or can be loaded from the /RF_vs_kriging/data directory. This is the complete list of data sets used in the tutorial and the scientific paper:

- The National Geochemical Survey data set (see folder ./RF_vs_kriging/data/geochem): 2858 points with measurements of Pb, Cu, K and Mg covering the US states Illinois and Indiana;
- Boulder Colorado daily precipitation (see folder ./RF_vs_kriging/data/st_prec): 176,467 measurements of daily precipitation for the period 2014 to 2017.
To be able to plot or export predicted values as maps, we add them to the spatial pixels object. We can compare the RFsp approach with model-based geostatistics (see e.g. the geoR package), where we first decide about the transformation and then fit a variogram of the target variable (Brown, 2015; Diggle & Ribeiro Jr, 2007):

```r
zinc.geo <- as.geodata(meuse["zinc"])
ini.v <- c(var(log1p(zinc.geo$data)), 500)
zinc.vgm <- likfit(zinc.geo, lambda=0, ini=ini.v, cov.model="exponential")
## kappa not used for the exponential correlation function
zinc.vgm
## likfit: estimated model parameters:
##       beta      tausq    sigmasq        phi
##  " 6.1553"  " 0.0164"  " 0.5928" "500.0001"
## Practical Range with cor=0.05 for asymptotic range: 1497.866
##
## likfit: maximised log-likelihood = -1014
```

where the likfit function fits a log-likelihood based variogram. Note that we need to specify the log-transformation manually via the lambda parameter. To generate predictions and the kriging variance using geoR we run:

```r
locs = meuse.grid@coords
zinc.ok <- krige.conv(zinc.geo, locations=locs, krige=krige.control(obj.model=zinc.vgm))
## krige.conv: model with constant mean
## krige.conv: performing the Box-Cox data transformation
## krige.conv: back-transforming the predicted mean and variance
## krige.conv: Kriging performed using global neighbourhood
meuse.grid$zinc_ok = zinc.ok$predict
meuse.grid$zinc_ok_range = sqrt(zinc.ok$krige.var)
```

In this case geoR automatically back-transforms values to the original scale, which is a recommended feature. A comparison of predictions and prediction error maps produced using geoR (ordinary kriging) and RFsp (with buffer distances and by just using coordinates) is given below. From the plot above, it can be concluded that RFsp gives very similar results to ordinary kriging via geoR. The differences between geoR and RFsp, however, are:

- RF requires no transformation, i.e. it works equally well with skewed and normally distributed variables; in general RF makes far fewer statistical assumptions than model-based geostatistics;
- the RF prediction error variance on average shows somewhat stronger contrast than the OK variance map, i.e. it emphasizes isolated, less probable local points much more than geoR;
- RFsp is significantly more computationally intensive, as distances need to be derived from every sampling point to all new prediction locations;
- geoR uses global model parameters, and as such its prediction patterns are relatively uniform; RFsp, on the other hand (being tree-based), will produce patterns that match the data as closely as possible.
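The buffer distances that RFsp uses as covariates can be illustrated without any spatial package. A minimal base-R sketch with hypothetical coordinates (Euclidean distance; the real tutorial derives these from the meuse grid):

```r
# Observation points (hypothetical coordinates) and a small prediction grid
obs  <- cbind(x = c(0, 2, 5), y = c(0, 3, 1))
grid <- expand.grid(x = 0:5, y = 0:3)

# Buffer distance covariates: one column per observation point, each
# holding the Euclidean distance from every grid cell to that point
dist.cov <- sapply(1:nrow(obs), function(i) {
  sqrt((grid$x - obs[i, "x"])^2 + (grid$y - obs[i, "y"])^2)
})
colnames(dist.cov) <- paste0("layer.", 1:nrow(obs))

dim(dist.cov)  # 24 prediction locations x 3 distance covariates
```

These columns are then simply appended to the regression matrix, which is how geographical proximity enters the random forest model.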

Spatial prediction of binomial variable
RFsp can also be used to predict, i.e. map the distribution of, binomial variables, i.e. variables with only two states (TRUE or FALSE). In model-based geostatistics the equivalent methods are indicator kriging and similar. Consider for example soil type 1 from the meuse data set:

```r
meuse@data = cbind(meuse@data, data.frame(model.matrix(~soil-1, meuse@data)))
summary(as.factor(meuse$soil1))
##  0  1
## 58 97
```

In this case class soil1 is the dominant soil type in the area. To produce a map of soil1 using RFsp we now have two options. Option 1 is to treat the binomial variable as a numeric variable with 0/1 values (thus a regression problem); the fitted model shows that it explains 75% of the variability in the soil1 values. We set quantreg=TRUE so that we can also derive lower and upper prediction intervals following the quantile regression random forest approach (Meinshausen, 2006).
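The `model.matrix(~soil-1, ...)` call above expands the soil factor into per-class 0/1 indicator columns; the `-1` drops the intercept so that every class gets its own column. A small standalone illustration with made-up class labels:

```r
# Hypothetical factor with three soil classes
soil <- factor(c("1", "1", "2", "3", "2"))

# Expand into indicator (one-hot) columns, one per class
ind <- model.matrix(~soil-1)

colnames(ind)            # "soil1" "soil2" "soil3"
colSums(ind)             # number of observations per class
as.numeric(ind[, "soil1"])  # 1 1 0 0 0, the 0/1 target for the regression option
```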
In the case of Option 2, we treat the binomial variable as a factor variable, which shows that the Out-of-Bag prediction error (classification error) is only 0.06 (on the probability scale). Note that it is not easy to compare the regression and classification OOB errors, as these are conceptually different. Also note that we turn on keep.inbag = TRUE so that ranger can estimate the classification errors using the Jackknife-after-Bootstrap method (Wager, Hastie, & Efron, 2014). quantreg=TRUE would obviously not work here, since this is a classification and not a regression problem.
To produce predictions using the two options we use:

```r
pred.regr <- predict(m1.s1, cbind(meuse.grid@data, grid.dist0@data), type="response")
pred.clas <- predict(m2.s1, cbind(meuse.grid@data, grid.dist0@data), type="se")
```

In principle, the two options for predicting the distribution of a binomial variable are mathematically equivalent and should lead to the same predictions (as also shown in the map below). In practice there can be small numerical differences due to rounding or random start effects.
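The equivalence of the two options comes from how a forest aggregates its trees: averaging 0/1 responses (regression) gives the same number as the fraction of trees voting for class 1 (classification probability). A tiny numeric illustration with toy votes (not actual ranger output):

```r
# Toy "votes" from 10 hypothetical trees for a single prediction location
votes <- c(1, 1, 0, 1, 1, 0, 1, 1, 1, 0)

# Option 1 (regression on 0/1): the prediction is the mean tree response
pred.regr <- mean(votes)

# Option 2 (classification): the predicted probability of class "1"
pred.clas <- sum(votes == 1) / length(votes)

pred.regr == pred.clas  # TRUE, the two formulations coincide
```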

Spatial prediction of categorical variable
Spatial prediction of a categorical variable using ranger is a classification problem. The target variable contains multiple states (3 in this case), but the model still follows the same formulation:

```r
fm.s = as.formula(paste("soil ~ ", paste(names(grid.dist0), collapse="+"), " + SW_occurrence + dist"))
fm.s
```

The fitted model reports an OOB prediction error of 10.32%, which shows that the classification or mapping accuracy for hard classes is about 90%. We can produce predictions of probabilities per class; the resulting regression matrix starts with:

```r
## 'data.frame': 3103 obs. of 7 variables:
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
```

where pred_soil1 is the probability of occurrence of class 1 and se_soil1 is the standard error of prediction for pred_soil1 based on the Jackknife-after-Bootstrap method (Wager et al., 2014). The first column in pred.grids contains the existing map of soil with hard classes only.
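From the per-class probabilities a map of hard classes can be derived by taking, at each location, the class with the highest probability. A minimal base-R sketch with hypothetical probabilities (the real values would come from the ranger predictions above):

```r
# Hypothetical per-class probabilities for three prediction locations
probs <- rbind(c(0.8, 0.1, 0.1),
               c(0.2, 0.5, 0.3),
               c(0.1, 0.2, 0.7))
colnames(probs) <- c("soil1", "soil2", "soil3")

# Sanity check: probabilities at each location must sum to 1
stopifnot(all(abs(rowSums(probs) - 1) < 1e-9))

# Hard class = column with the maximum probability per row
hard <- colnames(probs)[max.col(probs)]
hard  # "soil1" "soil2" "soil3"
```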

Figure: Predictions of soil types for the meuse data set based on the RFsp: (above) probability for three soil classes, and (below) derived standard errors per class.
In summary, spatial prediction of binomial and factor-type variables is straightforward with ranger: buffer distances and spatial autocorrelation can be incorporated at once. Compare this with geostatistical packages, where GLMs with a logit link function and/or indicator kriging would need to be used, and where variograms have to be fitted per class.

Spatial prediction of variables with extreme values
At the Spatial Interpolation Comparison exercise 2004 (G. Dubois, 2005), participants were asked to predict gamma dose rates for Germany at 800 validation points using models fitted with 200 training points. One of the data sets (the 'joker' data set) contained 2 training points with very high values. Modeling such a variable with conventional geostatistics is cumbersome, especially within a fully automated geostatistical interpolation framework such as the one implemented in the intamap package, where interpolate is a fully automated framework for spatial predictions that selects from 5-6 state-of-the-art methods (Pebesma et al., 2011). The resulting error at validation points seems to be relatively high, which is probably due to the choice of transformation and/or variogram model:

```r
sd(sic.test$joker - pred.sic2004$predictions$mean)
```
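Note that sd() of the residuals, as used above, measures only their spread; RMSE additionally penalizes any systematic bias. A base-R sketch with made-up observed and predicted values (not the actual SIC 2004 numbers) showing the decomposition:

```r
# Hypothetical observed and predicted values at a few validation points;
# one extreme "joker"-style value dominates the residuals
obs  <- c(10, 12, 500, 9, 11)
pred <- c(12, 14, 120, 11, 13)
res  <- obs - pred

rmse <- sqrt(mean(res^2))

# RMSE^2 decomposes into squared bias plus the (population) variance
# of the residuals, so a large mean residual indicates biased predictions
bias <- mean(res)
pop.var <- mean((res - bias)^2)
abs(rmse^2 - (bias^2 + pop.var)) < 1e-9  # TRUE
```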
In summary, RFsp has the potential to produce maps also for variables with extreme values, i.e. very skewed distributions, but this does require that some parameters of RFsp (e.g. mtry) are carefully fine-tuned.

Weighted RFsp
In many cases training data sets (points) come with variable measurement errors or have been collected with a sampling bias. The ranger package allows incorporating data quality into the modeling process via the case.weights argument: observations with larger weights will be selected with higher probability in the bootstrap, so that the output model will be (correctly) more influenced by observations with higher weights. Consider for example a soil point data set prepared as a combination of NCSS and NASIS points. We focus here on mapping CLYPPT, i.e. clay content in percent, for which we also would like to use the quality column CLYPPT.sd (the standard deviation of the measurement error). The number of NASIS points is of course much higher (ca. 5x) than the number of NCSS points, but the NCSS points contain about 3x more accurately estimated clay content.
Next we load the covariate layers (49), overlay points and grids and prepare a regression matrix:

```r
carson$DEPTH.f = ifelse(is.na(carson$DEPTH), 20, carson$DEPTH)
carson1km <- readRDS("./RF_vs_kriging/data/NRCS/carson_covs1km.rds")
coordinates(carson) <- ~X+Y
proj4string(carson) = carson1km@proj4string
rm.carson <- cbind(as.data.frame(carson), over(carson["CLYPPT"], carson1km))
fm.clay <- as.formula(paste("CLYPPT ~ DEPTH.f + ", paste(names(carson1km), collapse = "+")))
```

Note that, because many soil properties are measured at multiple depths, we fit here a 3D spatial prediction model that also takes DEPTH into account. This is in fact a rather large data set, which we can subset to e.g. 2000 point pairs to speed up computing. In this case we used the inverse measurement variance as case.weights, so that points measured in the lab receive much higher weights. The final output map below shows that, in this specific case, the model without weights seems to predict somewhat higher values, especially in the extrapolation areas. This indicates that using measurement errors in model calibration is important and one should not avoid specifying them, especially if the training data is significantly heterogeneous.
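The effect of inverse-variance case weights on the bootstrap can be illustrated directly. A base-R sketch with hypothetical measurement-error standard deviations (small sd standing in for lab-measured NCSS points, large sd for NASIS field estimates):

```r
# Hypothetical measurement-error standard deviations for five points:
# sd = 1 for lab measurements, sd = 3 for field estimates
CLYPPT.sd <- c(1, 1, 3, 3, 3)

# Inverse-variance case weights, as would be passed to ranger(case.weights=...)
w <- 1 / (CLYPPT.sd^2)

# Relative probability of each point being drawn into a bootstrap sample
round(w / sum(w), 3)

# A lab-measured point is 9x more likely to be selected than a field estimate
w[1] / w[3]  # 9
```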

Figure: RF predictions and predictions errors for clay content with and without using measurement errors as weights. Study area around the Lake Tahoe, California USA. Point data sources: National Cooperative Soil Survey (NCSS) Characterization
Database and National Soil Information System (NASIS).
In summary, Random Forest seems to be suitable for generating spatial and spatiotemporal predictions once geographical distances are added to the model (see also Behrens et al., n.d.). Computing time, however, can be cumbersome, and working with data sets with >>1000 point locations (hence >>1000 buffer distance maps) is probably not yet recommended.
Also, cross-validation of the accuracy of predictions produced using RFsp needs to be implemented using leave-location-out CV to account for spatial autocorrelation in the data. The key to the success of the RFsp framework might be the training data quality, especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in the data) and the quality of model validation (to ensure that accuracy is not affected by overfitting). For all other details please refer to our paper.