Investigating Spatial Error Structures in Continuous Raster Data

The objective of this study is to investigate spatial structures of error in the assessment of continuous raster data. The use of conventional diagnostics of error often overlooks the possible spatial variation in error because such diagnostics report only average error or deviation between predicted and reference values. In this respect, this work uses a moving window (kernel) approach to generate geographically weighted (GW) versions of the mean signed deviation, the mean absolute error and the root mean squared error and to quantify their spatial variations. Such approach computes local error diagnostics from data weighted by its distance to the centre of a moving kernel and allows to map spatial surfaces of each type of error. In addition, a GW correlation analysis between predicted and reference values provides an alternative view of local error. Full abstract can be found in the pdf.


Introduction
All spatial data are subject to error. Remotely sensed (RS) imagery routinely contains sensor-related errors, atmospheric effects, and geometric errors. Environmental datasets that describe landscape features and properties from RS products (e.g. forest aboveground biomass, species distribution, and climate change scenarios) inherently contain prediction errors. Errors can manifest themselves as systematic deviations and/or noise which require careful assessment in order to avoid mis-interpretations of the data, to support reliable conclusions and to make informed decisions (Daly, 2006;Foody, 2002). Error assessments provide a guide to data quality and reliability (Foody, 2002) and can provide earth observation (EO) scientists with an understanding of the sources of error both in RS imagery and products (Liu et al., 2007;Stehman and Czaplewski, 1998). However, conventional summary measures of error do not take any spatial information (e.g. spatial heterogeneity) of error into account (Foody, 2005(Foody, , 2002. Spatially explicit approach for the assessment is hence important. In EO studies, spatial extensions of conventional diagnostics of error or accuracy have been demonstrated for categorical raster data, such as land cover classification data (Comber et al., 2017(Comber et al., , 2012Comber, 2013;Congalton, 1988;Foody, 2005). These approaches spatially extend the usual method of estimating and reporting accuracy through a confusion matrix, which is the cross-tabulation of predicted and reference classes to generate measures of user's and producer's accuracy that correspond to commission and omission errors, respectively, along with an overall accuracy (Congalton, 1991;Stehman and Czaplewski, 1998). Specifically, Comber (2013) demonstrated the use of a geographically weighted (GW) approach to generate spatial surfaces of these measures. The GW approach calculates a series of local diagnostics of accuracy, using data weighted by their distance to the centre of a moving window or kernel to explore spatial heterogeneity (Gollini et al., 2015). This has been used to compare global land cover datasets , to assess the consistency of such classification over time (Tsutsumida and Comber, 2015), and to construct hybrid global land cover datasets from multiple inputs (See et al., 2015). Comber et al. (2017) proposed GW confusion matrices for further generic applications. The GW framework itself Gollini et al., 2015;Lu et al., 2014) has been widely adopted across many scientific disciplines (e.g. Geography, Ecology, Health), where GW regression (Brunsdon et al., 1996) is the most popular GW model.
The developments of spatially explicit approaches for error assessment in continuous raster data in the EO domain have been limited. Comber et al. (2012) proposed a fuzzy GW difference analysis which estimates absolute deviations between the predicted and reference fuzzy membership, essentially applying a fuzzy generalization of the categorical accuracy measures. Khatami et al. (2017) proposed a spatial interpolation approach for soft classification maps in which a linear kernel function was applied to interpolate spatial deviations between predicted and reference proportions, with a focus on weight of spectral or class proportion as a soft classification measure. Willmott and Matsuura (2006) described maps of cross-validation error. Continuous raster data are commonly assessed using mean signed deviation (msd), mean absolute error (mae), root mean square error (rmse) and Pearson's correlation coefficient (r). Accurate predictions are reflected by msd, mae and rmse to be zero, coupled with r to be one. Although these conventional diagnostics are useful in reporting error, each of them provides an overall, global or 'whole map' measure only. In this respect, Harris and Juggins (2011) demonstrated GW r for assessing UK freshwater acidification prediction accuracy. Harris et al. (2013) demonstrated GW mae for UK freshwater acidification and London house price prediction accuracy, as separate case studies. Monteys et al. (2015) demonstrated GW r for assessing water depth prediction accuracy in Irish coastal waters. These studies either directly extend GW summary statistics (e.g. GW averages, GW variances) as first proposed by Brunsdon et al. (2002), or directly use GW r , but in a model accuracy context. Further advances of GW summary statistics can be found in Harris and Brunsdon (2010) and Harris et al. (2014). However, the previous studies have only reported spatial error briefly as part of a suite of diagnostics. That is, spatial extensions of conventional diagnostics of error for continuous raster data have not been described in a comprehensive way, specifically in an EO context. Here we demonstrate the linked use of all four diagnostics, msd, mae, rmse and r, through their GW msd, GW mae, GW rmse and GW r counterparts and advance them through the application of Monte Carlo permutation tests to identify unusual clusters of error applied to two EO case studies. The first case study evaluates datasets of the fractional impervious surface area (%ISA) with the aim of investigating spatial structures of error in multiple predictions by four different models.
The second case study evaluate four different forest aboveground biomass (AGB) datasets in order to compare spatial structures of error in multiple independent datasets.

Study 1
In order to explore how spatial structures of error can differ according to different models, four independent predictions of %ISA in the Jakarta Metropolitan Area (JMA), Indonesia, for 2012 were produced. The %ISA was inferred from the enhanced vegetation index (EVI) stored in moderate resolution imaging spectroradiometer (MODIS) MOD13Q1 product, which are 16-days composite RS imagery with a 231 m spatial resolution. Annual minimum, mean, maximum, and standard deviation of EVI were calculated on a pixel by pixel basis from the 24 images in 2012. These data were classified and assessed using training and reference (validation) samples collected at 984 randomly selected grid squares of the same size and at the same locations as the MODIS MOD13Q1 product. The %ISA was visually interpreted from fine resolution images in available Google Earth from the same year (Comber et al., 2016;Tsutsumida et al., 2016;Tsutsumida and Comber, 2015). When fine resolution images were not available at a sampling grid in 2012, %ISA were interpolated from images dated before and after the year 2012, only if the %ISA is stable over the period (in most cases, %ISA is zero). It is a reasonable approach because impervious surfaces do not change frequently. The reference values of %ISA were interpreted twice to minimize human error. The sample grids were randomly divided into training (n = 434) and reference data (n = 550) as shown in Figure 1. Four different models were implemented to predict %ISA in the JMA: Logistic regression, Maximum Entropy (MaxEnt), Random Forest (RF) regression, and class probability of the RF classifier (hereafter RF probability). All four models return a continuous classification value between 0-100%. Logistic regression is a parametric generalized linear model for response data following a binomial distribution. The outcomes are within the range between 0 and 1 (rescaled to 0-100%). MaxEnt is a non-parametric model, which naturally extends from logistic regression (Phillips and Dudík, 2008). MaxEnt returns the probability of presence from presence-only training data (i.e. without labelled "absent" data), resulting %ISA predictions. RF regression and RF probability are machine learning techniques using ensemble logistic trees (Breiman, 2001). For RF regression, each tree is constructed by bootstrapped random sampling so that random sample selection leads to a weak correlation between trees. For RF probability, each tree votes for the most popular class and a random sample selection to grow trees is used to minimize the classification error.
Due to its voting system, RF produces a probability of class presence, predicting %ISA. The %ISA predictions of these four models are different and clearly vary spatially ( Figure 2).
Note that apparent water surfaces are masked by a MODIS MOD44W product which represents the water surface in the same spatial scale of the MOD13Q1. Thus, submerged areas (e.g., those found in the North-East edge of the JMA) are excluded in this analysis.

Study 2
In order to explore how spatial structures of error can differ according to available different datasets, four AGB spatial datasets for the Campeche, Yucatan, and Quintana Roo

Methods
The GW versions of msd, mae, rmse, and r are described as follows. At any location , GW (3) , where 3 and 3 are the reference and predicted values at sample location , respectively, *3 weights controlled by a distance-decay kernel function (Equation 8) with respect to location and , and is the total number of sample data points. Observe that this always holds, msd ≤ mae ≤ rmse (Willmott and Matsuura, 2005) and their GW counterparts have the same characteristics.
A GW r at any location , is found using: . For both case studies, the weights *3 are found using a bi-square kernel as follows: , where *3 is the Euclidean distance between locations and , and the kernel bandwidth is specified either as a fixed distance or an adaptive distance which includes a fixed number of data points for the local diagnostic calculation. In this study, an adaptive kernel was used as it suits the reference points of both case studies were not distributed uniformly. Its size was arbitrarily defined as 10% of nearby data to location i. The validity of this subjective bandwidth size is discussed in detail in section 5.
Observe that the chosen diagnostics complement each other: measures of msd, mae and rmse and their GW counterparts, all summarize the error in some manner, whilst r and GW r measure specifically the slope of the linear relationship between the predicted and reference values. Furthermore, r and GW r are scale invariant meaning that they cannot capture a consistent and uniform over-or under-prediction bias. outcomes and ascertaining where the single, actual outcome lies). In this instance, the randomization hypothesis is that any pattern seen in the error occurs by chance and therefore any permutation of the error is equally likely. For GW r, the arguments are analogous, but where the investigation centers on the correlation between the predicted and reference values, rather than some summary of the error. In all instances, the permutation test should be viewed as informal and conditional on the GW diagnostic specification (i.e. bandwidth size, kernel type, etc.). Thus, throughout this study, the term 'significance' is used in an informal manner also, for this test.
In addition to calculating the global diagnostics of msd, mae, rmse and r, estimates and pvalues for the significance of the Moran's I of the deviation between predicted and reference values were calculated. These provide useful context and global information about spatial autocorrelation in the error. Weights were generated using an inverse distance squared function for the Moran's I calculations. Nevertheless, no local spatial information about the errors is reported in Table 2. Next, the spatial structure of the errors resulting from the %ISA predictions were explored using the three GW error diagnostics, together with GW r between the predicted and reference values as shown in Figure 5. The GW mae and GW rmse maps show that peri-urban areas (surrounding the city core) tend to have larger mae/rmse values than others, suggesting the difficulty in predicting %ISA in complex urban frontiers between urban/non-urban areas. 'Significant' local clusters differ according to the models, but they tend to be distributed along such urban frontiers.

Study 1
The results for GW r show South-Western and South-Eastern areas have consistently weak negative correlations in all four models, and the permutation tests indicate that such correlations are 'significantly' unusual for all models. As GW r represents spatial variation in the slope of the linear relationship between the predicted and reference values, maps of GW r can behave differently from those of the other three GW diagnostics which relate to error. Here only the GW mae and GW rmse maps show similarities to each other as expected (see section 5). As would also be expected from the results of Table 2, RF regression tends to provide the best local accuracy in most areas, but with clear spatial variation in this accuracy.

Study 2
Conventional diagnostics for the four AGB datasets are shown in Table 3. Rodríguez-Veiga's dataset clearly provides the best accuracy amongst the four AGB datasets, as is evident from the closeness to zero of msd (-2.05), the smallest mae (31.52), the smallest rmse   where 'significant' negative correlations are of concern. Such clusters occur in quite different areas to the cluster observed in south for unusually large GW mae, and GW rmse values. Similar to the first case study, GW r provides an alternative assessment of local error to GW mae and GW rmse. A possible explanation for this, is that GW r can be sensitive to bandwidth size. For example, a few anomalous pairs of predicted and reference data points that fall close to the kernel centre can exert an undue influence on the correlation estimate (see section 5 In summary, mapping GW diagnostics provides useful spatial indications of the reliability of each dataset, not only individually, but also in comparison with each other. Despite all four datasets depicting the same AGB measure, the spatial patterns of error and accuracy vary in each dataset. Rodríguez-Veiga's dataset would be the best choice in terms of the conventional diagnostics (Table 3), but not necessarily the best choice everywhere, for example in central-Eastern areas, where Saatchii's dataset may be more accurate and preferred. Figure 6. Spatial distributions of GW msd, GW mae, GW rmse, and GW for four forest aboveground biomass datasets of Rodríguez-Veiga et al. (2016), Baccini et al. (2011), Saatchi et al. (2011), and Hu et al. (2016

Discussion
The use of GW diagnostics has allowed investigations of the spatial structure of error between predicted and reference values for two EO case studies. This approach extends conventional (single-valued) whole map diagnostics of error spatially, through their localized (multiple-valued) counterparts. The associated permutation tests can highlight unusually accurate or unusually inaccurate error, providing a means to focus EO or other research activity on specific areas. This work is novel, but a number of points warrant discussion.

The effects of sample information
In this study, the use of the same reference data to evaluate the GW diagnostics of different datasets ensures results are comparable. However, an independent reference sample is not always available. This is a limitation for any error assessment: any results are only ever relative to the reference sample. For case study 2, Rodríguez-Veiga's dataset yielded the most accurate AGB performance amongst the four AGB datasets in most parts of the study area. The reference sample used here, despite being independent from the training data used for the Rodríguez-Veiga's dataset, originated from the same source (i.e. in-situ INFyS data), whilst the other three datasets used a completely different training dataset (i.e. GLAS footprints). Additionally, Rodríguez-Veiga's dataset is at a 250 m spatial resolution which is closer to the size of the reference data than the other datasets with spatial resolutions of 500 m or 1 km. Such characteristics need to be accounted for when comparing continuous raster datasets.

Bandwidth specification
In this work, a user-specified bandwidth of 10% was used for all outputs. This in part, reflected the need to use only one bandwidth throughout, so that multiple datasets could objectively be compared. However, in any GW approaches, the selection of the bandwidth is critical, to identify which levels of spatial heterogeneity should be focused. For example, Figure 7 shows the results of generating GW mae for Saatchi's dataset in case study 2, with a range of adaptive bandwidth sizes (5% to 50% in increments of 5%  (Gollini et al., 2015;Harris et al., 2014), their use commonly results in one 'best on average' bandwidth choice that maximizes the precision of the predictor or statistic (e.g. via a leave-one-out cross-validation). Such data-driven procedures should not be regarded as a panacea for bandwidth selection or the degree of smoothing to use (Ruppert et al., 1995).

Figure 7. Comparison of results of GW mae for Saatchi's aboveground biomass dataset with
different adaptive kernel size between 5% to 50% as an example.

The difference between mae and rmse
It is important to acknowledge the difference between mae and rmse, which is often overlooked. The mae represents average error magnitude (averaged absolute error) (Willmott and Matsuura, 2005), and rmse reflects the mean and variation in the error and is therefore highly sensitive to outliers (Pontius et al., 2008;Willmott and Matsuura, 2005). In this sense, mapping GW mae captures the spatial variation of the average error magnitude, whilst GW rmse highlights larger errors compared to GW mae. This is originated from the fact that the mean squared error, which is the squared rmse, is composed of the squared msd and the variance (Friedman, 1997). Because of this characteristic, rmse has no clear interpretation, unlike mae. A GW-based extension of such discussions would be an interesting topic of future work.

Conclusions
Conventional diagnostics of error, such as msd, mae and rmse provide global, 'on-average' measures. These summary measures of error do not capture any spatial information of the error. Ignoring spatial structures in error may result in a false interpretation and misuse of the data that the errors stem from. This work develops and applies localized diagnostics of error to investigate spatial heterogeneity of each types of these diagnostics.  (Gollini et al., 2015;Lu et al., 2014). These adapted functions are available to interested researchers on request. Options for GW msd and GW r and their tests are already provided in the same functions of the GWmodel.