kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation

Linnenbrink, Jan; Milà, Carles; Ludwig, Marvin; Meyer, Hanna

doi:https://doi.org/10.5194/egusphere-2023-1308

Preprints

05 Jul 2023

Abstract. Random and spatial Cross-Validation (CV) methods are commonly used to evaluate machine learning-based spatial prediction models, and the obtained performance values are often interpreted as map accuracy estimates. However, the appropriateness of such approaches is currently the subject of controversy. For the common case where no probability sample for validation purposes is available, in Milà et al. (2022) we proposed the Nearest Neighbour Distance Matching (NNDM) Leave-One-Out (LOO) CV method. This method produces a distribution of geographical Nearest Neighbour Distances (NND) between test and train locations during CV that matches the distribution of NND between prediction and training locations. Hence, it creates predictive conditions during CV that are comparable to what is required when predicting a defined area. Although NNDM LOO CV produced largely reliable map accuracy estimates in our analysis, as a LOO-based method, it cannot be applied to large datasets found in many studies.

Here, we propose a novel k-fold CV strategy for map accuracy estimation inspired by the concepts of NNDM LOO CV: the k-fold NNDM (kNNDM) CV. The kNNDM algorithm tries to find a k-fold configuration such that the Empirical Cumulative Distribution Function (ECDF) of NND between test and train locations during CV is matched to the ECDF of NND between prediction and training locations.
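The matching idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cv_nnd`, `prediction_nnd`, `ecdf_mismatch`) are hypothetical, distances are plain Euclidean, and the mismatch between the two ECDFs is measured here as the area between them (the 1-D Wasserstein distance), which is an assumption made for illustration. The sketch only scores two given fold assignments; it does not search for one.

```python
import math
from bisect import bisect_right


def nearest_dist(p, others):
    """Euclidean distance from point p to its nearest neighbour in others."""
    return min(math.dist(p, q) for q in others)


def cv_nnd(train_pts, folds):
    """Test-to-train NND during CV: for each point, the distance to the
    nearest training point that belongs to a different fold."""
    out = []
    for i, p in enumerate(train_pts):
        others = [q for j, q in enumerate(train_pts) if folds[j] != folds[i]]
        out.append(nearest_dist(p, others))
    return out


def prediction_nnd(pred_pts, train_pts):
    """Prediction-to-train NND: distance from each prediction location to
    its nearest training point."""
    return [nearest_dist(p, train_pts) for p in pred_pts]


def ecdf_mismatch(a, b):
    """Area between the ECDFs of samples a and b (assumed mismatch measure)."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    area = 0.0
    for lo, hi in zip(grid, grid[1:]):
        fa = bisect_right(a, lo) / len(a)  # ECDF of a on [lo, hi)
        fb = bisect_right(b, lo) / len(b)  # ECDF of b on [lo, hi)
        area += abs(fa - fb) * (hi - lo)
    return area


# Toy data (hypothetical): two clusters of training points, and prediction
# locations that lie far from both clusters.
train = [(0, 0), (1, 0), (10, 0), (11, 0)]
pred = [(20, 0), (-9, 0)]

random_folds = [0, 1, 0, 1]     # folds mix the two clusters
clustered_folds = [0, 0, 1, 1]  # one fold per cluster

target = prediction_nnd(pred, train)
print(ecdf_mismatch(cv_nnd(train, random_folds), target))     # 8.0
print(ecdf_mismatch(cv_nnd(train, clustered_folds), target))  # 0.5
```

In this toy case the clustered folds reproduce the long prediction distances much better than the mixed folds, so under the assumed measure they would be the preferred configuration.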

We tested kNNDM CV in a simulation study with different sampling distributions and compared it to other CV methods including NNDM LOO CV. We found that kNNDM CV performed similarly to NNDM LOO CV and produced reasonably reliable map accuracy estimates across sampling patterns with strong reductions in computation time for large sample sizes. Furthermore, we found a positive linear association between the quality of the match of the two ECDFs in kNNDM and the reliability of the map accuracy estimates.

kNNDM provided the advantages of our original NNDM LOO CV strategy while bypassing its sample size limitations.

Received: 14 Jun 2023 – Discussion started: 05 Jul 2023

Status: final response (author comments only)

CC1: 'Comment on egusphere-2023-1308', Nils Tjaden, 07 Jul 2023

Just a quick hint that https://doi.org/10.1016/j.jag.2023.103364 was just published - may or may not be relevant for your discussion.

Citation: https://doi.org/10.5194/egusphere-2023-1308-CC1
RC1: 'Comment on egusphere-2023-1308', Italo Goncalves, 23 Aug 2023

The manuscript presents a much-needed methodology for cross-validation of spatial data. In my opinion, the strongest point is the use of the W statistic to identify the best CV split. However, there are a few points which I feel should be addressed in the discussion.

The proposed methodology using clustering algorithms seems valid, but how can we know if it provides the best possible result? An algorithm that optimizes the W statistic directly as a function of the CV fold indices would be more desirable, instead of relying on the clustering algorithm's internal metric as a proxy. As a suggestion for future work, I recommend using a genetic algorithm to assign CV indices to the data points directly.

The W statistic explained 60% of the variability in map accuracy, but would this be consistent across different datasets? At least one more case study would be needed to verify this.

Minor comment: Line 90: cross out “the”.

Citation: https://doi.org/10.5194/egusphere-2023-1308-RC1
RC2: 'Comment on egusphere-2023-1308', Anonymous Referee #2, 23 Aug 2023

The study proposes a novel cross-validation method for spatial data that aims to deliver more representative measurements of spatial map accuracy than commonly-used methods. This is a relevant concern for GMD readers with the rise in use of machine learning methods for geoscientific modelling. Issues with model evaluation in the spatial setting have been identified in a number of recent studies. The paper is well-written and contributes a practical solution for a common issue.
In my opinion, the most exciting/innovative idea in this work is the concept of defining the evaluation method based on the desired data for which the model is intended to return predictions. This would require researchers to more carefully define the purpose of their models before and during the model creation process, which should be common practice. In reality, this is often not done, or done in a ‘standard’ way which doesn’t accurately reflect the intended use of the model.
The method presented in this paper is a very practical solution to this, where the desired target dataset is an input of the evaluation algorithm and therefore researchers are required to clearly consider and define it. I think this is a significant contribution to model development methodology and should be more clearly emphasised in the manuscript. The possibilities, benefits and disadvantages of this concept could also be discussed - for example, when models are used in production, the prediction area is a moving target; would that require continual re-evaluation?
The paper suggests that kNNDM is, essentially, a computationally-cheaper alternative to the previously-published method by the authors, NNDM LOO. In the article, the only limitation of leave-one-out CV methods described is that of computational time. However, to my knowledge, even if computation is not considered, LOO CV methods may not be the optimum method due to higher variation in the resulting models (due to the bias-variance tradeoff). Could this explain why kNNDM 10-fold seems to perform better in the case of strong clusters (Figure 5)? For me, this would be more convincing than the computation speedup comparison, which is relatively trivial given that LOO CV is the most extreme version of k-fold CV.
Following on from this, it seems likely that the value of k would impact the results. Use of 10 folds is very common; is there theoretical justification for this? It would be useful to see some comparisons of the results with multiple values of k.
In Figure 1, it is shown that the W statistic will also be larger if training points are regularly distributed, as well as when clustered. Does this mean that the null hypothesis might be rejected for regularly distributed datapoints? Does this explain why NNDM LOO performed better for regularly distributed data (Figure 5)?
Finally, I would recommend testing the method on at least one additional dataset, as the results presumably depend on the spatial autocorrelation present in the dataset used.
Minor comment: I assume the hyperparameters of the models are not tuned as it is not mentioned, but this could be stated explicitly.

Citation: https://doi.org/10.5194/egusphere-2023-1308-RC2
AC1: 'Comment on egusphere-2023-1308', Jan Linnenbrink, 19 Oct 2023

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1308/egusphere-2023-1308-AC1-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2023-1308-AC1

Model code and software

kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation. Jan Linnenbrink, Carles Milà, Marvin Ludwig, and Hanna Meyer. https://doi.org/10.6084/m9.figshare.23514135.v1

Viewed

Total article views: 1,060 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
782	244	34	1,060	33	22

Views and downloads (calculated since 05 Jul 2023)

Month	HTML	PDF	XML	Total
Jul 2023	247	73	10	330
Aug 2023	96	30	4	130
Sep 2023	49	24	1	74
Oct 2023	61	21	6	88
Nov 2023	23	12	1	36
Dec 2023	44	12	4	60
Jan 2024	95	14	2	111
Feb 2024	43	28	2	73
Mar 2024	64	16	0	80
Apr 2024	60	14	4	78

Viewed (geographical distribution)

Total article views: 1,060 (including HTML, PDF, and XML), thereof 1,060 with geography defined and 0 with unknown origin.

Latest update: 27 Apr 2024

Short summary

Estimation of map accuracy based on Cross-Validation (CV) in spatial modeling is pervasive but controversial. Here, we build upon our previous work and propose a novel, prediction-oriented k-fold CV strategy for map accuracy estimation in which the distribution of geographical distances between prediction and training points is taken into account when constructing the CV folds. Our method produces more reliable estimates than other CV methods and can be used for large datasets.

