Predicting into unknown space? Estimating the area of applicability of spatial prediction models

Bibliographic Details
Published in: Methods in Ecology and Evolution, Vol. 12, No. 9, pp. 1620–1633
Main Authors: Meyer, Hanna; Pebesma, Edzer
Format: Journal Article
Language: English
Published: London: John Wiley & Sons, Inc., 01-09-2021
Description
Summary: Machine learning algorithms have become very popular for spatial mapping of the environment due to their ability to fit nonlinear and complex relationships. However, this ability comes with the disadvantage that they can only be applied to new data if these are similar to the training data. Since spatial mapping requires predictions into new geographic space, which in many cases involves new predictor properties, a method is required to assess the area to which a prediction model can be reliably applied. Here, we suggest a methodology that delineates the ‘area of applicability’ (AOA), which we define as the area where we enabled the model to learn about relationships based on the training data, and where the estimated cross‐validation performance holds. We first propose a ‘dissimilarity index’ (DI) that is based on the minimum distance to the training data in the multidimensional predictor space, with predictors being weighted by their respective importance in the model. The AOA is then derived by applying a threshold, which is the (outlier‐removed) maximum DI of the training data derived via cross‐validation. We further use the relationship between the DI and the cross‐validation performance to map the estimated performance of predictions. We illustrate the approach in a simulated case study chosen to mimic ecological realities and test its credibility using a large set of simulated data. The simulation studies showed that the prediction error within the AOA is comparable to the cross‐validation error of the trained model, while the cross‐validation error does not apply outside the AOA. This holds for models trained with randomly distributed training data, as well as when training data are clustered in space and spatial cross‐validation is applied. Using the relationship between DI and cross‐validation performance showed potential to limit predictions to the area where a user‐defined performance applies.
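The DI and AOA procedure described in the summary can be sketched as follows. This is a minimal illustrative NumPy sketch under simplifying assumptions (Euclidean distance on standardized, importance-weighted predictors; the standard boxplot rule for outlier removal); the function names are hypothetical, and this is not a substitute for the authors' own implementation.

```python
import numpy as np

def dissimilarity_index(train, new, weights):
    """DI of each new point: weighted minimum distance to the training data
    in predictor space, normalized by the mean pairwise training distance.

    train: (n, p) training predictors; new: (m, p) predictors at prediction
    locations; weights: (p,) variable importances (illustrative scaling).
    """
    mu, sd = train.mean(axis=0), train.std(axis=0)
    tw = (train - mu) / sd * weights          # standardize, then weight
    nw = (new - mu) / sd * weights
    # mean pairwise distance among training points (normalization constant)
    d_tt = np.linalg.norm(tw[:, None, :] - tw[None, :, :], axis=-1)
    d_bar = d_tt[np.triu_indices(len(tw), k=1)].mean()
    # minimum distance of each new point to any training point
    d_min = np.linalg.norm(nw[:, None, :] - tw[None, :, :], axis=-1).min(axis=1)
    return d_min / d_bar

def aoa_threshold(train, weights, folds):
    """AOA threshold: outlier-removed maximum DI of the training data, where
    each point's DI uses only training points outside its own CV fold."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    tw = (train - mu) / sd * weights
    d = np.linalg.norm(tw[:, None, :] - tw[None, :, :], axis=-1)
    d_bar = d[np.triu_indices(len(tw), k=1)].mean()
    folds = np.asarray(folds)
    di = np.array([d[i][folds != folds[i]].min()
                   for i in range(len(tw))]) / d_bar
    # boxplot rule for outlier removal (assumed upper-whisker convention)
    q1, q3 = np.percentile(di, [25, 75])
    return di[di <= q3 + 1.5 * (q3 - q1)].max()
```

A prediction location would then be considered inside the AOA when its DI is at or below the threshold; predictions at locations above it would be flagged or withheld.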
We suggest adding the AOA computation to the modeller's standard toolkit and presenting predictions for the AOA only. We further suggest reporting a map of DI‐dependent performance estimates alongside prediction maps, complementary to (cross‐)validation performance measures and the common uncertainty estimates.
Bibliography: Handling Editor: Robert Freckleton
ISSN: 2041-210X
DOI: 10.1111/2041-210X.13650