The current and future uses of machine learning in ecosystem service research



H I G H L I G H T S
• Machine learning (ML) is increasingly being used in ecosystem service research.
• ML is used for describing data and predictive modelling.
• Many ecosystem service (ES) studies lack rigour in how ML is used.
• Capacity to use ML on big ES data has not been fully realised.
• We highlight best practice for ongoing use of machine learning in ES research.


Introduction
Ecosystem service (ES) research involves the study of complex systems comprising interactions between biodiversity, human activity and the abiotic environment (MEA, 2005). The interactions underpinning ES are highly nonlinear and our mechanistic understanding of these processes is under-developed (Daw et al., 2016; Spake et al., 2017). This complexity makes implementing standard process-based modelling and statistical null hypothesis testing in ES problematic (Mouchet et al., 2014; Villa et al., 2014; Martínez-López et al., 2019). Furthermore, data relevant to ES research, e.g. remotely sensed data, often have high dimensionality, can be unstructured, and are increasing in volume at a rate beyond our ability to make sense of them using traditional approaches (Reichstein et al., 2019).
Machine learning (ML) is an emerging and rapidly developing discipline, and what constitutes ML, as opposed to other, more traditional statistical approaches, remains fuzzily defined (Bock et al., 2019). Here we broadly define ML according to Reichstein et al. (2019) as 'a field of statistical research for training computational algorithms that split, sort and transform a set of data to maximize the ability to classify, predict, cluster or discover patterns in a target dataset'. Using ML, data are empirically modelled with few or no prior assumptions about the system, using computer algorithms that can automatically learn from data. Since ML techniques can make data inferences without relying on causal theory, they can have useful application in highly non-linear, complex, and poorly characterised systems such as those producing ES. Furthermore, due to automation, ML approaches are particularly advantageous considering recent developments in social and environmental 'big data' relevant to ES research (Ghani et al., 2019; Xia et al., 2020). ML approaches are therefore a valuable expansion to traditional data analyses, and the diversity of ML techniques presents a range of opportunities for data-driven approaches to ES research (Willcock et al., 2018). As such, ML is increasingly utilised within ecology and the environmental sciences and is enabling useful data inferences in domains in which traditional data analyses have had limited utility (Lucas, 2020). ML has enabled useful inferences from data collected automatically, e.g. via remote sensing or other autonomous sensors (Lary et al., 2016), data collected without experimental design (e.g. recording of species sightings by the public; Torney et al., 2019), and open data often collected for another purpose (Rammer and Seidl, 2019).
ML is also used to analyse environmental data collected via social media platforms (Wäldchen and Mäder, 2018) or that has been generated synthetically via another modelling process (Chen et al., 2018a).
ML approaches can be divided into two main categories according to the type of task or research objective being pursued: descriptive (e.g. identifying unknown groups) and predictive (e.g. projections of future outcomes; Box 1) (Delen and Ram, 2018). Descriptive ML approaches group data with little or no prior domain-specific assumptions; they can aid in hypothesis generation and can be used to automatically sort data prior to other data analyses. This allows for rapid processing of 'big data', where dataset size and high dimensionality make organising or describing ES data by traditional methods not practically viable (Willcock et al., 2018). ML clustering and ordination can be viewed as descriptive techniques, and in ES research they can identify ES bundles or hotspots in ES supply and demand, i.e. areas where two or more ES are consistently associated (Raudsepp-Hearne et al., 2010; Mouchet et al., 2014). ML classification of remotely sensed images involves describing large and complex datasets by grouping the data into meaningful classes, often for further analyses or to aid in hypothesis generation (Maxwell et al., 2018).
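As a minimal sketch of the descriptive use case, the snippet below applies K-means clustering (an unsupervised technique) to identify candidate ES 'bundles', i.e. groups of map cells where service indicators co-occur. The data, service names and parameter values are synthetic stand-ins, not from any study reviewed here.

```python
# Hedged sketch: K-means clustering of map cells into ES "bundles".
# All data here are synthetic; cluster count is illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic per-cell indicators for three services:
# carbon storage, crop yield, recreation value.
n_cells = 300
es = np.vstack([
    rng.normal([5.0, 1.0, 4.0], 0.5, size=(n_cells // 2, 3)),  # "forest-like" cells
    rng.normal([1.0, 6.0, 1.0], 0.5, size=(n_cells // 2, 3)),  # "agricultural" cells
])

# Standardise so no single service dominates the distance metric,
# then cluster cells into candidate bundles.
X = StandardScaler().fit_transform(es)
bundles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(np.bincount(bundles))  # number of cells assigned to each bundle
```

Cells sharing a bundle label could then be mapped to locate areas of consistently associated services, as in the bundle-mapping studies cited above.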
Predictive ML techniques are used for classification and regression tasks, building models that make predictions about a system. This can allow for predictive modelling of highly non-linear systems where causal mechanisms are poorly understood (Huntingford et al., 2019). The potential for the use of ML in a data-driven approach to predictive modelling of ES has already been highlighted, and ML ES models have been shown to have comparable accuracy to conventional predictive modelling techniques (Willcock et al., 2018). ML has a range of potential advantages over other modelling approaches in ES. Firstly, the inherent difficulty in making inferences from patterns in 'noisy' biological data results in high levels of uncertainty, and different models of the same system often diverge in their predictions (Knudby et al., 2010; Willcock et al., 2019). As such, ES models may not meet the needs of stakeholders (Willcock et al., 2016; Martínez-López et al., 2019). ML models often have in-built measures of uncertainty that may be useful to stakeholders (Willcock et al., 2018). Secondly, ML often allows the combination of continuous with categorical predictor variables (Cutler et al., 2007), which is a particular advantage in modelling ES where data are often of disparate forms (Burkhard et al., 2012). Thirdly, datasets relevant to ES research can have missing or unknown data that can be problematic for model construction (Willcock et al., 2020). However, several ML algorithms (e.g. Classification and Regression Trees, some Support Vector Machines, and Neural Networks) can operate with gaps in the data without the need to impute missing data points (García-Laencina et al., 2010). Finally, ML approaches can deal with many predictors, are robust to correlations in explanatory variables, and can allow for varying functional relationships between predictor and response variables (Hochachka et al., 2007). These features make ML well suited to the analysis of complex systems with high dimensionality such as those producing ES.

Box 1. Supervised and unsupervised learning

From a learning perspective, ML algorithms can broadly be divided into two kinds: supervised and unsupervised learning. In supervised learning a response variable is specified a priori. The user first labels and groups the system input variables and supplies the algorithm with the target output variable. The algorithm then finds a function that links the inputs with the outputs such that it can make predictions of the output from a given set of input variables. Classification and regression tasks are carried out using supervised learning approaches (Jordan and Mitchell, 2015). Types of supervised learning methods include Classification and Regression Trees (CARTs), Support Vector Machines (SVMs) and Maximum Likelihood approaches. In supervised ML the dataset is split into two subsets. One subset, the training data, is used to 'train' the algorithm to carry out the task, e.g. how to classify. This training data contains the target output and the user indicates what this is. The second subset, the test data, is reserved to 'test' the performance of the algorithm in carrying out its task. In this phase the target is not supplied to the algorithm, so that the output produced by the algorithm can be compared to the target output data (Breiman, 2001). When model tuning is involved, part of the training set is held out from training and used to evaluate training performance (during training) and to assist in selecting the optimal hyperparameter values. Model tuning can substantially increase the accuracy of the ML model, with only the optimal (i.e. most accurate) model then being used on the test set (Willcock et al., 2018). However, we note that there is potential for confusion, as both the tuning and testing processes are sometimes referred to as validation in the ML literature. Some studies also test the generalisability of the model with respect to arbitrary model decisions (e.g. how the datasets are subset into training and testing data) and/or to data outside the parameter space of the training and testing data subsets. Supervised learning approaches are especially useful in predictive modelling and in the analysis of variable importance. In unsupervised learning, prior knowledge of what the output should be is not given to the algorithm; no variables are labelled as outputs by the user. Unsupervised algorithms structure data by identifying groups that the user has not indicated a priori. Cluster analysis is an example of unsupervised learning. Some types of ML, e.g. Artificial Neural Networks (ANNs), include both supervised and unsupervised approaches. Unsupervised techniques are useful for data exploration and hypothesis generation because they allow insights into unstructured data (Solomatine et al., 2009). As with other forms of data analysis, a variety of ML techniques can be used to carry out different tasks within a single study, and ML can also be used in combination with traditional techniques. For example, a clustering algorithm might be used to group data prior to a regression by either ML or another statistical approach (Crisci et al., 2012). Generally, unsupervised approaches are used for descriptive/organisational tasks whilst predictive modelling tasks tend to be carried out using supervised approaches.
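The supervised train/tune/test workflow described in Box 1 can be sketched as follows. This is an illustrative example on synthetic data: the hyperparameter grid, model choice and split fractions are assumptions, not a prescription from the literature reviewed.

```python
# Hedged sketch of the Box 1 workflow: outer train/test split, hyperparameter
# tuning on a held-out validation portion, final evaluation on untouched test data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 5))          # 5 predictors, 3 of them pure noise
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

# Outer split: the test set is reserved and never seen during training or tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Inner split: a validation set held out from training for hyperparameter selection.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_depth, best_score = None, -np.inf
for depth in (2, 5, 10):                      # illustrative hyperparameter grid
    model = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    score = r2_score(y_val, model.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Refit the tuned model on all training data; report accuracy on the test set only once.
final = RandomForestRegressor(max_depth=best_depth, n_estimators=100, random_state=0)
final.fit(X_train, y_train)
print(f"best max_depth={best_depth}, test R^2={r2_score(y_test, final.predict(X_test)):.2f}")
```

Note that, as discussed in Box 1, the validation score guides tuning while the test score is reported as the measure of model performance.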
Although automation in ML allows for rapid processing of large and complex datasets, which is clearly advantageous for both descriptive and predictive tasks considering the current challenges of 'big data', the lack of reliance on causal theory is also a potential pitfall of ML approaches. Essentially, by modelling correlations ML does not standardly incorporate any process-based theory, and this limits the generalisability of ML inferences outside of the input space of the data. It is therefore especially important that predictive ML models incorporate a process of validation whereby models are tested on independent data (Lucas, 2020). Likewise, any hypotheses or subsequent analyses based upon descriptive applications of ML should consider that the inference may not be transferable outside the parameter space (Spake et al., 2017). ML approaches are also criticised as being 'black-box', in that it can be difficult to understand how or why they work (Zednik, 2019). Whilst, to some extent, opacity can be an inherent characteristic of some ML algorithms, it is nevertheless important that ML methodologies are as transparent as possible if research utilising ML is to be robust. As such, the input data used should be available to other researchers and any model settings, software used or relevant computer code necessary to run the model should be reported.
Considering these possible benefits but also pitfalls of using ML, here we conduct a review to quantify the use of ML in ES research. The aim is to explore how ML is used in ES research for descriptive and predictive tasks, to identify and quantify trends in ML approaches for ES, and to assess ML methodological repeatability. Specifically, we: 1) quantify the use of ML for descriptive and predictive modelling tasks in ES; 2) assess the extent to which applications of ML in ES research follow transparent and repeatable methodologies; 3) quantify the extent to which ES publications report model generalisability; and 4) review the size and complexity of datasets that have been used in ML approaches to ES.

Methods
We followed a quantitative review methodology that involved a two-step search strategy. We used the Web of Science database to find publications from which information was extracted according to categorisation criteria. The aim of step one was to generate a list of relevant machine learning (ML) terms that represent the use of ML in ecosystem service (ES) research. In step one we entered the search string: "machine learning" AND ("ecosystem services" OR "ecosystem service"). The Keywords and Keywords Plus were taken from all the resulting articles, and these were then classified as being terms either relevant to ML or not according to the mutual agreement of the review team. Thus, we generated a list of 33 relevant ML terms that represent the use of ML in ES research, e.g. 'data mining', 'neural network', 'decision tree', etc. (see SI1 for a list of all Keyword and Keywords Plus terms and how they were classified). We then ran a new search by entering the search string: "relevant-key-word" AND ("ecosystem services" OR "ecosystem service") for each of the relevant ML terms identified in step one. All papers for each relevant term were assessed according to inclusion criteria: a) papers with no mention of ES in the title or the abstract were not included in the review; b) papers which did not use a machine learning algorithm as part of the data analyses were not included. Here an ML algorithm was defined as one which splits, sorts and transforms a set of data enabling it to classify, predict, cluster or discover patterns in a target dataset (Reichstein et al., 2019). Papers that met the inclusion criteria were categorised and data extracted (below). If there were over 100 papers for a given term, then random numbers were used to select 100 for inclusion in the review.
For example, for the relevant key word 'classification' there were 1779 papers, so we selected a random sample of 100; whereas for the relevant key word 'support vector machine' there were only 74, so all papers were reviewed. Note that the search was not exhaustive, because the Web of Science database is not totally comprehensive (Martín-Martín et al., 2018), but it provides a representative sample of important research in this area.
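The per-term sampling step above amounts to a simple, reproducible random draw. A minimal sketch (paper identifiers are placeholders; the seed value is an assumption added so the draw can be repeated):

```python
# Hedged sketch of the sampling rule: review all papers if a term returns
# 100 or fewer hits, otherwise draw a random sample of 100.
import random

papers = [f"paper_{i}" for i in range(1779)]  # placeholder IDs, e.g. hits for 'classification'

random.seed(2021)  # fixed seed (illustrative) so the sample is reproducible
sample = papers if len(papers) <= 100 else random.sample(papers, 100)

print(len(sample))
```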

Data extraction and categorisation criteria
From our pool of articles, we categorised all applications of ML as either descriptive or predictive. Publications that had applications of both descriptive and predictive ML were included in both descriptive and predictive categories. Such articles included, for example, studies that carried out an ML cluster analysis prior to predictive modelling. All applications of unsupervised ML (i.e., clustering, PCA etc., see Box 1) were classed as descriptive methods. We also categorised ML applications used in the classification of remotely sensed data, and ML image recognition, as descriptive, because the primary aim is to describe the data by sorting it into meaningful classes; those descriptive papers not falling into this category were termed 'organisational'. All other applications of ML were classed as predictive. These predictive models either directly predicted specified ES (hereafter 'direct ES prediction'), or did not directly predict a specified ES but were indirectly relevant to ES (hereafter 'indirect ES prediction'). For example, a study that predictively modelled carbon sequestration would be categorised as direct ES prediction, whereas one that predictively modelled forest land cover could be used to indirectly predict ES. Thus, descriptive publications could be subdivided into either a) organisational or b) remote sensing/image recognition; and predictive publications could be subdivided into either a) direct ES prediction or b) indirect ES prediction. Note that membership of the subdivisions is mutually exclusive (i.e., 'a' or 'b') but a publication could be categorised as using both descriptive and predictive approaches.
The following information was also extracted from each manuscript:
• Dataset size and complexity - The number of data points (often referred to as the number of instances in a machine learning problem) and the number of variables (attributes) in the dataset used by the ML algorithm were recorded. If more than one application of ML was used in the analysis, then the largest sample size and number of variables were recorded.
• Data availability - The data used in the ML analysis were classed as freely available if the data could be accessed for free.
• ML rationale given - Papers were considered as presenting a rationale for their use of ML if they provided an explanatory justification for its use in the analysis with reference to supporting literature.
• Generalisability - Papers were classified as having tested the generalisability of the model if: i) the impact of the training-testing subsets on the model was investigated (e.g. using cross validation to indicate how robust the model is to different subsets of the data), and/or ii) the transferability of the model outside the parameter set of the training and testing data was investigated (i.e. how well the model performs in a different geographic location or time frame; Box 1).
• Model tuning - A paper was classed as carrying out model tuning if adjustments were made to the standard parameters of the ML algorithm and these adjustments were justified either with reference to the literature or through testing of their effects on the ML output (Box 1).
• Software - A paper was classed as reporting the software if the software used to carry out the ML analyses was detailed.
• ML technique - The type of approach(es) used was recorded for each study. Approaches included: Classification and Regression Trees, Artificial Neural Networks, Bayesian, Maximum Likelihood, Support Vector Machines, and Clustering algorithms.
Firstly, the percentage of reviewed publications using each ML approach was calculated per category of ML study (Organisational, Remote sensed and Image recognition, Direct ES prediction, and Indirect ES prediction). Secondly, the percentage of publications meeting each of the other above criteria was calculated per category of ML study. Finally, the median, maximum, and minimum number of data points and variables for each category were also calculated. All analyses were carried out in R (version 4.0.4).

Results
A pool of 1012 publications resulted from the search, with a total of 308 publications applying machine learning (ML) in ecosystem service (ES) related research between 01/2008 and 07/2021 (Fig. 1; see SI2 for a comprehensive list). ML is increasingly being used in ES research and a wide variety of ML techniques are utilised for provisioning, regulating and cultural ES. In some ES studies (e.g. Funk et al., 2019; Schirpke et al., 2019; Havinga et al., 2020), ML represents part of a methodology involving a range of other statistical and modelling techniques, sometimes involving application of more than one type of ML technique. In other studies (e.g. Richards and Tunçer, 2018; Nguyen et al., 2019), ML represents the entire modelling process. In a further set of studies, different approaches are compared in terms of their ability to model similar data, either a range of ML techniques (e.g. Hirayama et al., 2019; Sannigrahi et al., 2019; Wu et al., 2019) or ML in comparison to process-based modelling (e.g. Willcock et al., 2018). The median number of data points in each publication using ML was 1138 (maximum = 9,500,430; minimum = 17; n = 225; Table 1). The median number of variables was 13 (maximum = 2317; minimum = 3; n = 215).
ML for descriptive tasks

ML was used for data description in 64% (n = 308) of studies, which can be divided into those using remotely sensed data or image recognition (53% of all studies; section 3.2) and organisational studies (11%; Fig. 1). Clustering or ordination algorithms were commonly used to identify groupings, splits or other structure in data without theoretical assumptions (19%). Organisational studies used clustering algorithms to identify ES bundles or hotspots (7% of all studies). For example, K-means cluster analysis was used to describe bundles of supply, flow and demand of ES by identifying groups of ES according to spatial concurrence (Schirpke et al., 2019), hierarchical cluster analysis was used to identify groups of ES according to social preferences (Martín-López et al., 2012), and an Artificial Neural Network (ANN) with a clustering function was used to identify bundles of ES (Liu et al., 2019).
In 16% of studies, ML clustering or dimensionality reduction was used in an additional methodological step for predictive modelling with a supervised learning technique. For example, Agglomerative Hierarchical Clustering was utilised to identify groups of structurally similar forest stands prior to the application of Random Forest to assess importance of structural variables on carbon storage (Thom and Keeton, 2019); and K-means cluster analysis was used to identify areas of homogeneous sets of species prior to the predictive modelling of floodplain biodiversity using a Bayesian Belief Network (BBN) (Funk et al., 2019).
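This cluster-then-predict pattern can be illustrated as follows. The sketch uses synthetic data and assumed variable names: an unsupervised clustering step groups observations, and the resulting cluster label is supplied as an additional predictor to a supervised model, loosely analogous to the forest-stand example above.

```python
# Hedged sketch (synthetic data): unsupervised clustering as a preprocessing
# step, followed by supervised regression using the cluster label as a predictor.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
# Synthetic "stand structure" variables and a synthetic carbon-storage response.
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + 3 * (X[:, 1] > 0) + rng.normal(0, 0.2, 200)

# Step 1 (unsupervised): group structurally similar observations.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Step 2 (supervised): predict the response from structure plus cluster label.
X_aug = np.column_stack([X, labels])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_aug, y)
print(f"training R^2 = {rf.score(X_aug, y):.2f}")
```

In practice the supervised step would be evaluated on held-out data, as described in Box 1; only the preprocessing pattern is shown here.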

ML for remote-sensing and image recognition
ML was implemented in publications using remotely sensed data (52%; n = 308; Fig. 1) for feature extraction or the classification of remotely sensed images to produce land cover maps (Zhang et al., 2016; Traganos and Reinartz, 2018; Erker et al., 2019; Pouliot et al., 2019; Trinder and Liu, 2020) or for landscape or vegetation feature extraction from remotely sensed images (Chen et al., 2018b; Jiang et al., 2018; Dash et al., 2019; Fujimoto et al., 2019). In other studies (12%), remotely sensed data are used as one of a range of spatially explicit predictor variables to model, e.g., carbon storage (Sanderman et al., 2018; Silveira et al., 2019; Havinga et al., 2020), land use and ES change (Liu, 2014; Mahmoud and Gan, 2018; Hashimoto et al., 2019), or other ecological predictions such as bark beetle outbreaks (Rammer and Seidl, 2019). ML was also utilised in descriptive image recognition tasks, such as cultural ES studies involving the analysis of large datasets from social media platforms using an ANN (3%). Online ANN image analysis models, specifically Deep Convolutional Neural Networks on the cloud computing platforms Google Cloud Vision (Google Cloud Vision, 2021) and Clarifai (Clarifai General Model, n.d.), were used to analyse the thematic content of user-uploaded geo-tagged photographs on Flickr, and clustering algorithms were used to group the photographs according to the themes. These themes were used as indicators of cultural ES, and were combined with spatial and temporal information associated with the photographs, enabling modelled cultural ES mapping (Richards and Tunçer, 2018; Bernetti et al., 2019; Gosal et al., 2019; Chang et al., 2020; Gosal and Ziv, 2020; Runge et al., 2020). Similarly, an ANN image analysis model was used to classify geo-tagged photographs from Wikiloc, a sports photo-sharing platform (Wikiloc, 2021), according to thematic content, and inferred cultural ES were mapped (Callau et al., 2019).

ML for predictive modelling
ML was used in predictive modelling in 44% (n = 308) of publications. A wide range of ML techniques were used for predictive modelling of ES (Fig. 1). Classification and Regression Trees (CARTs), a form of supervised learning (Box 1), are the most widely used approach (60%, n = 308), and Random Forest (RF) (44%; Fig. 1) is an especially popular example of a CART. CARTs were used in supervised classification tasks to predict membership of a user-labelled class. For example, RF was used in the process of modelling timber production by predicting the age-class of forestry tree species from remotely sensed and historic forestry data (Gao et al., 2016). CARTs were also used in supervised regression tasks. For example, RF was used in modelling carbon-diversity hotspots in agricultural soil from remote sensing, terrain and climate variables (Silveira et al., 2019), and a regression tree model was used to predict soil carbon stocks under future land use and climate change from soil survey data (Adhikari et al., 2019).
ES studies have used other supervised ML techniques in predictive modelling: 26% used an ANN, 24% used a Support Vector Machine (SVM), and 4% used a BBN. For example, an ANN was used in a regression task to predict rice crop yields from environmental and socio-economic variables (Dang et al., 2019), and a BBN was used in a classification task to predict firewood use from environmental and socio-economic variables (Willcock et al., 2018). ANNs were also used to predict future land use change (e.g. Akinyemi and Mashame, 2018; Beygi Heidarlou et al., 2019; Hashimoto et al., 2019). In addition to the prediction of target variables, some techniques, most notably CARTs, were used to assess variable importance or for the selection of relevant predictor variables. For example, RF was used to identify the most important variables controlling organic carbon stocks in agricultural soils (Mayer et al., 2019) and in forest stands (Thom and Keeton, 2019), and a CART was used to assess variable importance for the supply of a range of provisioning and regulating ES in an agroecosystem (Rositano et al., 2018).
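The variable-importance use of CARTs mentioned above can be sketched with Random Forest's built-in importance measure. The data are synthetic (only the first predictor actually drives the response), so the ranking the model should recover is known in advance.

```python
# Hedged sketch: ranking predictors by Random Forest feature importance.
# Synthetic data in which only predictor 0 influences the response.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = 4 * X[:, 0] + rng.normal(0, 0.5, 400)  # predictors 1-4 are pure noise

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most to least important
print("most important predictor:", ranking[0])
```

Note that impurity-based importances can be biased towards high-cardinality predictors; permutation importance is a common alternative check.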

Repeatability, model tuning and generalisability
Altering ML model settings can optimise model performance (Box 3). However, 67% of publications (n = 308; Fig. 2) do not report any model tuning, and only 56% of all publications report the model settings used (50%, organisational; 57%, remote sensing; 50%, predictive direct; 63%, predictive indirect). Of those studies that do detail the model settings used but do not experiment with model tuning, 51% (n = 102) give justification with reference to the literature; the rest provide no explanatory justification for the particular model settings chosen. Some publications (4%; n = 308) do not report in their methods the kind of data used (e.g., categorical or nominal) as input or output in the ML model. Most publications (61%) give a rationale for the use of ML rather than an alternative modelling approach, but many studies do not. Publications tend to detail the software and the version used, but 28% do not report what software was used to carry out the ML technique. Model input data are sometimes freely available via supplementary material or an open data source, but this is not the case in half of publications. Less than half of all publications reviewed report testing the generalisability of the ML model (Box 3) used within their study with an independent dataset (41%, all publications).

Discussion
Machine learning (ML) is used in ecosystem service (ES) research as both a descriptive tool, where automation enables speedy processing of high volumes of complex data, and in predictive modelling, in which accurate predictions can be made about ES. The variety of ways in which ML is incorporated in ES research methodologies highlights its value as an adaptable extension to traditional data analyses across all ES domains. Supervised ML approaches such as Classification and Regression Tree (CART) and Artificial Neural Network (ANN) algorithms tend to be used for predictive modelling tasks, whilst descriptive tasks are often carried out using unsupervised ML, such as clustering algorithms to group data (Fig. 1). While there are examples of studies that apply ML with a repeatable and rigorous methodology (Box 3), many studies fall short of methodological best practice, failing to report which software was used, model settings or tuning, or tests of generalisability (Fig. 2). In some instances these methodological shortcomings affect the repeatability of the study, such as not being able to identify the exact algorithm used, but in other instances they might mean that the findings of the study are flawed. We suggest that future studies may use the findings of poorly reported models, but should do so with caution. Such models may well be valid, but the lack of repeatability means that this validity cannot be independently tested. For example, algorithm parameter optimisation has been shown to affect ML model accuracy (Daelemans et al., 2003), so using default model settings might lead to reduced model performance. Thus, if a paper does not report model tuning, then it is likely that the authors used the default parameters in the relevant software's model settings.
This may mean that, given the data the authors had at their disposal, the model presented may not be the best-fit model to those data, and likely has higher uncertainty than could be achieved if tuning were performed. Similarly, without testing generalisability on an independent dataset, an ML model might be 'overfitted' to the data; this results in poor model accuracy when applied to new data from a parameter space that was not used to train the model, and so such application should be made with caution (Hawkins, 2004; Kuhn and Johnson, 2013). The potential impact of these methodological shortcomings varies with the type of ML approach used and the task for which the ML is being used. For example, the effect of altering algorithm hyperparameters away from defaults (Box 3) varies between ML techniques; e.g. increasing the number of tree splits in a Random Forest above the default setting may have a marginal effect on model accuracy (Kulkarni and Sinha, 2012), compared to the large effect on model performance that can result from altering the number of hidden layers in an ANN (Srivastava et al., 2014). However, this largely depends on the problem at hand, and therefore an investigation of hyperparameters is always recommended. Likewise, there is arguably less need to test for generalisability when, for example, using a CART to estimate variable importance than for a predictive classification model, because an estimation of variable importance does not explicitly generalise beyond the learnt parameter space (Kuhn and Johnson, 2013). Furthermore, for some descriptive tasks, testing generalisability may not be necessary, such as for some basic data sorting tasks or in applications to aid hypothesis generation (Lucas, 2020).
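One of the generalisability checks discussed above, robustness to the choice of training/testing subsets, can be sketched with k-fold cross-validation. The data and model here are synthetic placeholders; note this does not test transferability to new regions or time frames, which requires genuinely independent data.

```python
# Hedged sketch: k-fold cross-validation as a check that model accuracy is
# not an artefact of one arbitrary train/test split (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 300)

scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y, cv=5, scoring="r2"
)
# A small spread across folds suggests robustness to the split choice.
print(f"mean R^2 = {scores.mean():.2f} +/- {scores.std():.2f}")
```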
We found some examples of studies that use large and complex datasets (Box 2), but the capacity of ML to analyse available 'big data' has not yet been fully realised in ES research (Table 1). In remote sensing studies, large amounts of data are generated from satellites and from manned and unmanned aerial vehicles. Automation in ML allows for rapid and accurate processing of these datasets (Lary et al., 2016). Due to its capacity to process data of high dimensionality and to map classes with complex characteristics, ML is an effective and efficient geoscientific classification method, and is now the standard approach for remote sensing image classification (Maxwell et al., 2018). In ES research, classification of remotely sensed images can provide estimates of the spatial distribution of ES supply via mapping of ES proxies, such as land use and land use change (Martínez-Harms and Balvanera, 2012), or of factors that drive ES supply, namely ecosystem service providers, ecosystem processes and functional traits (Andrew et al., 2015). That remote sensing ML methods tend to have a higher degree of repeatability and generalisability and utilise larger datasets compared to other methods (Fig. 2, Table 1) is likely testament to the maturity of ML use in the field of remote sensing. However, it also suggests the under-utilisation of ML in areas of ES research not associated with remote sensing, or that other areas of research have not amassed such large amounts of data.
In conducting our review, we noticed that the use of ML in ES research tends to focus on predictive modelling of the potential biophysical supply of ES, often indirectly via ES proxies such as land cover or via hypothesised service providers. In these areas of ES research, ML can offer advantages over process-based models and standard statistical modelling in terms of improved predictive accuracy and the ability to make use of disparate kinds of data. However, this is a relatively narrow subset of ES research and there is scope for further utilisation of ML in other areas, including ES demand and flows. For example, ES can be defined in terms of interactions between service providers and service beneficiaries. In this sense they are co-produced, and to inform land management and policy decisions, ES research needs to quantify the supply of ES relative to demand (Burkhard et al., 2012).
Thus, ES modelling could better incorporate social science data (Daw et al., 2016). This has been explored in part through the analysis of large datasets from social media platforms using deep convolutional neural networks (DCNNs; e.g. the automated content analysis tool Google Cloud Vision (Google Cloud Vision, 2021); Gosal et al., 2019), which highlights the potential for ML to utilise very large social media datasets (Runge et al., 2020). However, ES studies using social media have been largely limited to data from single platforms, and there is further potential to use ML across a variety of social media platforms to analyse cultural ES (e.g. Ruiz-Frau et al., 2020). More generally, social science datasets potentially relevant to ES research seem yet to be utilised. For example, it is established that there is a need to better understand the flows of ES to beneficiaries (Bagstad et al., 2013) and to better incorporate ES demand into predictive models (Martínez-López et al., 2019). However, whilst big data from social science has recently been used effectively in other disciplines (e.g. in the development of human mobility theory; Alessandretti et al., 2020), such data has yet to be used by ES researchers. The availability of big data from social science, together with the capacity of ML both to utilise data from mixed sources effectively and to deal with a high number of variables, suggests that ML could support a more holistic, system-scale modelling approach that captures the co-productive nature of ES.
Box 2 Examples of ecosystem service (ES) studies using machine learning (ML) that demonstrate the benefits of ML approaches.
Many of the papers we reviewed highlight the benefits ES science can derive from adopting ML methods: • Big data - ML allows for the rapid processing of data, and one of its key strengths is that it can support analysis of larger datasets than many conventional methods (Reichstein et al., 2019). Richards and Tunçer (2018) analyse over 20,000 images uploaded to the photo-sharing platform Flickr, using Google Cloud Vision (an ML service for image analysis; Google Cloud Vision, 2021) to classify the thematic content of the images and map recreational beneficiaries. The time required to classify so many images manually would make this task impractical without the use of ML.
• Clustering - ML enables the grouping of data without the use of domain-specific theory. In ES science this has useful applications in identifying bundles of ES provision or groups of ES beneficiaries. Schirpke et al. (2019) use K-means cluster analysis to identify areas of the European Alps where ES repeatedly occur together. Gosal et al. (2019) use the Ward-D clustering algorithm to identify six groups of recreational beneficiaries in the Camargue, based on annotations of photos uploaded to Flickr.
• Uncertainty measures -Transparent estimates of model uncertainty are produced as an integral part of many ML predictive modelling algorithms. These measures can be useful to decision makers who can determine acceptable levels of uncertainty and use their own expertise for potentially contentious decisions. Willcock et al. (2018) model fuel use in South Africa and biodiversity in Sicily using ML Bayesian Belief Networks. They report associated estimates of uncertainty which were produced as part of the modelling process and highlight that the level of certainty might influence management decisions as well as the predicted level of ES.
• Hypothesis generation and variable importance assessment -In addition to the prediction of target variables, ML allows for the assessment of variable importance and the selection of relevant predictor variables. Mayer et al. (2019) use the Random Forest algorithm (an example of a classification and regression tree) to identify the most important variables controlling organic carbon stocks in agricultural soils in Bavaria. They input 13 predictor variables and the algorithm identified the variables that explained the majority of variance in carbon stocks. This identification of important variables aids in the generation of hypotheses, e.g., theory about why these variables determine carbon stocks.
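The kind of bundling analysis described in the clustering example above can be sketched with K-means; the table of ES indicators, the number of clusters and the variable scaling below are illustrative assumptions rather than the data or settings of Schirpke et al. (2019):

```python
# Sketch: clustering landscape units into ES 'bundles' with K-means.
# Synthetic data; rows = landscape units, columns = ES indicators
# (e.g. carbon storage, recreation, fodder production).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
es_indicators = rng.random((200, 3))

# Standardise so each service contributes equally to the distance metric
scaled = StandardScaler().fit_transform(es_indicators)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
bundles = kmeans.labels_          # bundle membership per landscape unit
print(np.bincount(bundles))       # number of units in each bundle
```

In practice the number of clusters would be chosen by inspection or a criterion such as the silhouette score, and the resulting labels mapped back onto the spatial units.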
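Per-prediction uncertainty of the kind discussed in the uncertainty example above is exposed by many ML algorithms. The sketch below uses class probabilities from a Random Forest as an analogous, widely available measure; it is not the Bayesian Belief Network approach of Willcock et al. (2018), and the data and threshold are synthetic:

```python
# Sketch: per-prediction uncertainty via a classifier's class probabilities,
# which a decision maker could screen against an acceptable threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
proba = model.predict_proba(X_test)   # one probability per class, per site

# Act only where the model is sufficiently certain (0.8 is illustrative)
confident = proba.max(axis=1) >= 0.8
print(f"{confident.mean():.0%} of predictions exceed the 0.8 threshold")
```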
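A minimal sketch of the variable-importance ranking described above, in the spirit of Mayer et al. (2019); the predictor names, the response and the relationship between them are synthetic, constructed purely for illustration:

```python
# Sketch: ranking predictor importance with a Random Forest.
# Synthetic data: the response is constructed to depend mainly on one predictor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
predictors = {
    "clay_content": rng.random(n),
    "mean_temp": rng.random(n),
    "elevation": rng.random(n),
}
X = np.column_stack(list(predictors.values()))
# Hypothetical carbon stock driven mainly by clay content, plus noise
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(predictors, model.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.2f}")
```

The top-ranked variables then become candidates for hypotheses about the mechanisms driving the service, to be tested with domain theory or further data.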
Our review of 308 ES papers using ML revealed wide variation in ML protocols. Here, we highlight a sample of papers that we consider provide a 'gold standard', or best practice, for key aspects of ML reporting.
• Methodological transparency - Each application of ML needs to be fully repeatable. As such, the input data used should be available to other researchers; ideally the data would be open access, with links to data sources provided in the publication. Furthermore, any model settings, the software used and the computer code necessary to run the model should be reported. For example, Rammer and Seidl (2019) alter model structure, such as network size, and parameters of the training process, including the loss function and optimizer used. They iteratively report the evaluation of model variations by calculating performance measures, including model accuracy, precision, recall, F1 score, conditional Kappa and the True Skill Statistic, for each model run. The source code used to build their model is available at: https://github.com/wernerrammer/BBPredNet.
• Generalisability - Model testing, where model performance is assessed using a random subset of the data not used to train the algorithm, is an integral part of most supervised learning algorithms. However, without validation against an independent dataset outside the parameter space of the training-testing data, an ML model might be 'overfitted' and not generalisable to other places or times (Hawkins, 2004), resulting in poor model accuracy when applied to new data that were not used to train the model (Alpaydin, 2020). This can be addressed by dividing the dataset in two, with one set used for training and the other for testing generalisability, or by additionally testing the model on a dataset outside the learnt parameter space. For example, Hashimoto et al. (2019) use an Artificial Neural Network to predict future land use change from historical land use data. They model land use change using historic data for 1997 and 2006, randomly splitting the data 50/50 between training and testing (n = 1275), and also reserve an independent dataset (data for 2014) for testing model generalisability (n = 1275).
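The performance measures named in the transparency example above can be computed and reported in a few lines; the labels below are synthetic, and the True Skill Statistic is derived from the confusion matrix under the usual binary definition (sensitivity + specificity − 1):

```python
# Sketch: reporting standard classification performance measures.
# Synthetic labels; in a real study these come from held-out test data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("kappa    ", cohen_kappa_score(y_true, y_pred))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tss = tp / (tp + fn) + tn / (tn + fp) - 1   # True Skill Statistic
print("TSS      ", tss)
```

Reporting several complementary measures, rather than accuracy alone, guards against misleading results on imbalanced classes.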
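The train/test/independent-set design described in the generalisability example above can be sketched as follows; the arrays, sample sizes and model settings are illustrative stand-ins, not the data of Hashimoto et al. (2019):

```python
# Sketch: a 50/50 train-test split within one period, plus an independent
# later period reserved for testing generalisability. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# 'Historic' data, standing in for the 1997/2006 records
X_hist, y_hist = make_classification(n_samples=600, n_features=6,
                                     random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X_hist, y_hist, test_size=0.5, random_state=3)   # 50/50 split

# 'Future' data, standing in for the independent 2014 set
X_indep, y_indep = make_classification(n_samples=300, n_features=6,
                                       random_state=4)

model = MLPClassifier(max_iter=1000, random_state=3).fit(X_train, y_train)
print("test accuracy       ",
      accuracy_score(y_test, model.predict(X_test)))
print("independent accuracy",
      accuracy_score(y_indep, model.predict(X_indep)))
```

A large gap between the two accuracies is the signature of a model that performs well within its training domain but does not generalise.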
The use of ML in ES research, whilst increasing, is still in its infancy. As such, ES scientists can benefit greatly from the experience of other disciplines. For example, recent developments in deep learning algorithms have enabled detailed modelling of spatial-temporal dynamics in the Earth Sciences (Reichstein et al., 2019) and this is potentially applicable in a dynamic holistic ES modelling approach. In addition, hybrid ML models, which combine purely data-driven machine learning modelling with theory-bound, process-driven approaches, have been shown to have improved predictive power outside of the learnt parameter space in areas such as climate science (Huntingford et al., 2019) and could be useful in the development of more transferable ES models.
In conclusion, this review found that a wide range of ML approaches have been used effectively in a variety of ES studies and that ML offers exciting potential in future ES research. However, for the full potential of ML in ES to be realised and confidently used by stakeholders, ML models should be transparently reported and readily repeatable (Martínez-López et al., 2019). Our review identifies 'gold standard' studies that exemplify methodological best practice and could be used as a benchmark for ML reporting in ES research.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.