Benchmark estimates for aboveground litterfall data derived from ecosystem models

Litter production is a fundamental ecosystem process, which plays an important role in regulating terrestrial carbon and nitrogen cycles. However, there are substantial differences in the litter production simulations among ecosystem models, and a global benchmarking evaluation to measure the performance of these models is still lacking. In this study, we generated a global dataset of aboveground litterfall production (i.e. cLitter), a benchmark as the defined reference to test model performance, by combining systematic measurements taken from a substantial number of surveys (1079 sites) with a machine learning technique (i.e. random forest, RF). Our study demonstrated that the RF model is an effective tool for upscaling local litterfall production observations to the global scale. On average, the model predicted 23.15 Pg C yr−1 of aboveground litterfall production. Our results revealed substantial differences in the aboveground litterfall production simulations among the five investigated ecosystem models. Compared to the reference data at the global scale, most of models could reproduce the spatial patterns of aboveground litterfall production, but the magnitude of simulations differed substantially from the reference data. Overall, ORCHIDEE-MICT performed the best among the five investigated ecosystem models.


Introduction
Litterfall is a particularly key process for determining the carbon and nutrient cycling of terrestrial ecosystems, and it controls the main respiration substrates on the forest floor (Roig et al 2005, Chen andChen 2018). The magnitude of litterfall regulates the rate of soil respiration and soil organic carbon content indirectly (Sayer 2006, Hansen et al 2009, Neumann et al 2018. Moreover, litterfall maintains the soil fertility as it is the most important resource of soil organic matter and soil nutrients (Gairola et al 2009). Litterfall can also regulate the properties of the underlying surface by changing the hydraulic conductivity and albedo (Liu et al 1997), and impact the responses and feedbacks of terrestrial ecosystems to climatic conditions (Winkler et al 2010). Therefore, litterfall is one of the key parameters in measuring, modeling and predicting terrestrial ecosystem dynamics (Liski et al 2005).
All ecosystem models have simulated litterfall production and its spatial variability, however, to date, different models remain inconclusive regarding the magnitude of production. For example, Rotmans and Den Elzen (1993) used an empirical model to estimate total litterfall flux (including aboveground and belowground) on a global scale at 47.5 Pg C yr −1 , which is 3.28 times the estimate of 14.5 Pg C yr −1 , reported by Lonsdale (1988). These results imply that the models, which have been well validated on parameters related to productivity (and then are in good agreements), are Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence.
Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. totally different in simulating litterfall production, and have not been compared against data. Therefore, these models should be evaluated over regional scales against defined references (i.e. benchmarks) that can be used to diagnose their strengths and deficiencies for future improvement. However, to our knowledge, no study has been conducted to evaluate model performance with respect to the litterfall production, and there is still very limited knowledge of the performance capabilities. Therefore, benchmark analysis is urgently needed to evaluate ecosystem models against observations as it allows us to identify uncertainties in predictions, as well as guides the priorities for model development (Luo et al 2012).
The most critical component of any benchmark analysis is to define the benchmarks, which must be objective, effective, and reliable for evaluating model performance. Currently, as advanced empirical models, machine learning methods have been increasingly developed for explicitly quantifying carbon cycle variables regionally and globally (Jung et al 2010, Xia et al 2018. Several approaches including artificial neural networks, regression trees, support vector regression, and random forest (RF) have been widely employed to predict regional biomass and other carbon cycle variables. Machine learning methods are independent of the relationships between response variables and predictive variables, especially when compared with traditional empirical models such as linear regression that requires a Gaussian distribution for the input variables. More importantly, the increasing observations now available cover a wide range of geographic and climate regions, which is of clear value for upscaling site-level observations to the regional scale. In this study, we combined four global aboveground litterfall production datasets and the related environmental conditions. The primary objectives of this study were to: (1) develop the RF model for predicting aboveground litterfall production by using adequate site observations, (2) quantify the aboveground litterfall production based on the RF model, and (3) conduct a benchmarking evaluation of ecosystem models regarding to the aboveground litterfall production.

Aboveground litterfall production datasets
In forest ecosystems, aboveground litter includes mainly foliage, branches, bark, and reproductive organs, and usually, foliage litterfall occupies a major fraction of total litterfall (Liu et al 2003).
In this study, we used four datasets of observed aboveground litterfall production data (Zhang et al 2014, Holland et al 2014, Jia et al 2016, Neumann et al 2018 and some other relevant studies were also included (Mina 1955, Remezov et al 1959, Marchenko and Karlov 1962, Rodin and Bazilevich 1967, Goryshina 1974a1974b, Ranawat and Vyas 1975, Gaur and Pandey 1978, Djhalilov and Safarov 1981, Jakucs 1985, Breymeyer 1991, Berg et al 1993, Vedrova 1995. These observations were collected from published studies, and have been critically reviewed and incorporated into comprehensive forest litterfall datasets. Litterfall data were measured by litterfall-trap experiments. Replicate measurements obtained from a series of plots or litterfall traps at a single site were averaged. In these datasets, litterfall production referred to plant material shed in one year, and was composed primarily of leaves, twigs (<2.5 cm in diameter), flowers, fruits and bark. Dead roots and coarse woody detritus (CWD) were not included. CWD include a wide variety of types and sizes of materials. Generally, items larger than 2.5 cm in diameter are referred to as CWD. In total, the litterfall datasets included measurements from 1079 study sites. The observation sites dispersed across various climate zones with latitudes from −42.45 to 71.3 and longitudes from −156.7 to 176.3, and covered different forest types (figure 1). To match our observed aboveground litterfall data, CWD were excluded from the aboveground litterfall outputs of the five ecosystem models.

Random forest
RF is a machine learning method for classification and regression. It combines tree predictors, such that each tree depends on the values of a random vector that is sampled independently, with the same distribution for all trees in the forest. RF operates by constructing a multitude of decision trees for a given training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The generalization error for RF converges to a limit as the number of trees in the forest increases. The first algorithm for random decision forests was created by Ho (1998) using the random subspace method, which is an extension of the algorithm developed by Breiman (2001). This study constructed the aboveground litterfall production model based on RF in R. The R package 'randomForest' used in the study has been modified by Liaw and Wiener (2002) from the original Fortran by Breiman and Cutler (https://cran.rproject.org/web/packages/randomForest/).
We developed a predictive aboveground litterfall production model using RF. The explanatory variables included: mean air temperature, maximum air temperature, minimum air temperature, precipitation, relative humidity, wind speed, solar radiation and the normalized difference vegetation index (NDVI), by year and by the four seasons (i.e. winter, spring, summer and autumn), for a total of 8 variables. The period of the time series of all these variables was from 1982 to 2013. NDVI were derived from the Global Inventory Modeling and Mapping Studies (GIMMS) datasets and the rest meteorological factors were derived from the CRU-NCEP (National Centers for Environmental Prediction) v8 datasets. We used complete combinatorial method to produce the optimal combination out of the 8 variables. Full combinations of 2-8 variables were examined, and totally there are combinations of 247 variables (1235 combinations for four seasons and one year). To select the best model, we evaluated the performance of each model based on their root mean squared error (RMSE). For each model, 80% of the observations were selected randomly for training, leaving 20% for validation, and the model was run 2000 times.
To analyze the model uncertainty of the RF model, we used different forcing datasets to produce an ensemble of RF predictions. In our RF model, mean annual temperature and NDVI were selected as predictor variables of aboveground litterfall production. Thus, we used four different air temperature datasets of CRU-NCEP (National Centers for Environmental Prediction) v8, Climate Research Center (CRU) TS3.25, the Modern Era Retrospective-Analysis for Research and Applications, Version 2 (MERRA-2) and ERA-Interim, together with NDVI of the GIMMS to force the RF model. The time period of these input data was from 1982 to 2013. Then we took the median absolute deviation (MAD) across different ensemble members as model uncertainty for the RF model ( figure 3(b)).

Terrestrial ecosystem models
In this study, we compared the RF with five terrestrial ecosystem models about their predictions or estimations of aboveground litterfall production. These models included: BioGeochemical Cycles  Ito and Oikawa 2002). While VISIT was driven by Climate Research Unit (CRU) TS3.25 datasets (http://data.ceda.ac.uk/badc/cru/data/cru_ ts/cru_ts_3.25/), the others were forced with CRU-NCEP (National Centers for Environmental Prediction) v8 datasets (https://vesg.ipsl.upmc.fr/thredds/ catalog/work/p529viov/cruncep/V8_1901_2016/ catalog.html). Both of two climate datasets are derived from quasi-point based measurements and have the same spatial scale (0.5°). CRU-NCEP data are provided every 6 h and CRU data are provided monthly. CRU-NCEP data averaged on a monthly time step are equivalent with CRU data, in that CRU-NCEP data are based on CRU monthly data and NCEP data are used to simulate the 6-hourly variability of different parameters. Consequently, it is reasonable to consider that all models including VISIT were based on the same climate forcing. This study examined the performance of the RF model and the five ecosystem models, and the results were presented with a Taylor diagram ( figure 4). Specifically, we used Pearson's correlation to evaluate the relationship between the observed and simulated aboveground litterfall production. Furthermore, we used the centered pattern root mean square (RMS) difference in order to isolate the differences in models from differences in the means of the observed and simulated patterns.
In this study, we also analyzed the correlation between aboveground litterfall production of the five ecosystem models and their leaf area index (LAI) and heterotrophic respiration (RH). The LAI and RH estimations were derived from the ecosystem model that used in this study respectively. RH estimations include the CO 2 emission from the decomposition of litterfall and soil organic carbon.

Litterfall predictions by RF method
Based on the aboveground litterfall production observations, we evaluated model performance on all possible combinations of the explanatory variables (see section 2.2). Annual mean air temperature and annual mean NDVI were found to be the best combinations for predicting litterfall flux. We then validated the model in the spatial domain using cross-validation. To analyze the performance of the RF, the average predicted litterfall production of the same site from 20% validation datasets of each model run were calculated (figure 2). The RMSE of the aboveground litterfall production for the RF was 0.0994 kg C m −2 yr −1 . Based on the RF model, we generated global aboveground litterfall production from 1982 to 2013. On average, there was 23.15 Pg C yr −1 aboveground litterfall production  during this period. Aboveground litterfall production was the highest in the tropical moist forest regions and lowest in cold tundra and dry desert regions ( figure 3(a)). Overall, the model uncertainties of aboveground litterfall production for the RF were small and there were larger uncertainties in low latitude area ( figure 3(b)). As figures S2 and S3 showed, both annual mean air temperature and annual mean NDVI (driving data for the RF) were positively correlated with cLitter predicted by the RF, implying that the two identified drivers theoretically affected the prediction of aboveground litterfall production for the RF.

Model-data comparison
Based on the observations of aboveground litterfall production, we examined the performance of the RF model and the five ecosystem models (figure 4). Here we used Pearson's correlation to evaluate the relationship between the observed and simulated aboveground litterfall production. Furthermore, we used the centered RMS difference in order to isolate the differences in models from differences in the means of the observed and simulated patterns. The centered RMS difference approaches zero as two patterns become more similar. The standard deviations of Biome-BGC was the most similar to the reference point, the distance from which to the origin indicates the standard deviation of the observed value. For the correlation coefficient and the centered RMS difference, the RF's correlation coefficient was the largest while its RMSE was the lowest, which indicates that the RF's performance was the best when modeling aboveground litterfall production. Overall, ORCHI-DEE-MICT performed the best among the five ecosystem models.

Benchmarking evaluation
There were large differences in the simulated aboveground litterfall production among the five investigated ecosystem models , the simulations derived by the LPJ model (91.41 Pg C yr −1 ) was 9 times the estimate of 9.85 Pg C yr −1 , derived from the Biome-BGC model (figure 5(g)). Specifically, LPJ simulated high aboveground litterfall production at high latitudes. In terms of the global aboveground litterfall production trend, five of the six models (i.e. RF, Biome-BGC, LPJ, ORCHIDEE-MICT and VISIT) showed a significant cLitter increase from 1982 to 2013, with the trend ranging from 0.13 to 0.92 g C m −2 yr −1 (figure 6). The greatest increase of aboveground litterfall production was found in the VISIT model ( figure 6(f)). The IBIS model presented relatively constant long-term changes (0.004 g C m −2 yr −1 , p=0.92). In general, the trend of cLitter for Biome-BGC was the nearest to the RF's, implying that the simulations of cLitter's time variation for Biome-BGC were the best among the five ecosystem models.
The LPJ model simulated the extremely high leaf litterfall at high latitudes, and which probably results from two potential causes. First, LAI for LPJ is high in the latitude around 60 degrees ( figure 8(g)). Second, the leaf longevity of boreal needleleaf evergreen trees for LPJ is 2 years (table S1 is available online at stacks.iop.org/ERL/14/084020/mmedia), which is smaller than the other models. For instance, the leaf longevity of evergreen needleleaf forest-cool climate for Biome-BGC is 4 years and the leaf longevity of boreal needleleaf evergreen trees for ORCHIDEE-MICT is 2.49 years (table S1). In contrast, the leaf longevity in tropical for LPJ is the longest, which makes the cLitter for high latitudes for LPJ much higher than the other models and the cLitter in tropical area extremely low.

Relationships between LAI and litterfall production
In order to explain the substantial differences of aboveground litterfall production between the five ecosystem models, we analyzed the relationship between LAI and aboveground litterfall production for these models. The LAI was derived from each ecosystem models used in this study respectively. The simulations of LAI have strong connection with the simulations of cLitter in terrestrial carbon cycle models. We analyzed the relationship between the cLitter and LAI of the outputs for all models, their correlation coefficients are shown in figure 7. Additionally, the ratios of the different correlations (significant negative correlation, negative correlation, positive correlation and significant positive correlation) are shown in figure 7. In this study, the RF model was only used to predict aboveground litterfall production. We did not use the RF model to predict LAI, so we analyzed the relationship between cLitter and its forcing data, the GIMMS NDVI. For Biome-BGC, IBIS, ORCHIDEE-MICT and VISIT, cLitter was significantly and positively correlated with LAI in most areas. However, for RF and LPJ, the correlations were much weaker, especially for LPJ ( figure 7(d)).
There were obvious differences in the LAI simulations of the five ecosystem models ( figure 8(f)). The same as with cLitter, the simulated values of LAI for LPJ were the largest, while those for Biome-BGC were the smallest. We also analyzed the LAI trend of the five ecosystem models. Figure 8 illustrates that there were slight increasing trends in the LAI simulations of

Discussion
Our results demonstrated that in combination with other satellite-derived and climatic variables, the RF performed well for predicting litterfall production, as was confirmed by cross-validation and Taylor diagram (figures 2 and 4). Often, the performance of the datadriven methods are highly restricted by the quantity of training datasets (Chen et al 2014). Our current predictions greatly benefited from the abundant observations, which covered the major geographical and climate regions.
The predicted aboveground litterfall production by the RF model showed a strong correlation with the satellite-based GIMMS NDVI ( figure 7(a)). Previous studies have shown that leaf litter substantially contributes to total litterfall, and the contribution percentages from 64%-73% (Zhang et al 2014). All five investigated ecosystem models consistently exhibited strong relationships of simulated LAI on aboveground litterfall production (figure 7), which implies that the leaf area simulation is important for reproducing the aboveground litterfall production. However, there remain large uncertainties in predicting leaf area in the ecosystem models due to certain ecological processes involved, such as vegetation production and carbon allocation (Gower et al 1999, Kucharik et al 2000, Xia et al 2015. Moreover, leaf longevities and turnover rates are important plant traits that substantially determine litterfall production (Schleip et al 2013). Existing vegetation models usually assumed inaccurate leaf longevities and turnover rates for each plant function type (PFTs) (Kucharik et al 2000, Sitch et al 2003. White et al (2000) reported that the specified leaf longevities in Biome-BGC showed large discrepancies with the observed  values in the United States. Similarly, Zhang et al (2016) also indicated that there were significant differences between the observed and default leaf longevities of ecosystem models for four major evergreen PFTs. For example, the observed leaf longevity of boreal needleleaf forest was greater than three times the default value (Zhang et al 2016). The significant errors and uncertainties in leaf longevities and turnover rates resulted in more than 10% of predicted errors for aboveground litterfall production (Zhang et al 2016). There are large differences in leaf longevities between different ecosystem models and observations. For example, the leaf longevity of needleleaf evergreen forest in Biome-BGC was 4 year, which was almost three times the leaf longevity in VISIT and nearly one third of the observed values (table S1; Zhang et al 2016). Therefore, more accurate leaf longevities are critical for simulating aboveground litterfall production.
Our results also indicated the substantial regulations of litterfall production on heterotrophic respiration. In all five investigated ecosystem models, there were strong correlations between aboveground litterfall production and heterotrophic respiration (figure S1). Specifically, for ORCHIIDEE-MICT and VISIT, nearly 90% and 80% of the vegetated areas showed significant positive correlations between these variables (figures S1(d) and (e)). Numerous field experiments have highlighted that the aboveground litterfall production supply is a significant source of heterotrophic respiration (van Groenigen et al 2014). A meta-analysis of 100 published experimental studies found that a 100% aboveground litter addition (i.e. double litter) increased heterotrophic respiration by 26.1% (Chen and Chen 2018). Additionally, increased inputs of fresh organic matter resulting from litterfall could result in 'priming effects'. Priming is the extra decomposition of soil organic matter that occurs when microbes are stimulated by the addition of easily decomposable organic matter (Bingeman et al 1953), causing a disproportionate increase in soil CO 2 efflux. For example, a large-scale litter manipulation experiment combined with carbon isotope measurements found that the efflux of CO 2 derived from soil organic carbon was significantly increased by the addition of litter (Sayer et al 2011).
Our study provides a new global prediction of aboveground litterfall production and offers an opportunity to bridge the gap between sparse data and correct parameterization of future ecosystem models. Undoubtedly, ecosystem models must attempt to better characterize the spatial and temporal heterogeneity of ecosystem processes and pursue further validation against observations (Baldocchi et al 1996, Friend et al 2007, Yuan et al 2010. To reliably simulate the dynamics of litterfall production, the models should accurately reproduce the relevant key ecosystem processes, namely vegetation primary production, carbon allocation, and turnover rate (Bonan et al 2013, Hararuk et al 2014. Based on the RF model, our predictions will be useful as a benchmark for evaluating ecosystem models.

Conclusions
The magnitude of litterfall production is a crucial estimate for projecting the terrestrial carbon budget. Based on a substantial number of field surveys, this study used a machine learning method (i.e. random forest, RF) to develop a data-driven model for predicting global aboveground litterfall production. The results show that the RF enables the adequate retrieval of the global pattern of aboveground litterfall production. The predicted global aboveground litterfall production was 23.15 Pg C yr −1 . Moreover, our study revealed substantial model differences regarding the aboveground litterfall production among five ecosystem models. Compared to the reference data at the global scale, most of models could reproduce the spatial patterns of aboveground litterfall production, but the magnitude of simulations differed substantially from the reference data. Overall, ORCHIDEE-MICT performed the best among the five ecosystem models. Our study thus provides an extensive and normalized model benchmark for aboveground litterfall production, which should be useful for advancing ecosystem models and their parameterization and validation.