Photosynthetic productivity and its efficiencies in ISIMIP2a biome models: benchmarking for impact assessment studies

Simulating vegetation photosynthetic productivity (or gross primary production, GPP) is a critical feature of the biome models used for impact assessments of climate change. We conducted a benchmarking of global GPP simulated by eight biome models participating in the second phase of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP2a) with four meteorological forcing datasets (30 simulations), using independent GPP estimates and recent satellite data of solar-induced chlorophyll fluorescence as a proxy of GPP. The simulated global terrestrial GPP ranged from 98 to 141 Pg C yr−1 (1981–2000 mean); considerable inter-model and inter-data differences were found. Major features of spatial distribution and seasonal change of GPP were captured by each model, showing good agreement with the benchmarking data. All simulations showed incremental trends of annual GPP, seasonal-cycle amplitude, radiation-use efficiency, and water-use efficiency, mainly caused by the CO2 fertilization effect. The incremental slopes were higher than those obtained by remote sensing studies, but comparable with those by recent atmospheric observation. Apparent differences were found in the relationship between GPP and incoming solar radiation, for which forcing data differed considerably. The simulated GPP trends co-varied with a vegetation structural parameter, leaf area index, at model-dependent strengths, implying the importance of constraining canopy properties. In terms of extreme events, GPP anomalies associated with a historical El Niño event and large volcanic eruption were not consistently simulated in the model experiments due to deficiencies in both forcing data and parameterized environmental responsiveness. Although the benchmarking demonstrated the overall advancement of contemporary biome models, further refinements are required, for example, for solar radiation data and vegetation canopy schemes.


Introduction
Photosynthetic productivity is a fundamental function of the terrestrial biosphere and is related to various ecosystem properties and services (Whittaker and Likens 1973). Gross primary production (GPP) is one of the largest carbon flows in the global carbon cycle and responds to changes in various environmental conditions such as light, temperature, humidity, nutrient, and ambient carbon dioxide (CO 2 ) conditions. Any changes in GPP can affect atmospheric CO 2 and thus the climate-carbon cycle feedback (e.g. Cox et al 2000).
Quantifying GPP under a changing environment is a challenging task, however, not only for modeling but also for observation. This issue is a serious limitation of present carbon cycle models, which have been used for future projection and impact assessments (Friedlingstein et al 2006, Arora et al 2013. Even with updated micrometeorological techniques such as the eddy-covariance method (Baldocchi et al 2001), GPP cannot be directly measured and must be estimated from net CO 2 flux using appropriate separation algorithms (Reichstein et al 2005). Beer et al (2010) scaled up field-based GPP data using a statistical model approach to produce a global map of GPP, although the assumptions used in the estimation can introduce certain biases (Wehr et al 2016). GPP has also been estimated by satellite remote sensing based on the relationship between canopy-absorbed solar radiation and GPP, using the light-use efficiency approach (e.g. Zhao et al 2005). Recent global GPP estimation studies have provided useful data to investigate spatial and temporal patterns of the terrestrial carbon budget (Ryu et al 2011, Koffi et al 2013, Yebra et al 2015, Zhang et al 2016a. Nevertheless, consistent global GPP values have not been attained (Baldocchi et al 2015). For example, an isotopic study by Welp et al (2011) implied that global GPP is much higher than the values obtained by flux and satellite studies. In contrast, an analysis by Ma et al (2015) implied that previous studies have overestimated global forest GPP because of an incorrect assumption of forest coverage change. Terrestrial ecosystem models have been successfully adopted in carbon cycle studies, but large uncertainties remain in their estimates (Anav et al 2015, Schwalm et al 2015. They adopt different leafand canopy-photosynthetic schemes, which respond diversely to environmental variability, and they use different forcing data, leading to serious uncertainties and preventing reliable risk assessments. In this study, we examined global terrestrial GPP estimated by eight biome models of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP). The chief aim of the second phase (ISIMIP2a) is the benchmarking of impact models focusing on inter annual variability (Rosenzweig et al 2017, Frieler et al in prep). To clarify the reliability and limitations of existing models, benchmarking (or validation) has become increasingly important in all research areas of modeling (Luo et al 2012, Kelley et al 2013. Indeed, Chang et al (2017) conducted a benchmarking of ISIMIP2a biome models with respect to the net ecosystem carbon budget, focusing on the impact of El Niño and Southern Oscillation (ENSO) events. In this paper, we examine consistency in major aspects of (mainly global annual ones for brevity) of GPP estimated using different biome models and forcing data thorough comparisons with independent estimates and observational data.

Biome models and simulations
In ISIMIP2a, to examine model responsiveness and limitations in forcing data, we used four meteorological forcing datasets for the historical period (Sheffield et al 2006, Weedon et al 2014: Global Soil Wetness Project 3 (GSWP3, 1901(GSWP3, -2010, Water and Global Change (WATCH, 1901(WATCH, -2001, WATCH Forcing Data with ERA-Interim (WFDEI, 1901(WFDEI, -2010, and Princeton's Global Meteorological Forcing Dataset (Princeton, 1901(Princeton, -2012. These datasets provide daily solar radiation, temperature, precipitation, humidity, and wind conditions at a spatial resolution of 0.5°× 0.5°in latitude and longitude. They were produced from different reanalysis data and by different downscaling methods to cover the range of uncertainty in forcing data. De-trended 30-yr data were also provided for each dataset to conduct spin-up runs under a stationary condition. Historical changes in atmospheric CO 2 concentration were prescribed by combining data from Meinshausen et al (2011) through 2005and Dlugokencky and Tans (2014 from 2006 to 2013. See the ISI-MIP web page (www.isimip.org/) for the detailed simulation protocol and model description.
In the biome sector, eight models provided simulation output (table 1). These models differ in structure of the biogeochemical scheme and parameterization of vegetation dynamics, and thus in their responsiveness to environmental change. As revealed in the first phase of ISIMIP, the biome models produce substantially different projections of productivity, biomass, and soil carbon pool (Friend et al 2014, Nishina et al 2015. In terms of GPP calculation, the models differ in parameterizations of canopy radiation transfer, leaf phenology, and physiological limitation on photosynthetic capacity. Moreover, the model simulations differ in consideration of land-use change, with several models assuming fixed land use throughout the simulation period.

Benchmarking datasets
We used two global GPP data-driven products, which are independent of the ISIMIP simulations, for benchmarking of annual and monthly GPP. First, we adopted the satellite-derived GPP product based on Moderate Resolution Imaging Spectroradiometer (MODIS; Zhao et al 2005) fAPAR and a light-use efficiency model from 2000 to 2012. The original data, which were in GeoTIFF format with a 5 0 -mesh grid, were averaged and converted into 0.5°-mesh. Second, we adopted up-scaled flux measurement data (Beer et al 2010) from 1982 to 2011; the original spatial resolution of the data was 0.5°. The data were produced using site-based flux measurement data, climate data, SeaWiFS vegetation index, and a regression-model ensemble algorithm (Jung et al 2009). These data were used for benchmarking of GPP by gridbased linear regression. Note that the CARBONES GPP dataset (Kuppel et al 2014), although adopted by other studies (e.g. Anav et al 2015), was not used in this study, because the dataset is a hybrid of observational data and the ORCHIDEE simulation.
Simulated GPP values were also correlated with solar-induced fluorescence (SIF; mW m −2 nm −1 sr −1 ) data at the far-red peak (wavelength 737 nm) measured by the Global Ozone Monitoring Instrument 2 (GOME-2; Joiner et al 2013). SIF is expected to be closely correlated with biophysical and biochemical properties and processes of photosynthesis (e.g. quantum yield; Genty et al 1989). In particular, SIF from satellites tends to capture vegetation activity under clear sky conditions, in which photosynthesis and fluorescence respond in a similar manner (Porcar-Castell et al 2014). Therefore, observed SIF data have been used for model benchmarking as a proxy of GPP (Guanter et al 2014, Zhang et al 2016b. However, SIF is not completely linked with photosynthetic biophysical processes, and the data contain intrinsic noises and biases due to low signal levels. The GOME-2 SIF data are, even at a monthly basis, a bit noisy for grid-based comparison at the original 1°-mesh resolution, so they were used after aggregation into a 5°-mesh during the period from 2007 to 2010.

Analyses and metrics
We obtained data of 30 simulations (table 1) and structured the comparisons in terms of forcing data and biome models. Basic metrics such as mean, standard deviation, and coefficient of variation (standard deviation divided by the mean) were examined first. Metrics specific to biospheric studies such as long-term mean, seasonal change, interannual change, and linear trend of GPP were also examined. To separate the variabilities caused by forcing data and biome models, two-way analysis of variance (ANOVA) was conducted (R 3.4.0; R Core Team 2017) for the 30 global annual mean GPP estimates in 1981-2000. For grid-based benchmarking, the mean annual GPP in 2001-2010 for each grid was correlated with those estimated by using MODIS and flux up-scaling data. We also performed correlation analysis on the 5°-averaged GPP values and GOME-2 SIF data to assess their relationship.
Several metrics related to GPP properties were examined. Seasonal-cycle amplitude (SCA) of GPP, which reflects vegetation activity and affects atmospheric CO 2 (Graven et al 2013, Wenzel et al 2016, was defined as the difference between the maximum and minimum monthly values for each year. Resource-use efficiencies of GPP are expected to provide insights into underlying mechanisms and limiting factors of photosynthetic production. Based on the incoming solar radiation data and simulated GPP and evapotranspiration, we calculated radiationuse efficiency (RUE) and water-use efficiency (WUE) as follows: and WUE ¼ GPP=TR ð2Þ where DSR is downward short wave radiation at the land surface and TR is vegetation transpiration. In general, photosynthetically active radiation (PAR) accounts for about 48% of DSR (McCree 1972), and RUE as defined by equation (1) may be easily converted into a PAR-based value. For the DLEM, JULES, and VEGAS models, WUE was calculated using total evapotranspiration rates, because TR data files were not supplied for these models. Therefore, in this study, inter-model comparisons were made only for relative change from the 1996-2000 annual average. Note also that RUE could be defined using solar radiation absorbed by the canopy (e.g. Ruimy et al 1999), which is inferred from leaf area index and the canopy attenuation coefficient (typically 0.5) of Lambert-Beer's law. However, because a few models do not supply leaf area index data, we used equation (1), which is applicable to all model runs.

Results and discussion
3.1. Global terrestrial GPP by biome model Global terrestrial GPP simulated by the biome models using the four forcing datasets during the common overlapping period of 1981-2000 was 115.7 ± 11.0 Pg C yr −1 (mean ± standard deviation of all simulations), ranging from 98.4 to 141.2 Pg C yr −1 (table 2). Although this may seem to be a broad range, most values fell within the range of previously reported values (table 3) As shown by the ANOVA result (table 4), the variability in the estimated global GPP was primarily attributable to inter-model variability (84% of total); inter-data variability and residual (interaction term) accounted for 4.8% and 11%, respectively. Apparent inter-model differences were found, ranging from 100.1 ± 1.8 Pg C yr −1 for JULES to 133.1 ± 6.2 Pg C yr −1 for CARAIB. In contrast, model-ensemble GPP was not significantly different among the four forcing datasets, ranging from 112.5 ± 9.2 Pg C yr −1 for WATCH to 116.3 ± 9.9 Pg C yr −1 for WFDEI (table 2). In terms of differences among the forcing datasets, DLEM showed the smallest variability (2.9 Pg C yr −1 ) and VISIT the largest variability (20.0 Pg C yr −1 ) between the maximum and minimum values. Few common patterns were found in the dependence on forcing data among the biome models. Four biome models (DLEM, LPJ-GUESS, LPJmL, and ORCHIDEE) simulated the highest GPP when using the Princeton forcing data, and four models (CARAIB, JULES, ORCHIDEE, and VEGAS) simulated the lowest GPP when using the GSWP3 forcing data. For three datasets, CARAIB estimated the highest GPP among the models. Different, inconsistent patterns were found in other cases.

Trends in GPP
The simulated global terrestrial GPP showed similar interannual variability, irrespective of biome model and forcing data. The average slope of the linear trend was estimated as 0.30 ± 0.07% yr −1 or 0.35 ± 0.09 Pg C yr −2 , and differed among simulations: from the shallowest slope in VEGAS driven by GSWP3 data (0.19% yr −1 ) to the steepest slope in LPJ-GUESS driven by WATCH data (0.41% yr −1 ). These GPP slopes were steeper than those of the flux up-scaling (0.14% yr −1 , 1982-2000; 0.08% yr −1 , 1982-2011) and satellite observation (0.035% yr −1 , 2000-2012). Kolby Smith et al (2015) found similarly higher incremental trends of terrestrial net primary productivity estimated by the CMIP5 models: 7.6 ± 1.67% in 1982-2011 (i.e. 0.25 ± 0.06% yr −1 ). They implied that the present models overestimate the vegetation response to elevated CO 2 due to insufficiencies in model parameterization such as nitrogen limitation (Zaehle et al 2014, Schimel et al 2015. In contrast, a recent study (Campbell et al 2017) of long-term atmospheric observation of carbonyl sulfide showed that terrestrial GPP has increased at a higher rate than previously thought before: 31 ± 5% in the 20th century (assuming linearity, 0.31 ± 0.05% yr −1 ). While their results show higher incremental trends in the late 20th century, the mean incremental trend by ISIMIP2a models is comparable with the atmospheric observation-based estimate. Because inconsistency remains in observational evidence, model studies as conducted here are important to perform sensitivity analyses and to resolve underlying mechanisms.
The SCA of GPP also showed incremental trends ( figure 1(b)), on average 0.20 ± 0.11% yr −1 , implying that temperate and boreal vegetation contributed at least partly to the GPP increase and that vegetation phenology (e.g. growing season length) could have been affected in these regions. The estimated SCA incremental trends spans from −0.018% yr −1 in LPJmL driven by Princeton to 0.42% yr −1 in CARAIB driven by WFDEI (see table 2 for data-ensemble results). Such incremental trends and inter-model variability have also been found in previous studies Zeng 2014, Ito et al 2016) and are consistent with satellite-observed greening of northern vegetation (Zhu et al 2016). Among the GPP-related properties, the SCA trends showed the largest variability among the simulations. The average RUE Environ. Res. Lett. 12 (2017) 085001 of GPP was calculated as 0.156 ± 0.015 g C MJ −1 in the simulations, and its mean trend was calculated as 0.28 ± 0.06% yr −1 (figure 1(c)). None of the forcing data showed a clear decadal trend such as global dimming and brightening (Wild 2012), so the incremental trend in RUE was primarily caused by the GPP increase. Simulated global transpiration (data not shown) indicated moderate incremental trends: 0.088 ± 0.037% yr −1 . This transpiration increment was mainly related to vegetation activity (e.g. leaf area expansion), while land precipitation did not show a clear decadal trend. Consequently, estimated WUE showed clear incremental trends (figure 1(d)): 0.20 ± 0.076% yr −1 . Among the biome models, the WUE trend spanned from 0.06% yr −1 for VEGAS to 0.287% yr −1 for LPJ-GUESS. The incremental trend of WUE was largely attributable to the increase of GPP. Both experimental and theoretical studies suggest that water loss by transpiration does not increase quantitatively in parallel with photosynthetic carbon assimilation due to stomatal regulation of gas exchange (Medlyn et al 2001, Bonan et al 2014. Also, observational and modeling studies indicated that biospheric WUE has increased, mainly as a result of elevated CO 2 (Keenan et al 2013, Xue et al 2015, which enhances photosynthesis but restricts transpiration by stomatal closure.

Meteorological variability and GPP
To clarify the characteristics of environmental responsiveness, which should account for inter-model variability, the estimated GPP values were correlated with temperature, precipitation, solar radiation, and CO 2 conditions for 1981-2000. For temperature, the inter annual variability was comparable among the four forcing datasets ( figure 2(a)). The estimated GPP responded similarly to temperature variability irrespective of forcing data and biome models, with values ranging from 3.1 Pg C yr −1 K −1 (ORCHIDEE driven by WFDEI) to 6.1 Pg C yr −1 K −1 (DLEM driven by WATCH). For precipitation, a moderate difference was found among the forcing data, from 772 kg H 2 O m −2 yr −1 (Princeton) to 821 kg H 2 O m −2 yr −1 (WATCH; figure 2(b)). The biome models showed comparable positive responsiveness to precipitation, spanning from 0.064 Pg C yr −1 (kg H 2 O m −2 yr −1 ) −1 (JULES driven by Princeton) to 0.117 Pg C yr −1 (kg H 2 O m −2 yr −1 ) −1 (VISIT driven by GSWP3). For solar Environ. Res. Lett. 12 (2017) 085001 radiation, a conspicuous difference was found among the forcing data, spanning from 172 W m −2 (WATCH) to 190 W m −2 (Princeton). These forcing data also differed in decadal trends and interannual variability. For example, only the Princeton data showeda clear incremental trend (+0.14 W m −2 yr −1 ) with large interannual variability, whereas the others showed small negative trends. Remarkably, the biome models responded differently to solar radiation among the forcing data (figure 2(c)); the estimated GPP only responded positively to solar radiation in Princetondriven simulations. The high solar radiation in the Princeton dataset may be associated with the high GPP of five models (CARAIB, DLEM, LPJ-GUESS, LPJmL, and ORCHIDEE; table 2). Because solar radiation is important not only for GPP estimation but also for Earth's energy budget (Kiehl and Trenberth 1997), improvement of the solar radiation dataset is of great significance.

Benchmarking
The spatial distribution of GPP simulated by different biome models agreed for overall continental-scale patterns (supplementary figures S1 and S2, available at stacks.iop.org/ERL/12/085001/mmedia), such as gradients from high GPP in equatorial rain forests to low GPP at high latitudes and in deserts. However, intermodel differences were evident regionally at subcontinental scales. For example, JULES showed a sharp transition of GPP from tropical forests to surrounding rangelands, whereas LPJ-GUESS and VEGAS showed more gradual transitions. The spatial distribution of the coefficient of variation of simulated GPP values (figure S2(c)) showed higher variability in low-GPP areas such as arid regions (e.g. central Eurasia, North Africa, and Australia) and Arctic tundra. Among the models, LPJmL showed the highest coefficients of determination (R 2 ≈ 0.92) with the reference data (figure 3); others also showed R 2 values of 0.8 or higher. The benchmarking of annual GPP using independent datasets confirmed that the biome models could capture the major global patterns of photosynthetic productivity as obtained by upscaling and remote sensing approaches. Along with advancements in the flux measurement dataset and machine-learning algorithms, further useful fieldbased datasets (e.g. Jung et al 2017) will become available for model benchmarking.
The relationship between simulated GPP and satellite-observed SIF data (figure 4) seems reasonable, considering the noisiness of the data. In most biome models, simulated GPP was linearly related   Zhang et al (2016c) used SIF data to obtain an ensemble GPP by weighting estimates by multiple models. We confirmed that monthly GPP simulated by the biome models shows comparable seasonal change with that in SIF (figure S3). In tropical areas, however, the seasonal signal was very weak, and such correspondence was unclear ( figure S3(c)), implying that several models have difficulty with the environmental responsiveness of tropical ecosystems. In accordance with its effectiveness as a proxy of photosynthetic activity, more and higher-quality SIF data will soon be provided by satellites, including the Greenhouse gas Observation Satellite (GOSAT; Frankenberg et al 2011) of Japan and the Fluorescence Explorer (FLEX) of the European Space Agency. We expect a considerable increase in usage of SIF data by terrestrial vegetation studies in the coming years. The benchmarking of global terrestrial GPP revealed different characteristics of the participant models. For example, LPJmL and ORCHIDEE captured long-term mean GPP with higher agreement with observational data than the others. DLEM showed better agreement with SIF data and smaller inter-data variability. VEGAS showed slightly lower GPP and smaller incremental trends, whereas . Global relationship between sun-induced chlorophyll fluorescence (SIF) from GOME-2 and gross primary production (GPP) simulated by the eight biome models (a-h), during the period 2007-2010 using the GSWP3 forcing dataset. Black lines show linear regression (R 2 in parentheses). Both SIF and GPP data were aggregated into 5°-mesh to reduce the influence of observational noise; bars show the standard deviation within the 5°-grids. Red dots show tropical areas (25°N-25°S), and blue dots show temperate and borealareas.
Environ. Res. Lett. 12 (2017) 085001 CARAIB showed relatively higher GPP and larger trends. LPJ-GUESS captured intermediate mean annual GPP and larger incremental trends. JULES and VISIT showed lower or higher total GPP but captured intermediate incremental trends. Therefore, we should be aware that the current biome models have their own pros and cons, and we need caution in specifying which model works best for a given purpose (e.g. analysis of temporal variability, spatial mapping, and biogeochemical process study). Better understanding of such model characteristics is useful for interpret impact simulation results.

Impacts of extreme events
In the context of model benchmarking and global change study, extreme events are gathering increasing attentions. Here we focused on two extreme events: the huge eruption of Mt. Pinatubo in 1991 (McCormick et al 1995) and the strong ENSO in 1997-1998. Simulated global GPP markedly decreased after the Mt. Pinatubo eruption irrespective of biome model and forcing dataset ( figure 5(a)). Model-mean GPP anomalies in 1992 differed among the forcing datasets, ranging from −1.51 Pg C yr −1 for WFDEI cases to −2.25 Pg C yr −1 for GSWP3 cases. In contrast, observational evidence suggests that the terrestrial biosphere could absorb more carbon, which would account for the decline in the atmospheric CO 2 growth rate observed after the eruption (Keeling et al 1995).
The time-series of solar radiation in the forcing data showed few volcanic signals after the eruption event (data not shown): only the WATCH dataset shows a clear decline in solar radiation by (about 2 Wm −2 ) in 1992. Moreover, the forcing datasets did not contain a variable for the diffuse component of solar radiation. Several studies revealed the importance of diffuse radiation for the increased canopy radiation absorp-tion and carbon assimilation (Roderick et al 2001, Gu et al 2002, Kanniah et al 2013, which would explain the low CO 2 growth rate after the huge volcanic eruption. By appropriately including solar radiation factors, several model studies have successfully simulated the change in terrestrial carbon budget after the volcanic eruption (Lucht et al 2002, Frölicher et al 2013. In the present study and other multimodel studies (e.g. Le Quéré et al 2015), however, most biome models were unsuccessful in simulating the anomalous GPP increase after the eruption. Many studies have assessed the impacts of ENSO events on the terrestrial carbon budget, which affects the atmospheric CO 2 growth rate (e.g. Zeng et al 2005, Betts et al 2016. The ENSO event in 1997-1998 was the strongest one in the simulation period, and was accompanied by a high atmospheric CO 2 growth rate. In 1998, the warmest year in the 20th century, most models estimated negative GPP anomalies in the tropics ( figure 5(b)). In the cases of LPJml driven by WFDEI and JULES driven by Princeton, tropical GPP dropped down by as much as 4 Pg C yr −1 as compared with other years. In contrast, several models responded only moderately to the ENSO event. The warmth in the ENSO period could lead to a longer growing period for temperate and boreal vegetation, leading to a positive GPP anomaly offsetting the negative tropical one. Also, the high atmospheric CO 2 growth rate in the ENSO period could be attributable partly to increased respiratory CO 2 emission. As implied by figure 2, the interplay of multiple meteorological factors (i.e. radiation, temperature, and precipitation) makes it difficult to isolate the GPP response to specific events, even an extreme one. As a result, the present models have been insufficiently constrained and calibrated in terms of responsiveness to the eruption event and similar extreme environmental variability.  -4 -6 1990 1991 1992 1993 1994 1990 1991 1992 1993 1994 1990 1991 1992 1993 1994 1990 1991 1992 1993 1994 Year Year  1996 1997 1998 1999 2000 1996 1997 1998 1999 2000 1996 1997 1998 1999 2000 1996 1997 1998 1999 2000 ENSO (h) WFDEI Environ. Res. Lett. 12 (2017) 085001 3.6. Mechanistic findings from benchmarking The incremental GPP trends (figure 2(a)) could be attributable to several mechanisms: (1) enhancement of photosynthetic capacity, (2) elongation of growing period, (3) expansion of canopy leaf area, and (4) land-use and land-cover change. As attempted by Xia et al (2015), factors related to leaf area (phenology and vegetation structure) and photosynthetic capacity (physiology) could be combined using an appropriate metric. The simulated GPP trend occurred heterogeneously over the land area. At a global scale, it is well correlated with the trend of mean leaf area index (LAI) trend in the different biome models and forcing datasets (figure 6). Note that several models assumed a stationary land-use and plant functional type distribution but showed a similarly positive relationship. Therefore, the incremental GPP trend was at least partly attributable to LAI expansion, which in turn, has several causes. Apparently, the CO 2 fertilization effect played an important role, but observational evidences provide inconsistent implications for the model sensitivity as noted previously (Kolby Smith et al 2015, Campbell et al 2017. We need to seek additional data and metrics to reduce the uncertainty; for example, Wenzel et al (2016) implied that SCA is a useful metric to constrain the simulated GPP trend. Spatial variability in the simulated GPP trends (figure S4) has additional mechanistic implications. For each forcing dataset, positive trends occurred from tropical to boreal regions, especially in North America, Eurasia, and eastern Australia. In contrast, consistently negative trends occurred in Central America, a part of South America, and eastern Central Africa. The simulated regional GPP showed different temporal patterns in response to meteorological regimes ( figure  S5). For example, it is remarkable that the incremental trend continued in Asia, Africa, Europe, and North America after 2000 but almost completely disappeared in Oceania and South America. This period of the 2000s is known as the hiatus of global warming, leading to an unexpected depression of the atmospheric CO 2 growth rate probably due to suppressed ecosystem respiration (Keenan et al 2016, Ballantyne et al 2017. A regional analysis of GPP may provide a supporting evidence for the recent perturbations in the global carbon cycle, although further in-depth analyses are required.
Another implication for biome models and meteorological data is related to the solar radiation used in global simulation analyses. Although solar radiation is the ultimate driver of photosynthesis, this benchmarking study suggests that it was underrepresented in forcing data and that biome models are insufficiently constrained to capture responses to solar radiation variability (figure 2(c)). Solar radiation is acknowledged to be highly heterogeneous over the land surface and therefore difficult to accurately quantify at broad scales. Also, atmosphere-ecosystem interaction studies have focused more heavily on other environmental factors such as temperature, precipitation, ambient CO 2 , and nutrients. However, as shown by the poor simulation results for the GPP anomaly after the Mt. Pinatubo eruption, we need more accurate solar radiation data and further refinement and constraints on modeled radiation use by terrestrial vegetation. This task is necessary not only for refining historical simulations but also for improving the reliability of future projections, including the increase of industrial aerosol emissions and deployment of solar radiation management (Mercado et al 2009, Caldeira et al 2013, Ito et al 2017.

Concluding remarks
We conducted a benchmarking of eight biome models and confirmed that these current biome models can approximately capture basic features such as the spatial distribution of productivity (i.e. capability of CO 2 fixation). Such benchmarking of model-estimated GPP has been conducted by previous model inter comparison studies (table 3), and this ISI-MIP study provided several new findings. By using multiple meteorological forcing datasets, it revealed contributions of inter-model (primary) and inter-data (secondary) variabilities. Focusing on recent decades, it was possible to assess model responses to several wellknown extreme events. Importantly, the results of the biome sector could be combined with other sectors such as water and agriculture, allowing us to conduct more human-relevant analyses. Recent advancements in instruments and data-analysis studies are providing an increasing amount of data for benchmarking, allowing us to examine the validity and limitations of models in more detail. Although several benchmarking studies of biome models have been conducted, this study provided some new insights. First, this study provided results of WUE and RUE and the seasonalcycle amplitude of GPP, allowing us to address the underlying mechanisms of the historical changes. Second, our use of multiple forcing datasets permitted us to account for a wider range of estimation uncertainty than in previous studies. The comparison emphasized the large discrepancy in solar radiation among the forcing datasets. Third, the use of flux upscaled GPP and satellite-derived SIF data provided more insights than a simple comparison between models. These findings imply that we need to improve parameterization of GPP, for example, in terms of radiation responsiveness. Although further observation and modeling studies are required, this type of benchmarking study helps to refine biome models and improve our confidence in future projections (e.g. +1.5 and +2°C impact assessments for the Paris Agreement, conducted as the next phase of ISIMIP, i.e. ISIMIP2b) using these models.