Divergence in land surface modeling: linking spread to structure

Divergence in land carbon cycle simulation is persistent and widespread. Regardless of model intercomparison project, results from individual models diverge significantly from each other and, in consequence, from reference datasets. Here we link model spread to structure using a 15-member ensemble of land surface models from the Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP) as a test case. Our analysis uses functional benchmarks and model structure as predicted by model skill in a machine learning framework to isolate discrete aspects of model structure associated with divergence. We also quantify how initial conditions prejudice present-day model outcomes after centennial-scale transient simulations. Overall, the functional benchmark and machine learning exercises emphasize the importance of ecosystem structure in correctly simulating carbon and water cycling, highlight uncertainties in the structure of carbon pools, and advise against hard parametric limits on ecosystem function. We also find that initial conditions explain 90% of variation in global satellite-era values—initial conditions largely predetermine transient endpoints, historical environmental change notwithstanding. As MsTMIP prescribes forcing data and spin-up protocol, the range in initial conditions and high levels of predetermination are also structural. Our results suggest that methodological tools linking divergence to discrete aspects of model structure would complement current community best practices in model development.


Introduction
We define divergence as the spread in output from multiple models or, equivalently, the spread in the difference between model outputs and an observational constraint. Results from offline land surface simulations and fully- Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence.
Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. generations has not acted to reduce divergence or increase skill Sedláček, 2013, Wuebbles et al 2014). As models are, by definition, approximations of a set of physical and biogeochemical processes, inter-model spread must reflect choices made in this approximation. Such choices include which processes are represented (e.g., presence versus absence of carbon-nitrogen coupling; Huntzinger et al 2014), how they are coded mathematically (e.g., light use efficiency versus enzyme kinetics for photosynthesis; Wang et al 2011), and parameterizations used (Mendoza et al 2015). Divergence however also depends on spatiotemporal resolution (Schwalm et al 2010, forcing data such as precipitation (Samaniego et al 2017) and boundary conditions such as land cover history (Jain and Yang 2005). This complicates isolating useful approximations and therefore more correct model representations .
Reducing divergence across models and identifying appropriate representations are however highly desirable in Earth system modeling-both to ensure potential and realized predictability are commensurate (Luo et al 2015) and to improve the quality of predictions and projections under anticipated global environmental change. As model improvement assumes better agreement with observed values, resolving model divergence requires validation. That is, a model formulation is useful if it matches a reference set of observations within some tolerance . Here we link model spread to model structure by moving beyond point-based benchmarking, e.g., calculating the distance between simulated and observed values. Instead, we apply three analytical approaches to link model-data mismatch to its source. First, we use functional benchmarks to help localize model subroutines that contribute to mismatch. Second, we use information on model structure to predict model skill in a machine learning framework. Third, we quantify how initial conditions prejudice model outcomes after centennial-scale transient simulations.
We demonstrate these three approaches using the Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP; Huntzinger et al 2013Huntzinger et al , 2017, a 15-member model ensemble of standardized simulations and their outputs, as our analysis test bed. MsTMIP is focused on carbon and water cycling in land surface models-the land component of ESMs-and is based on a constrained model protocol that prescribes spatiotemporal resolution, forcing data, boundary conditions, and spin-up procedures. In additional, divergence seen in previous MIPs (model-intercomparison projects) is present in MsTMIP. For example, global mean satellite-era gross primary productivity (GPP) varies 2-fold from 91 to 185 PgC per annum across the ensemble relative to a benchmark value based on upscaled FLUXNET data of 117 PgC per annum. Using individual eddy covariance towers and corresponding model grid cells reveals a similar 2-fold range (2.2 to 4.4 gC/m 2 /d) in GPP, relative to the benchmark value of 3.3 gC/m 2 /d. Metrics of ecosystem structure vary wider still-global mean satellite-era leaf area index (LAI) ranges from 1.4 to 4.1 m 2 m −2 relative to a AVHRR benchmark value of 1.5 m 2 m −2 . Overall, the MsMTIP ensemble provides the correct 'model space' to link skill to structure as it excludes confounding factors while preserving inter-model spread.

Benchmarking
Even though our emphasis herein is on addressing why models diverge we still need to quantity model-data mismatch to calculate divergence, model skill, and functional benchmarks. Here we use ILAMB (International Land Model Benchmarking) (https://ilamb.ornl.gov/doc/; Collier et al 2018), a generic benchmarking framework based on a series of python widgets that allows for the standardized comparison of simulation output and reference datasets. ILAMB can also calculate functional benchmarks relating one variable to another such as GPP as a function of LAI. For this study we use the Permafrost Benchmark System (PBS) version of ILAMB (https://permamodel.github.io/pbs). This version is hosted on the Community Surface Dynamics Modeling System (CSDMS; https://csdms.colorado.edu) and removes the need for individual modelers to install ILAMB locally. Instead, through an ingest tool, simulation output is uploaded to a server host and ILAMB is executed server-side on a high-performance computing cluster controlled through a simple web interface.

Reference datasets
ILAMB contains a default set of monthly, point-based and gridded reference datasets. In this study, composite skill-used as the target variable in the skill-to-structure mapping exercise-is calculated globally for (1) four fluxes (evapotranspiration [ET] or latent heat, GPP, TER [total ecosystem respiration], and NEP [net ecosystem productivity]) using upscaled and tower-based (model grid cells containing the flux tower are used in the intercomparison) FLUXNET data, (2) LAI from AVHRR and MODIS satellite data, and (3) four mapped biomass reference products (Collier et al 2018). Composite skill for each variable ranges from zero (no agreement) to unity (perfect agreement with all reference datasets) and is based on a weighted combination across all skill metrics: correlation, bias, root mean square error, and phase shift (Collier et al 2018). For this study the default configuration of weights and score metrics is used.  (table 2). As an example, SiB3 lacks carbon pools and only 6 models (CLASS-CTEM-N, CLM4, CLM4VIC, ISAM, LPJ-wsl, and SiB3) simulate snow depth.

Skill-to-structure mapping
Linking skill to a discrete aspect of model structure is a data-driven exercise. Initially, model structure  is encoded as a set of indicator variables. These variables span all aspects of model structure and are grouped into four broad themes: carbon cycling, energy exchange, nitrogen cycling, and vegetation dynamics (supplementary tables 1-4). As an example, we use a vegetation dynamics indicator variable based on the presence or absence of a maximum value of LAI beyond which there is no allocation of biomass to leaves. As a second step, ILAMB is used to determine composite skill by variable (Collier et al 2018). For this exercise we use GPP, ET and LAI as in the functional benchmarking exercise as well as TER, NEP and total live biomass. Lastly, we predict skill using only structure. Here, the Random Forests algorithm is used to generate 10 000 individual decision trees for each MsTMIP output variable (n=6) separately to simultaneously predict composite skill for all models using the same set of presence/absence indicator variables from the full MsTMIP model ensemble. The variance explained ranges from 69% to 96%: NEP, 69%; GPP, 78%; LAI, 79%; TER, 81%; ET, 94% and biomass, 96%, i.e., the structural determinants of skill are well-captured. We also calculate the gain in skill for each structural indicator variable in the topmost position-the initial splitting variable-across all 10 000 decision trees for each MsTMIP variable separately. Gain quantifies skill improvement based on structural choice. For this we first navigate each decision tree from top to bottom to maximize composite skill at each decision point through to the terminal node. Second, we calculate the difference in skill-the initial composite skill based on using the topmost structural indicator variable only is subtracted from the composite skill from the terminal node-across all 10 000 decision trees. Gain is then the mean skill difference across all decision trees by MsTMIP variable, is always positive, and is expressed in units of standard deviation of skill across the MsTMIP ensemble; analogous to a z-score transformation. Before calculating gain, variable importance is used to select relevant model structural attributes. Variable importance quantifies how much predictive power each predictor variable has, i.e., serves as an indicator for the overall impact of a predictor on composite skill. The permutation variable importance measure used herein quantifies the loss in skill, the algorithm's ability to predict composite score based on structural indicator variables, by randomly permuting the values of a single predictor variable and comparing that to the unpermuted version. We use this metric to filter structural indicator variables not useful in understanding how structure impacts composite skill, i.e., those with negative values are excluded from further analysis (Janitza et al 2018). Using variable importance scores as a filter and then calculating gain for topmost splitting variables allows us to identify which structural attributes are most relevant to divergence and to quantify this dependence.

Initial conditions
To further explore divergence we examine initial conditions (X intital ) relative to satellite-era conditions (X transient ), where X refers to a given variable and the subscripts denote initial conditions and those after 110 years of transient forcings, respectively. X intital is taken from the first 30 years (nominally 1901-1930)

Functional benchmarking
For functional benchmarking we highlight GPP, LAI, and latent heat. GPP represents the dominant input of carbon into the terrestrial carbon cycle. If a model does not correctly simulate GPP, it cannot correctly simulate biomass and respiration. All models in this study simulate GPP for a single leaf and scale this to the entire canopy using LAI, defined as the ratio of leaf area to ground area. For GPP as a function of LAI (figure 1), reference data supports a 5 m 2 /m 2 upper limit for LAI while modeled values reach 15 m 2 m −2 . Both models and observations show a near linear response. All models however underestimate GPP as a function of LAI, i.e., the benchmark functional response serves as an upper limit for the models. This is supported by the ratio of GPP to LAI (ratio of totals) of 1.8 gC/m 2 /d for observations relative to the lower modeled values of 0.6 to 1.7 gC/m 2 /d (−59 to −2% difference with a mean difference of −31%). This could result from an underestimation in either simulated stomatal conductance or leaf-to-canopy scaling. Both GPP and latent heat flux depend very strongly on simulated stomatal conductance. Looking further, we see that the models-similar to GPP versus LAI-also underestimate latent heat as a function of LAI (differences range from −66 to 0% with a mean of −27%). In contrast, the models all roughly get the right ratio of GPP to latent heat flux; the percent difference ranges from −74 to +26% with 7 of 9 models between −26 and +20% and a mean difference of +13%. This suggests that the models are better able to simulate stomatal conductance, but underestimate leaf-to-canopy scaling and overestimate LAI.

Model structure to predict model skill
Using the Random Forests machine learning algorithm, we find several model structural characteristics that serve as important controls of skill for multiple MsTMIP variables. For instance, the absence of a carbohydrate reserve pool is associated with higher skill for GPP, LAI, and TER (figure 2) while yielding an average skill gain of 0.75σ, where σ is the standard deviation of skill across the full ensemble. In contrast, while the presence of a below ground litter pool is associated with above average skill for LAI, the absence of this same structural characteristic is linked to skill gains for NEP. A similar pattern is also present for the below ground sapwood carbon pool. More broadly, the analysis is does not reveal an optimal structure of carbon pools bur rather suggests a trade-off between carbon pool structure-or allocation heuristics-and skill by variable within a given model. A second general tendency is that the absence of thresholds is associated with higher skill. As an example, ET skill increases by 0.88σ when not considering nitrogen limitation on GPP. Similarly, higher skill in GPP and TER is linked to not having autotrophic respiration limited by nitrogen availability. Finally, the absence of a non-zero threshold light level for GPP (figure 2)-GPP may occur at the lowest levels of insolation as opposed to a parameterized threshold-is associated with higher skill for both GPP and TER (0.65σ). The otherwise largest gains in skill are achieved by (1) not using whole canopy stomatal conductance parameters to better simulate LAI (0.75σ), (2) not having LAI dynamically calculated for higher skill in ET (0.97σ), and (3) not prescribing leaf albedo for TER and NEP (0.89σ and 0.94σ respectively). These recommendations all highlight the skillrelevance of correctly simulating LAI and leaf-to-canopy scaling as revealed by functional relationships.
Overall, the skill-to-structure mapping exercise emphasizes the importance of ecosystem structure in correctly simulating carbon and water cycling, highlights uncertainties in the structure of carbon pools, and advises against hard parametric limits on ecosystem function. This latter point does not suggest that hard limits, such as cell denaturation at critical temperatures, are wrong, merely that constraining the shape of an idealized response function limits skill and/or that the parameters controlling these hinges are misspecified or unnecessarily invariant across enviroclimatic space (Mendoza et al 2014).

Initial conditions as conditional endpoints
Beyond functional benchmarking and machine learning we find that divergence is embedded a priori in all MsTMIP variables due to initial conditions. As an example, soil carbon is known to be particularly 'sticky' and predetermined by initial pool size, i.e., levels of soil carbon before transient forcings persist with minimal change (Exbrayat et al 2014;Todd-Brown et al 2013). Across MsTMIP (figure 3) initial conditions explain, in a leastsquares sense, 90% (median value; table 2) of variation in transient endpoints. Alternatively, 110-years of MsTMIP transient forcings (i.e., dynamic climate, land cover, CO 2 fertilization and nitrogen subsidy; see Methods) explain but one-tenth of variance in the Earth system globally whereas nine-tenths is based on initial conditions. This prejudice is not solely limited to global mean values but is also broadly uniformly present across the land surface at pixel scale as well. Spatially, initial conditions predetermine transient endpoints to the largest degree in the Northern Hemisphere (80% variance explained on average by pixel;  variability (biomass) and 0.62 for extreme behavior (ET). Maximum values are 0.86 for variability (GPP) and 0.95 for extreme behavior (soil carbon). This occurs over manifold variations in initial conditions, e.g., initial condition ET varies more than 4-fold from 0.68 to 2.99 mm/d. Snow depth and net ecosystem productivity Figure 3. Influence of initial conditions on transient simulation endpoints. Each panel shows, for a given variable (row-wise labels), satellite-era transient simulation (y-axis) versus initial condition (x-axis) global aggregate across all models. Individual models (circles; outlying models are labeled), 1:1 reference (line), and correlation (inset values) shown for each variable. Correlation excluding labeled outliers is 0.99 for ET and 0.56 for NEP. Variance explained, in a least-squares sense, is the square of correlation and ranges from 0.1% for ET with the SiB3 outlier to 96% for soil carbon. (NEP) show 12-fold variation ( figure 3). Across all variables, the departure from the 1:1 line for initial and satellite-era values is small regardless of the initial value which effectively serves to anchors transient run endpoints.

Discussion
Carbon cycle modeling continues to evolve, as does model evaluation and benchmarking (Collier et al 2018, Haughton et al 2018. Missing from these developments, however, is a set of tools that allow us to address why models diverge by mapping model skill to model structure. In other words, as a community, carbon cycling modeling needs a mechanism to attribute poor skill to identifiable aspects of model structure. Our study suggests a three-pronged approach to achieve this goal: (1) Emphasize functional benchmarking over comparing individual variables to reference values. Simulated point estimates are invariably different from any observational point estimate. Incorporating uncertainty in the comparison is useful as the best any model can do is to match observations within uncertainty (Schaefer et al 2012, Schwalm et al 2010. This also points to the limit of benchmarking because we cannot improve models until we improve observations. However, using confidence bounds does not provide a pathway to assess structure per se. As an example, if GPP at some level of spatiotemporal aggregation is 20% larger than observed (and this 20% is outside the uncertainty envelope of the observation) there is no insight into which model choice produced the mismatch. In the past, we've focused on adding new processes to improve skill, made practical by advances in computational frameworks and improvements in our knowledge base (Maslin and Austin, 2012, Stockmann et al 2013, Bailey et al 2018. However, the added complexity of including more physical and biological processes does not equate to reduced divergence or increased skill Sedláček, 2013, Wuebbles et al 2014). Functional benchmarking allows one to identify a specific process independent of model complexity. For MsTMIP, looking at GPP, LAI, and latent heat in isolation does not highlight leaf-to-canopy scaling.
(2) Encode metadata on model structure to allow data analytics. Apart from MsTMIP no large-scale MIP includes, as published metadata with a controlled vocabulary, a systemic survey or database of model structural characteristics. The MsTMIP case nonetheless offers scope for improvement as the structural metadata does not capture all information about a given model and, subsequently, the full ensemble undersamples the range in process representations (cf Annan et al 2011). A 15-model ensemble, as in MsTMIP, allows for 2 15 or 32,768 unique process representations using presence/absence coding. In this study, only 135 structural variables are inventoried which are then truncated to 69 (only 0.2% of the theoretical potential) after semantic duplicates are removed. Any extension in structure encoding need not be limited to presence/absence but should also include increases in ensemble size to better sample model structural space overall, ordered categorical variables (e.g., to traverse a complexity gradient of radiation transfer schemes) as well as a range in parameter values (e.g., to link within-parameter uncertainty to divergence, cf Zaehle and Friend 2010). Here it's important to note that MsTMIP structural attributes highlighted through functional benchmarking have key parameters, with emphasis on V cmax (unstressed Rubisco catalytic capacity) or J max (the maximum electron transport rate) for limits on GPP, that may potentially compensate leaf-to-canopy scaling (Schaefer et al 2012). More broadly and regardless of MIP size, model structural characteristics must be encoded in a form amenable to machine learning and traceability (Zhou et al 2018). This allows process representations to be linked to gradients of skill and thus provides a mechanism to isolate 'winners' from 'losers'. Process representations that are repeatedly associated with below-average skill levels across multiple MIPs merit consideration for possible deletion from the catalogue of model structural choice. Our results highlight the need for careful curation of metadata on model structure and ensemble broadness to maximize the discriminatory power of our datadriven approach. Model structure encoding combined with data analytics offers a heretofore underutilized approach to discriminate among thousands of model choice decisions and thus improve model reliability .
(3) Acknowledge and resolve how initial conditions prejudice transient endpoints. Initial conditions explain (median value;  (2010). Nitrogen subsidy increases by a factor of five; land cover and land use changes result in a 17% net loss in forest cover, a five-fold increase in grasslands, and a doubling of cropland extent (Wei et al 2014). The changes in forcing data over the simulation period do not however translate into large departures from initial conditions across MsTMIP simulation outputs. Superimposed on this is a large range in initial conditions themselves (figure 3), which are in turn solely attributable to structural differences due to the MsTMIP protocol that constrains forcing data, boundary conditions, and steady-state spin-up protocols.
The assumption of steady state results from a lack of knowledge on ecosystem state, especially for carbon pools (Carvalhais et al 2010). The most common approach is to assume human impacts in the preindustrial era (typically before 1750) were nonexistent and thus a steady-state equilibrium condition was the norm. Thus, models are fed thousands of trend-free randomized blocks of forcing data until the net change in, at least, carbon pools is zero over an arbitrary time period within some tolerance. When steady state is achieved the corresponding values for all simulated quantities form the catalogue of initial conditions. While mathematically tractable-and historically computational expensive (Xia et al 2012)-the basis for this assumption is inaccurate (Ruddiman 2003, 2007, Kaplan et al 2011, Lewis and Maslin, 2015, Ruddiman et al 2015. In the preindustrial Holocene alone humans caused a 9 ppm CO 2 increase (Ruddiman 2007). This is 10% of the change in CO 2 seen in MsTMIP from 1901 to 2010 but is at odds with equilibrium conditions prior to industrialization or the existence of a pre-anthropogenic baseline in the mid to late Holocene (Ruddiman, 2007) that underpins the steady-state modeling assumption. Furthermore, there is evidence that skill improves in the absence of steady state (Carvalhais et al 2008. Recent developments in semi-analytical solutions for steady state (Huang et al 2018Xia et al 2012, Luo et al 2017 treat transfers between carbon and nitrogen pools in a single unified matrix solution and greatly reduce spin-up time. This suggests a land carbon MIP centered on steady state, something missing from the landscape of existent and planned MIPs, is both needed and achievable. Such a MIP must include simulation experiments that include varying degrees of steady state relaxation (e.g., Carvalhais et al 2008Carvalhais et al , 2010, random initial or seed values (e.g., Hashimoto et al 2011) of carbon pools, and a gradient for the number of grid cells or sites that are required to achieve some threshold (usually no change over an arbitrary time period). Only such an effort can determine how steady state itself leads to divergence.
While these tools provide a means to link divergence to discrete aspects of model code they are not perfect. Using model skill conditions on the truthfulness of the reference datasets. Similarly, the linkage between skill and structure is data-driven and not aware of mechanistic interdependencies in a given model. Lastly, the predetermination of endpoints through initial states requires follow-on research, our analysis is necessarily descriptive. Nonetheless, the persistence of divergence in carbon cycle simulation over several generations of models suggests an urgent need for additional methodological approaches, as described herein, to inform model development.