Integrated assessment model diagnostics: key indicators and model evolution

Integrated assessment models (IAMs) are a prime tool for informing climate mitigation strategies. Diagnostic indicators that allow comparison across these models can help describe and explain differences in model projections, which increases transparency and comparability. Earlier, the IAM community developed an approach to diagnose models (Kriegler et al 2015 Technol. Forecast. Soc. Change 90 45–61). Here we build on this by proposing a selected set of well-defined indicators as a community standard, to systematically and routinely assess IAM behavior, similar to the metrics used in other modeling communities such as climate modeling. These indicators are the relative abatement index, emission reduction type index, inertia timescale, fossil fuel reduction, transformation index and cost per abatement value. We apply the approach to 17 IAMs, assessing both older versions and their latest versions, as applied in the IPCC 6th Assessment Report. The study shows that the approach can be easily applied and used to identify key differences between models and model versions. Moreover, we demonstrate that this comparison helps to link model behavior to model characteristics and assumptions. We show that, together, the six indicators provide a useful indication of the main traits of a model and can roughly indicate its general behavior. The results also show that there is often a considerable spread across the models. Interestingly, the diagnostic values often change between model versions, but there does not seem to be a distinct trend.


Introduction
Integrated assessment models (IAMs) are widely used for climate policy and climate change analysis (van Beek et al 2020). They offer the means to assess the linkages between long-term climate policy goals and near-term policy choices. They can also explore mitigation strategies while taking into account cross-sectoral, cross-regional and systems interactions (energy, land, economy, climate). As such, they form a key information source feeding into the climate change mitigation policy process, e.g. via IPCC Assessment Reports (ARs) (Halsnaes et al 2000, IPCC 2014). Within IAMs, a distinction can be made between cost-benefit IAMs (mostly highly stylized) and detailed process IAMs that are mostly used to explore different pathways to reach selected policy goals. The latter comprise a diverse group of models with different functional structures.
A thorough understanding of how IAM structure and assumptions affect IAM behavior is critically important for assessing IAM-based policy analysis and advice. For both policy makers and researchers, it can provide insights into why results differ between models and link projections to policy-relevant model assumptions and structure. The goal of diagnostic tools is to foster such understanding. Such tools can serve two key functions: (a) characterizing model behavior by use of stylized diagnostic experiments, and (b) relating model behavior patterns to model structure and input assumptions. We focus mostly on the first in this study, but aim to cover the second where possible. A subsequent function, which is beyond the scope of this study, is to qualify the model behavior and assess the models' policy applicability.
In other modeling disciplines, similar diagnostic tools have been developed. For instance, in climate research, diagnostic metrics have been applied to compare climate models and to evaluate their performance (Andrews et al 2012, Flato et al 2013, Eyring et al 2016). Such indicators include, for instance, climate sensitivity (indicating the temperature increase for a doubling of the CO2 concentration) and the transient climate response (indicating warming over a more limited time period). These tools are used not only to regularly compare models and thus qualify their behavior, but also in validation experiments, leading to an assessment of the quality of models for specific experiments and their evaluation over time.
The IAM community has also undertaken several model diagnostic activities in the past (Gaskins and Weyant 1993, Weyant 2004, 2010, van Vuuren et al 2009, Wilkerson et al 2015), resulting in the most recent and comprehensive diagnostic assessment by Kriegler et al (2015). Here, we propose an updated and expanded set of widely applicable, key diagnostic indicators to be used as a community standard. We determined these by revisiting the indicators of Kriegler et al (2015) and improving them in terms of precision, simplicity and completeness. In particular, we propose a novel, standardized approach to compare different model versions in order to assess and monitor model differences over time. The approach is analogous to climate model diagnostics in the sense that it is based on stylized scenarios with exogenous assumptions. It has been tested on 17 IAMs and 32 model versions, as part of two EU model development projects, ADVANCE (www.fp7-advance.eu/) and NAVIGATE (https://navigate-h2020.eu/), thus providing coverage of all main process-based IAMs (much higher than in preceding studies), including all latest model versions. Especially the latter is highly needed in light of the forthcoming AR6.
A standard set of diagnostics for the community has obvious advantages. It provides a tool to systematically and consistently assess model behavior in all future studies. Model diagnostic results can be part of model documentation that can be referenced and highlighted in papers. Future model-intercomparison projects could require participating models to regularly run the core set of diagnostics, in order to analyze the behavior of newly developed models or model versions. Ultimately, together with model documentation, this will lead to greater transparency and comprehensibility of IAM applications. It will also allow tracking the development of IAMs over time and possibly, in the future, confronting the outcomes with empirical information or information from other science disciplines.
An important innovation of the present study is the introduction of two diagnostic indicators in addition to the ones established by Kriegler et al (2015), namely inertia timescale (IT) and fossil fuel reduction (FFR). IT provides a measure of the models' level of inertia in response to the introduction of climate policy, a crucial determining factor in deep mitigation projections. FFR highlights the models' tendency to reduce fossil fuels as part of climate policy, a key element in model studies that examine the energy transition.
Here, we present the results for six key indicators, adding IT and FFR to the original set of indicators from Kriegler et al (2015): relative abatement index (RAI), carbon intensity over energy intensity (CoEI), transformation index (TI) and cost per abatement value (CAV). The indicators have been simplified to make them more suitable for use as a community standard, namely by focusing on one strong mitigation case and one benchmark year, 30 years in the future (here 2050, but later in post-2020 assessments). The latter allows for comparability with future diagnostic assessments. To ensure precision in the diagnostic results, we define single, unique values to indicate model behavior.
In section 2 (methods), we explain the study design and list the participating models. The results are split up into subsections for each of the indicators and conclude with an overview table that classifies all participating models. In section 4, we reflect on the research questions: Can these indicators be easily used as diagnostic tools for IAMs, including their development over time? And what insights do these tools provide?

Diagnostic experiments and indicators
The experiments described in this study form a small selection from a larger set of stylized, diagnostic scenarios that were originally developed as part of the EU FP7 ADVANCE project (www.fp7-advance.eu/). These are: Base (a zero carbon tax, i.e. a no-climate-policy baseline) and C80-gr5 (a run with an exponential carbon equivalent price growth of 5% per year starting in 2020 and a price level of 80 (2010)$/tCO2eq reached in 2040). C80-gr5 is used for each key indicator presented here. For two indicators (RAI and IT) extra scenarios were used, as will be explained in the next section. Note that the C80-gr5 scenario represents a 1.5-2 degree case in most models (see supplement S7 (available online at stacks.iop.org/ERL/16/054046/mmedia)), in line with the Paris Agreement's climate ambitions. This makes it a highly relevant showcase for assessing model behavior in frequently used deep mitigation scenarios. Preferably, model groups used SSP2, the middle-of-the-road socioeconomic baseline scenario (Riahi et al 2017), for all assumptions, including population and economic growth.
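The carbon price path of the default scenario can be sketched in a few lines of Python. This is a minimal illustration, not the scenario protocol itself: the function name and parameters are hypothetical, and the C30-gr5 variant is assumed here to follow the same shape with 30 $/tCO2eq reached in 2040.

```python
def carbon_price(year, anchor_price=80.0, anchor_year=2040,
                 growth=0.05, start_year=2020):
    """Carbon price in (2010)$/tCO2eq for a Cxx-gr5 style scenario.

    Assumption: the price grows exponentially at `growth` per year and
    passes through `anchor_price` in `anchor_year`; it is zero before
    `start_year`.
    """
    if year < start_year:
        return 0.0  # no carbon price before the tax is introduced
    return anchor_price * (1.0 + growth) ** (year - anchor_year)


# Default scenario (C80-gr5) and the assumed lower-price variant (C30-gr5)
for label, anchor in (("C80-gr5", 80.0), ("C30-gr5", 30.0)):
    path = {y: round(carbon_price(y, anchor_price=anchor), 1)
            for y in (2020, 2040, 2050)}
    print(label, path)
```

Under these assumptions the 2050 prices are roughly 130 and 50 $/tCO2eq, which matches the price points used for the derived MAC curve.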
The indicators were originally chosen, and are adapted here, based on criteria set by Kriegler et al (2015):
• Identification of heterogeneity in model responses
• Diagnosis of relevant features for climate policy analysis
• Applicability to diverse models
• Accessibility and ease of use
Here, we add the following criteria:
• Standardization and comparability between diagnostic studies
• Precision/quantifiability
Based on these criteria, we derive a set of six indicators that describe model responses to climate policy. These indicators go beyond the work of Kriegler et al (2015), because we provide a standardized formulation, in each case leading to a single value that characterizes the model. We specify set rules (benchmark year, scenario used, socio-economic assumptions) to allow for quantitative comparability between studies. The main focus is on the year 2050 as it (a) is policy relevant and (b) provides a reasonable indication of model behavior throughout the century. For future use of the indicators, we define all indicators based on C80-gr5, using the value 30 years after the introduction of the tax (here 2020). While the focus is on 2050, we also show the 2100 results in the supplement (S3) to assess whether the 2100 numbers would lead to different conclusions.
Table 1 gives an overview of the key diagnostic indicators proposed and assessed in this study. Below, we briefly summarize the setup and rationale behind the indicators and, in particular, indicate differences with and additions to the Kriegler et al (2015) approach. The combination of indicators focuses on (a) the responsiveness of the model, (b) the type of mitigation, (c) the scale of the transformation of the energy system, and (d) mitigation costs as a function of the carbon price signal.
As in earlier diagnostic exercises, the indicators are based on global totals to assess the overall behavior related to global climate policy. A regional assessment would be possible in a follow-up study. All emission indicators are based on CO2 energy and industrial process (E&I) emissions. This allows all models to participate (the land-use system and non-CO2 emissions are modeled by about half of the models). Moreover, CO2 E&I makes up more than two thirds of all GHG emissions (Olivier and Peters 2020).
The RAI characterizes the emission reductions in a carbon tax scenario relative to the baseline. It can be considered the main indicator in the sense that it measures the overall response to a climate policy incentive and correlates with elements from the other indicators (demand and supply side emission reductions, transformation rate, FFRs and limited inertia). Hence, it can also be considered a 'mitigation sensitivity' indicator, analogous to the 'climate sensitivity' in climate models. In order to assess mitigation of the full suite of GHGs, we also provide a full Kyoto GHG analysis in the supplement (S4). In addition, an extra scenario (C30-gr5, with a two thirds lower tax) is used to visualize a stylized 'derived MAC curve' from the RAI, by connecting the projected relative abatement at ∼0, 50 and 130 $/tCO2.
The emission reduction type index (ERT) indicates the share of supply side measures (e.g. renewable energy) in bringing down emissions. 1 minus ERT shows the share of the RAI that can be attributed to reduced final energy demand. Values higher than 0.5 imply supply models (the most common case), values lower than 0.5 imply demand models. This indicator replaces the CoEI indicator from Kriegler et al (2015), i.e. carbon intensity (CI, as a fraction of CI in the baseline) over energy intensity, which did not strongly reflect reductions in energy intensity (e.g. a model with no energy efficiency improvement at all could still be classified as a demand-focused model).
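As a minimal sketch of the RAI and the derived MAC curve described above, the relative abatement can be computed per price level and the resulting points connected. The function name is hypothetical and the emission values are invented for illustration; they are not model results.

```python
def relative_abatement_index(baseline_emissions, policy_emissions):
    """Share of baseline CO2 E&I emissions abated in the carbon tax scenario."""
    if baseline_emissions <= 0:
        raise ValueError("baseline emissions must be positive")
    return (baseline_emissions - policy_emissions) / baseline_emissions


# Hypothetical 2050 emissions in GtCO2/yr, keyed by 2050 carbon price ($/tCO2):
# 0 = Base, 50 = C30-gr5, 130 = C80-gr5
baseline_2050 = 50.0
emissions_by_price = {0.0: 50.0, 50.0: 35.0, 130.0: 20.0}

# Points of the stylized 'derived MAC curve': (carbon price, relative abatement)
mac_points = [
    (price, relative_abatement_index(baseline_2050, emissions))
    for price, emissions in sorted(emissions_by_price.items())
]
print(mac_points)
```

The last point of the list is the RAI in the default scenario, i.e. the single value used to characterize the model.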
Two energy system transformation indicators have been assessed: FFR, which is new in this study, and the transformation index (TI, from Kriegler et al 2015). FFR is a simple, policy-relevant indicator that shows the relative reduction of fossil energy compared to the base year (2020). The FFR indicator was added to the transformation analysis since it represents a less abstract alternative to TI and relates directly to recent studies aimed at fossil fuel phase-out and renewable integration (in the results section, we also compare FFR to TI to understand what drives transitions in models). TI shows the extent of transformation in the energy system (2 = max, 0 = none). Note that in table 1, the shares of energy sources in the primary energy system (S) are based on the following aggregated energy sources: fossil, non-bioenergy renewables, bioenergy and nuclear. These are reported by all models, thus allowing for a complete comparison.
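The two transformation indicators can be illustrated with a short sketch. The FFR follows directly from the description above. For TI, the formula in table 1 is not reproduced in this text, so the sketch assumes the sum of absolute differences in primary energy shares between the policy and baseline scenarios, which is consistent with the stated range from 0 (none) to 2 (max). All names and numbers are hypothetical.

```python
SOURCES = ("fossil", "non_bio_renewables", "bioenergy", "nuclear")


def fossil_fuel_reduction(fossil_base_year, fossil_benchmark_year):
    """Relative reduction of fossil primary energy versus the base year (2020).

    Negative values indicate an increase in fossil energy use.
    """
    return 1.0 - fossil_benchmark_year / fossil_base_year


def shares(primary_energy):
    """Share of each aggregated source in total primary energy."""
    total = sum(primary_energy[s] for s in SOURCES)
    return {s: primary_energy[s] / total for s in SOURCES}


def transformation_index(baseline_energy, policy_energy):
    """Assumed TI: sum of absolute share differences (0 = none, 2 = max)."""
    base, pol = shares(baseline_energy), shares(policy_energy)
    return sum(abs(pol[s] - base[s]) for s in SOURCES)


# Hypothetical primary energy in EJ/yr (not model results)
baseline_2050 = {"fossil": 400.0, "non_bio_renewables": 50.0,
                 "bioenergy": 40.0, "nuclear": 10.0}
policy_2050 = {"fossil": 200.0, "non_bio_renewables": 120.0,
               "bioenergy": 60.0, "nuclear": 20.0}

ffr = fossil_fuel_reduction(fossil_base_year=320.0,
                            fossil_benchmark_year=policy_2050["fossil"])
ti = transformation_index(baseline_2050, policy_2050)
print(round(ffr, 3), round(ti, 3))
```

Because the shares in each scenario sum to one, a complete switch from one source to another yields the maximum value of 2 under this assumed formulation.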
In this study, we adopt a new indicator that describes the level of inertia (i.e. persistence of path dependency) in the models: IT. Path dependencies are of particular relevance for the energy system, due to long-lived capital stocks, technological learning, and other sources of inertia in the upscaling of new technologies, as well as behavioral inertia on the demand side. They are also highly policy-relevant in the context of delayed climate policy adoption and carbon lock-in, as analyzed in several scenario studies (Riahi et al 2015, Luderer et al 2018). The indicator captures inertia in response to the introduction of climate policy as a crucial characteristic of IAMs. It is based on a newly introduced diagnostic carbon price shock scenario to quantify the model representation of inertia. In our scenario set, the shock scenario follows baseline developments with zero carbon prices until 2040, followed by an instantaneous carbon price of 80 $/tCO2 in 2040, as in the default scenario, with an exponentially growing carbon price thereafter. For the shock scenarios, models with perfect foresight were instructed to disable the anticipation of future carbon pricing. The difference between the shock scenario and the default scenario can be measured in terms of the 2040 'emissions gap'. After 2040, the shock scenarios and corresponding early pricing scenarios can be expected to converge, since they are subject to the same carbon prices. However, during a transition period, the shock scenarios will continue to have higher emission levels than the corresponding early pricing scenarios, due to the system's inertia. The IT (in units of years) is defined as the ratio between the cumulative emission difference between the two scenarios after 2040 and the 'emissions gap' in the model year prior to 2040. For more information and a visualization, see supplement (S2).
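The IT definition above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the cumulative difference is approximated with the trapezoidal rule over the reported model years, and the emission trajectories are invented rather than taken from any model.

```python
def inertia_timescale(years, shock, default, shock_year=2040):
    """IT in years: cumulative post-shock emission difference divided by
    the emissions gap in the last model year before the price shock.

    Assumptions: `years` is sorted ascending, and the integral is
    approximated with the trapezoidal rule.
    """
    diff = [s - d for s, d in zip(shock, default)]

    # Emissions gap in the model year prior to the shock
    prior_index = max(i for i, y in enumerate(years) if y < shock_year)
    gap = diff[prior_index]
    if gap <= 0:
        raise ValueError("expected higher emissions in the shock scenario")

    # Cumulative difference from the shock year to the end of the horizon
    cumulative = 0.0
    for i in range(len(years) - 1):
        if years[i] >= shock_year:
            step = years[i + 1] - years[i]
            cumulative += 0.5 * (diff[i] + diff[i + 1]) * step
    return cumulative / gap


# Hypothetical CO2 E&I emissions in GtCO2/yr
years = [2030, 2040, 2050, 2060, 2070]
default = [30.0, 25.0, 15.0, 10.0, 5.0]   # early pricing (default scenario)
shock = [40.0, 37.0, 21.0, 12.0, 5.0]     # zero price until 2040, then shock

print(inertia_timescale(years, shock, default))
```

A model in which the two scenarios converge immediately after 2040 would give a cumulative difference close to zero and hence an IT close to zero.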
The CAV is a dimensionless measure of the economic implications of emissions abatement at a certain carbon price. It shows the ratio between the policy costs and marginal abatement costs (MACs). For partial equilibrium (PE) models, this can be seen as an indicator for the shape of the (implicit) MAC curve. The closer to 1 this indicator is, the more concave the MAC curve and the higher the projected policy costs. In other words, a low value indicates more mitigation potential at lower carbon prices. For general equilibrium (GE) models, macro-economic feedbacks are also factored in. Here, a value higher than 1 implies that these feedbacks are a dominant factor in the costs. We simplified the original indicator by looking at a benchmark year (2050) instead of discounting to a net present value. Note that for this indicator, we include all greenhouse gases represented by the models (this differs per model), since that corresponds with the model's projected policy costs. Reported policy cost metrics also differ per model type. We used consumption loss compared to the baseline for all GE models and area under the MAC curve for all PE models, except for PROMETHEUS and TIAM-Grantham, where the additional total energy system costs were applied. Although the metrics differ, they are comparable in the sense that they (at least) factor in first-order economic expenditures, which make up a considerable part of the policy costs.
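The link between the CAV and the shape of the MAC curve can be made concrete with a small sketch. It assumes that the marginal cost term is the carbon price multiplied by the abated emissions and that policy costs equal the area under the MAC curve (the PE case); the function names and the MAC points are hypothetical.

```python
def area_under_mac(mac_points):
    """Trapezoidal area under a MAC curve given as (abatement, price) points.

    With abatement in GtCO2 and price in $/tCO2, the result is in billion $.
    """
    area = 0.0
    for (a0, p0), (a1, p1) in zip(mac_points, mac_points[1:]):
        area += 0.5 * (p0 + p1) * (a1 - a0)
    return area


def cost_per_abatement_value(policy_cost, carbon_price, abatement):
    """Assumed CAV: policy cost relative to carbon price times abatement."""
    return policy_cost / (carbon_price * abatement)


# Two hypothetical MAC curves ending at 30 GtCO2 abated for 130 $/tCO2
linear_mac = [(0.0, 0.0), (30.0, 130.0)]
convex_mac = [(0.0, 0.0), (10.0, 20.0), (20.0, 50.0), (30.0, 130.0)]

cav_linear = cost_per_abatement_value(area_under_mac(linear_mac), 130.0, 30.0)
cav_convex = cost_per_abatement_value(area_under_mac(convex_mac), 130.0, 30.0)
print(round(cav_linear, 3), round(cav_convex, 3))
```

Under these assumptions a linear MAC curve gives a CAV of 0.5, while a curve with more abatement available at low prices gives a lower value, in line with the interpretation given above.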

Models
In total, 17 IAMs comprising 32 unique model versions participated in the diagnostic exercise (see table 2). The models have been broadly grouped based on their typology. One dimension in this typology is the coverage of the economy. Partial equilibrium (PE) models describe parts of the economy (e.g. the energy or agriculture sector) in detail, while having exogenous assumptions for the rest of the economy. PE models typically calculate climate mitigation policy costs as first-order sector costs, such as the area under the MAC curve for reducing greenhouse gases. General equilibrium (GE) models represent the full economy with varying levels of detail in the representation of sectors. GE models typically express policy costs in terms of consumption losses or GDP losses. The second dimension in the typology is the level of foresight in the solution function (for reaching climate targets), which is either high ('inter-temporal optimization' (ITO)) or low/myopic ('recursive dynamic' (RD)). RD models do not attempt to optimize costs over time, but use another set of rules for this. Dynamic recursive computable GEs (CGEs, see table 2) are a subgroup of GEs that follow such a myopic approach. These have a more detailed representation of sectors than ITO-GEs and derive costs based on deviation from market equilibria in individual years. The classification in table 2 is applied in all the analyses in this study, to determine any correlations between model type and behavior.

RAI
Figure 1 shows the RAI in 2050 for different price levels (essentially showing a stylized derived MAC curve per model, 1a) and the RAI per model in the default scenario (1b). Models generally show the same characteristic at different price levels in 1a. As a result, the RAI (1b) can be considered representative of the model response. When considering the latest model versions, high RAI models (i.e. more than one standard deviation from the mean) are IMAGE, REMIND-MAgPIE and AIM/Hub and low RAI models are POLES, IMACLIM and TIAM-Grantham. The high-low order in models generally persists at higher prices in 2100 (supplement S3) and when considering all greenhouse gases (supplement S4), implying that the 2050 CO2-based benchmark is a relatively robust indicator. There is some indication that GE-ITO and PE-RD models have a relatively high response, while GE-RD models are generally lower, but there are large variations in each group.
There are considerable differences in RAI between model versions (notably of GEM-E3, MESSAGE, FARM and POLES) that can be traced back to specific model developments. However, there seems to be no consistent trend across the models towards either higher or lower abatement in newer model versions. The higher emission reduction achieved in the latest version of GEM-E3 is a result of improvements in the representation of the energy system, especially in transport and in power generation. The new model version also captures the recent cost reduction of low-carbon technologies (e.g. photovoltaics (PV), wind, electric vehicles), thus enabling accelerated diffusion of these options. The lower abatement in the latest MESSAGE-GLOBIOM version results from model calibration (lowering the baseline emissions), reduced sustainable bioenergy potential and more pessimistic techno-economic assumptions on carbon capture and storage (CCS) deployment, despite more optimistic assumptions on non-bio renewables. Higher abatement in FARM results from more favorable CCS assumptions, both for fossil-electricity and bio-electricity. Lower abatement in the most recent POLES version is predominantly caused by slower deployment of CCS in power, industry and energy transformation (hydrogen, biofuels production). This outweighs several developments that increased abatement potential (inclusion of direct air capture, e-fuels and a more detailed representation of mitigation potential in final demand sectors (buildings, aviation, maritime and road transport)). The low abatement potential in IMACLIM results from a persistence in fossil fuel use (see section 3.3).

Emission reduction strategy
Figure 2 shows the ERT in 2050 (2b) and the underlying reductions in carbon intensity (CI) and energy intensity (EI) in the default scenario (2a). Note that all models can be considered supply models, i.e. emission reductions are realized more via changes in energy supply (e.g. renewable energy) than in energy demand. This is indicated by all models being located to the right of the x = y line in 2a and by the >0.5 ERT values in 2b. Compared to the model mean, TIAM-UCL, GCAM, DNE21+ and IMACLIM can be considered high ERT models (strongly preferring supply side options) and POLES, MESSAGE-GLOBIOM and WITCH low ERT models (more demand-side focused). There is no apparent effect of model type on ERT.
At higher carbon prices (in 2100), supply side mitigation becomes more dominant for all models, indicated in 2a by the strong reduction in CI in 2100 compared to 2050, and by higher ERT values in 2100 (see supplement S3). In IMAGE, higher prices also invoke a strong demand response, which is smaller for other models. In the case of REMIND, high prices even lead to an increase in energy intensity, caused by an increased energy demand for direct air capture and storage of CO2 (DACCS, included in the last two versions) and to a lesser extent by higher electricity demand. The effect is magnified by the exponentially growing carbon price and is less common in less extreme REMIND projections.
POLES and MESSAGE-GLOBIOM both show a large decrease in ERT compared to earlier versions. For both models, this is mainly caused by a decrease in supply side mitigation options (see description for RAI indicator). However, both models have a larger demand response in the latest versions.

Energy system transformation
The energy system transformation assessment in this study is based on two indicators: fossil fuel reduction (FFR, see figure 3(b)) and transformation index (TI, see supplement S5). Here we describe FFR and indicate large differences with TI, which signify a different level of transformation in the non-fossil parts of the primary energy system (renewables, bioenergy, nuclear). There is a large spread in FFR, varying from a 43% reduction to an increase of 23%. Figure 3(a) (primary energy decomposition) shows that for most models, a considerable share of the remaining fossil energy consists of fossil energy without CCS. GE-ITO models generally seem to favor relatively high FFR. High-FFR models are REMIND-MAgPIE, iPETS, MESSAGE-GLOBIOM, TIAM-UCL, TIAM-Grantham and WITCH; low-FFR models are IMACLIM, DNE21+, COFFEE and FARM. There is a high correlation between FFR and TI, as would be expected. A notable exception is COFFEE, which has a medium TI, due to large increases in bioenergy and to a lesser extent non-bio renewables. The high-low order of FFR and TI in models is very [...]
Several large differences in model versions can be explained by model developments. The recent version of MESSAGE-GLOBIOM has more pessimistic assumptions about CCS, leading to a stronger carbon price induced reduction in fossil fuel consumption. The increase of FFR in TIAM-UCL is caused by a reduction in capital expenditure for solar and wind and reduced growth constraints for renewables and CCS. High FFR in REMIND-MAgPIE is largely the result of a strong natural gas phase-out and high renewables integration in the power sector. Similarly, in WITCH, it is due to updated renewables learning rates and more costly assumptions about CCS storage. In contrast, COFFEE projects a large increase of natural gas, implemented as a mitigation option in the industry sector, leading to a small increase of fossil fuels (= negative FFR). In DNE21+, fossil fuel use persists due to an increase in oil demand and cost-effective mitigation via gas power generation, including CCS. Fossil fuel persistence is largest in IMACLIM, due to large capital inertia and myopic expectations of future carbon prices.

Inertia
Figure 4 shows the IT indicator for the models that took part in the inertia experiment (with 4b = IT and 4a = underlying data). Note that models with a 2050 time horizon are excluded, since IT is based on a full-century integral. There are large differences across models, with most showing IT in the 10-20 year range. iPETS is an extreme low-inertia case, with instantaneous convergence of the price shock and default scenario. TIAM-UCL and POLES also show a relatively low IT. WITCH and IMACLIM indicate the highest inertia. There is no apparent effect of model type on IT. Note that a sensitivity analysis in the supplement (S6) shows that the results are largely similar when the IT is based on a lower carbon price, implying that the default approach yields robust results.
Several large differences between model versions are the result of model developments. The more recent REMIND-MAgPIE versions favor electricity production from renewables, making it easier to reduce emissions in a short timeframe. Similarly, TIAM-UCL has reduced growth constraints for renewables and CCS, leading to a strong decrease in the IT. For the WITCH model, the current version has seen several updates based on the latest insights: an update of CCS storage potential (leading to less reliance on CCS in the second half of the century), renewables learning rates, short-term fossil fuel demand for India and China, the introduction of time-varying elasticities of substitution and a reduction of the social discount rate to 2%-3%. This results in more stickiness of investments in the short term.

Policy costs
Figure 5 shows the policy CAV indicator (5b) and a plot of the policy costs (in % of GDP) versus the relative abatement in 2050 and 2100 (5a). The 2050 CAV is relatively comparable for most models, being in the 0.3-0.5 range, implying that the projected policy costs are around 30%-50% of the marginal costs. There is no clear trend towards either more costly or less costly mitigation in recent model versions. By design, GE models can produce higher CAV values, due to the inclusion of macro-economic feedbacks. High-CAV models are IMACLIM and, to a lesser extent, AIM/Hub. In the case of IMACLIM, this results from assumed market imperfections in combination with imperfect foresight, leading to substantial GDP losses in a mitigation scenario. Notable low-CAV models are FARM and MESSAGE-GLOBIOM. The low-high model order in 2050 is almost identical to the order in 2100 (supplement S3), implying that the 2050 CAV provides a robust representation of the mitigation costs, also at high prices. Note, however, that the actual projected costs in a budget scenario (e.g. a 2 degree scenario) also depend on the assumed mitigation potential and carbon price.
Large CAV differences between model versions can be explained by the following model developments. The considerably lower CAV in GEM-E3 is mainly due to capturing recent cost reductions of low-carbon technologies (e.g. PV, wind, EV, batteries) as well as recalibration, which captures new trends of lower energy and carbon intensities. In the latest version of AIM/Hub, which is an exception to the similar 2050-2100 behavior (showing a strong relative decrease in CAV in 2100), policy costs in 2050 are projected to be relatively high compared to 2100 due to limited availability of CCS (main factor) and bioenergy (the latter due to high population and lower yields).
Table 3 summarizes the classification of the models. Each model is characterized by a specific combination of six values that highlight its general responsiveness, the type of response and the responsiveness of the overall costs indicator.
Table 3. Overview of indicators & classification. All indicators are based on 2050 results (except IT). 2100 results are shown in the supplement. Indicator acronyms from left to right: RAI, emission reduction type, FFR, TI, IT, CAV. Models are clustered based on type (general or partial equilibrium, recursive dynamic or intertemporal solution approach). Classification can be read as: (1) response based on RAI; (2) emission reductions relatively high via energy demand (SD), supply (S), relatively strong supply (S+) based on ERT; (3) policy CAV. High or low in the classification implies more than one standard deviation from the mean. Grey is no data. Green/yellow highlight indicates a higher/lower value in a newer model version.
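The high/low rule from the table 3 caption (more than one standard deviation from the mean) can be sketched as below. The model names and indicator values are placeholders, and the use of the sample standard deviation is an assumption, since the text does not specify which variant was applied.

```python
from statistics import mean, stdev


def classify(indicator_by_model):
    """Label each model 'high', 'low' or 'medium' for one indicator.

    Assumption: the threshold is one sample standard deviation
    from the cross-model mean.
    """
    values = list(indicator_by_model.values())
    mu, sigma = mean(values), stdev(values)
    labels = {}
    for model, value in indicator_by_model.items():
        if value > mu + sigma:
            labels[model] = "high"
        elif value < mu - sigma:
            labels[model] = "low"
        else:
            labels[model] = "medium"
    return labels


# Placeholder RAI values for five hypothetical models
rai_2050 = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.5,
            "model_d": 0.5, "model_e": 0.8}
print(classify(rai_2050))
```

Applying the same function to each of the six indicators would yield one label per indicator and model, which is the kind of combination that the classification table reports.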

Discussion & conclusions
Stylized diagnostic runs prove to be a useful tool to classify models (as in earlier studies) and to monitor model evolution, as we have shown here. There is a high demand for an approach to systematically and routinely assess model behavior in a standardized way. The method proposed here, with a focus on one benchmark year and standard scenario, allows for comparability between diagnostic studies over time with quantitative metrics. This study shows that the present +30 years benchmark provides a robust representation of model behavior over the century. We further showed that comparing different model versions based on the same experimental setup helps to understand model behavior, since changes can be traced back to specific model developments.
The focus here has been on the key indicators. However, the approach can be extended to secondary indicators that could provide sectoral or regional diagnostics and cover non-CO2 greenhouse gases. In addition to providing quantitative estimates of different aspects of model behavior, several general conclusions can be drawn from this study's results:
• There is a considerable spread in outcomes for all indicators. This implies that the choice of a model in a study matters and that it is crucial to understand these differences.
• There is, however, no direct relationship between model type and model behavior (with some exception for GE models with intertemporal optimization, which seem slightly more responsive).
• There does not seem to be a distinct trend in how models change over time with respect to the analyzed key indicators.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.